Research & publications

I am interested in the human language system, the foundations of language technology, human-AI interaction, and the societal impact of language technology.

My current research line focuses on AI’s linguistic fingerprint and AI-associated language change: how model language behaviour diverges from human language behaviour, and how the two interact. This work connects to broader questions in model alignment, including how models come to reflect, amplify, or depart from human expectations, and what their language use reveals about their inner workings.

Below is a selection of recent projects, grouped by theme.

A global AI fingerprint

ChatGPT leaves a cross-lingual fingerprint: it overuses the same concepts in dozens of the world’s languages. Exactly those overused words have risen markedly in human usage since ChatGPT’s release in 2022.

AI-Associated Lexical Shifts Across 34 Languages: Cross-Lingual Convergence and Diachronic Uptake in News Writing

Juzek, T. S. · preprint, 2026

A cross-lingual “AI register”: emphasize-type verbs surface among AI-overused words in 24 of 34 languages; AI-preferred words rise +15.1% in post-ChatGPT news vs −4.5% for matched baselines.

arXiv GitHub

@misc{juzek-2026-34languages,
  title         = {{AI}-Associated Lexical Shifts Across 34 Languages:
                   Cross-Lingual Convergence and Diachronic Uptake in News Writing},
  author        = {Juzek, Thomas Stephan},
  year          = {2026},
  eprint        = {2605.25358},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2605.25358}
}

The same trend appears in speech. Testing (semi-)spontaneous spoken English, we find that AI-overused words are markedly more frequent after ChatGPT’s release than before it.

Model Misalignment and Language Change: Traces of AI-Associated Language in Unscripted Spoken English

Anderson, B., Galpin, R. & Juzek, T. S. · AIES 2025

The first peer-reviewed evidence of AI-associated vocabulary appearing in unscripted spoken English.

Proceedings arXiv GitHub

@inproceedings{anderson-etal-2025-misalignment,
  title     = {Model Misalignment and Language Change: Traces of
               {AI}-Associated Language in Unscripted Spoken {English}},
  author    = {Anderson, Bryce and Galpin, R. and Juzek, Thomas Stephan},
  booktitle = {Proceedings of the AAAI/ACM Conference on AI, Ethics,
               and Society (AIES 2025)},
  volume    = {8},
  number    = {1},
  pages     = {179--191},
  year      = {2025},
  doi       = {10.1609/aies.v8i1.36540},
  url       = {https://doi.org/10.1609/aies.v8i1.36540}
}

The why

But why does ChatGPT overuse these words? For scientific English, the literature has documented striking lexical shifts and conjectured that AI is the cause. We strengthen that link (the words that have been spiking are also the ones AI overuses), and then examine the mechanisms, finding that reinforcement learning from human feedback (RLHF) plays an important role. Two papers explore this.

Why Does ChatGPT “Delve” So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models

Juzek, T. S. & Ward, Z. B. · COLING 2025

Why LLMs overuse words like delve, underscore, and intricate: identifying focal words whose rise in scientific abstracts tracks LLM use.

Proceedings arXiv GitHub

@inproceedings{juzek-ward-2025-delve,
  title     = {Why Does {ChatGPT} ``Delve'' So Much? Exploring the Sources
               of Lexical Overrepresentation in Large Language Models},
  author    = {Juzek, Thomas Stephan and Ward, Zina B.},
  booktitle = {Proceedings of the 31st International Conference on
               Computational Linguistics (COLING 2025)},
  year      = {2025},
  url       = {https://aclanthology.org/2025.coling-main.426/}
}

Word Overuse and Alignment in Large Language Models: The Influence of Learning from Human Feedback

Juzek, T. S. & Ward, Z. B. · BIAS 2025 Workshop (ECML PKDD)

Links AI lexical overuse to learning from human feedback: annotators systematically prefer text containing certain words, and preference training amplifies them.

arXiv GitHub

@inproceedings{juzek-ward-2025-overuse,
  title     = {Word Overuse and Alignment in Large Language Models:
               The Influence of Learning from Human Feedback},
  author    = {Juzek, Thomas Stephan and Ward, Zina B.},
  booktitle = {Proceedings of the BIAS Workshop, ECML PKDD 2025},
  year      = {2025},
  url       = {https://arxiv.org/abs/2508.01930}
}

Model diagnostics

Much of the literature on AI and language still depends on hand-curated word lists at some step. We automate the diagnostics for model language behaviour and alignment, with no manual curation required.

Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

Juzek, T. S., Ming, X. & Hernandez, J. A. · LREC 2026, 6116–6131

A curation-free pipeline (Lexical Alignment Score plus a triangulated preference shift) that derives AI-overused word inventories without manual lists: the foundational diagnostic for this research line.

Proceedings GitHub

@inproceedings{juzek-etal-2026-fully,
  title     = {Fully Automated Identification of Lexical Alignment and
               Preference-Stage Shifts in Large Language Models},
  author    = {Juzek, Thomas Stephan and Ming, Xiaoyang and Hernandez, Jose A.},
  booktitle = {Proceedings of the Fifteenth Language Resources and
               Evaluation Conference (LREC 2026)},
  pages     = {6116--6131},
  year      = {2026},
  doi       = {10.63317/4ut7ammh7z3h},
  url       = {https://lrec.elra.info/lrec2026-main-484}
}

For the complete, up-to-date list with citation counts (including earlier work on experimental syntax, acceptability, and corpora), see Google Scholar.