Research & publications
I am interested in the human language system, the foundations of language
technology, human-AI interaction, and the societal impact of language technology.
My current research line focuses on AI’s linguistic fingerprint and
AI-associated language change: how model language behaviour diverges from
human language behaviour, and how the two interact. This work connects to broader questions in
model alignment, including how models come to reflect, amplify, or depart from human
expectations, and what their language use reveals about their inner workings.
Below is a selection of recent projects, grouped by theme.
A global AI fingerprint
ChatGPT leaves a kind of cross-lingual fingerprint: it overuses the same
concepts in dozens of the world’s languages. Exactly those overused words have also risen
markedly in human news writing since ChatGPT’s release in 2022.
AI-Associated Lexical Shifts Across 34 Languages: Cross-Lingual Convergence and Diachronic Uptake in News Writing
Juzek, T. S. · preprint, 2026
A cross-lingual “AI register”: emphasize-type verbs surface among AI-overused words in 24 of 34 languages; AI-preferred words rise +15.1% in post-ChatGPT news vs −4.5% for matched baselines.
@misc{juzek-2026-34languages,
title = {AI-Associated Lexical Shifts Across 34 Languages:
Cross-Lingual Convergence and Diachronic Uptake in News Writing},
author = {Juzek, Thomas Stephan},
year = {2026},
eprint = {2605.25358},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2605.25358}
}
The same trend appears in speech. In (semi-)spontaneous spoken
English, we find that AI-overused words are markedly more frequent after
ChatGPT’s release than before it.
Model Misalignment and Language Change: Traces of AI-Associated Language in Unscripted Spoken English
Anderson, B., Galpin, R. & Juzek, T. S. · AIES 2025
The first peer-reviewed evidence of AI-associated vocabulary appearing in unscripted spoken English.
@inproceedings{anderson-etal-2025-misalignment,
title = {Model Misalignment and Language Change: Traces of
AI-Associated Language in Unscripted Spoken English},
author = {Anderson, Bryce and Galpin, R. and Juzek, Thomas Stephan},
booktitle = {Proceedings of the AAAI/ACM Conference on AI, Ethics,
and Society (AIES 2025)},
volume = {8},
number = {1},
pages = {179--191},
year = {2025},
doi = {10.1609/aies.v8i1.36540},
url = {https://doi.org/10.1609/aies.v8i1.36540}
}
The why
But why does ChatGPT overuse these words? For scientific English, the literature has
documented striking lexical shifts and conjectured that AI is the cause. We strengthen that
link (the words that have been spiking are also the ones AI overuses), and then examine the
mechanisms, finding that learning from human preferences (RLHF) plays an important role. Two
papers explore this.
Why Does ChatGPT “Delve” So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models
Juzek, T. S. & Ward, Z. B. · COLING 2025
Why LLMs overuse words like delve, underscore, and intricate: identifying focal words whose rise in scientific abstracts tracks LLM use.
@inproceedings{juzek-ward-2025-delve,
title = {Why Does {ChatGPT} ``Delve'' So Much? Exploring the Sources
of Lexical Overrepresentation in Large Language Models},
author = {Juzek, Thomas Stephan and Ward, Zina B.},
booktitle = {Proceedings of the 31st International Conference on
Computational Linguistics (COLING 2025)},
year = {2025},
url = {https://aclanthology.org/2025.coling-main.426/}
}
Word Overuse and Alignment in Large Language Models: The Influence of Learning from Human Feedback
Juzek, T. S. & Ward, Z. B. · BIAS 2025 Workshop (ECML PKDD)
Links AI lexical overuse to learning from human feedback: annotators systematically prefer text containing certain words, and preference training amplifies them.
@inproceedings{juzek-ward-2025-overuse,
title = {Word Overuse and Alignment in Large Language Models:
The Influence of Learning from Human Feedback},
author = {Juzek, Thomas Stephan and Ward, Zina B.},
booktitle = {Proceedings of the BIAS Workshop, ECML PKDD 2025},
year = {2025},
url = {https://arxiv.org/abs/2508.01930}
}
Model diagnostics
Much of the literature on AI and language still depends on hand-curated word lists at some
step. We automate the diagnostics for model language behaviour and alignment, with no manual
curation required.
Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models
Juzek, T. S., Ming, X. & Hernandez, J. A. · LREC 2026, pp. 6116–6131
A curation-free pipeline (Lexical Alignment Score plus a triangulated preference shift) that derives AI-overused word inventories without manual lists: the foundational diagnostic for this research line.
@inproceedings{juzek-etal-2026-fully,
title = {Fully Automated Identification of Lexical Alignment and
Preference-Stage Shifts in Large Language Models},
author = {Juzek, Thomas Stephan and Ming, Xiaoyang and Hernandez, Jose A.},
booktitle = {Proceedings of the Fifteenth Language Resources and
Evaluation Conference (LREC 2026)},
pages = {6116--6131},
year = {2026},
doi = {10.63317/4ut7ammh7z3h},
url = {https://lrec.elra.info/lrec2026-main-484}
}
For the complete, up-to-date list with citation counts (including earlier
work on experimental syntax, acceptability, and corpora), see
Google Scholar.