Thomas Stephan Juzek

Research & publications

I am interested in the human language system, the foundations of language technology, human-AI interaction, and the societal impact of language technology.

My current research line focuses on AI’s linguistic fingerprint and AI-associated language change: how model language behaviour diverges from human language behaviour, and how the two interact. This work connects to broader questions in model alignment, including how models come to reflect, amplify, or depart from human expectations, and what their language use reveals about their inner workings.

Below is a selection of recent projects, grouped by theme.

A global AI fingerprint

ChatGPT leaves a cross-lingual fingerprint: it overuses the same concepts in dozens of the world’s languages. Exactly those overused words have risen markedly in human usage since ChatGPT’s release in 2022.

AI-Associated Lexical Shifts Across 34 Languages: Cross-Lingual Convergence and Diachronic Uptake in News Writing

Juzek, T. S. · preprint, 2026

A cross-lingual “AI register”: emphasize-type verbs surface among AI-overused words in 24 of 34 languages; AI-preferred words rise by 15.1% in post-ChatGPT news, versus a 4.5% decline for matched baselines.

The same trend appears in speech. Testing (semi-)spontaneous spoken English, we find that AI-overused words have increased markedly from before to after ChatGPT’s release.

Model Misalignment and Language Change: Traces of AI-Associated Language in Unscripted Spoken English

Anderson, B., Galpin, R. & Juzek, T. S. · AIES 2025

The first peer-reviewed evidence of AI-associated vocabulary appearing in unscripted spoken English.

The why

But why does ChatGPT overuse these words? For scientific English, the literature has documented striking lexical shifts and conjectured that AI is the cause. We strengthen that link by showing that the words that have been spiking are also the ones AI overuses. We then examine the mechanisms and find that learning from human feedback (RLHF) plays an important role. Two papers explore this.

Why Does ChatGPT “Delve” So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models

Juzek, T. S. & Ward, Z. B. · COLING 2025

Why LLMs overuse words like delve, underscore, and intricate: identifying focal words whose rise in scientific abstracts tracks LLM use.

Word Overuse and Alignment in Large Language Models: The Influence of Learning from Human Feedback

Juzek, T. S. & Ward, Z. B. · BIAS 2025 Workshop (ECML PKDD)

Links AI lexical overuse to learning from human feedback: annotators systematically prefer text containing certain words, and preference training amplifies them.

Model diagnostics

Much of the literature on AI and language still depends on hand-curated word lists at some step. We automate the diagnostics for model language behaviour and alignment, with no manual curation required.

Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

Juzek, T. S., Ming, X. & Hernandez, J. A. · LREC 2026, 6116–6131

A curation-free pipeline (Lexical Alignment Score plus a triangulated preference shift) that derives AI-overused word inventories without manual lists: the foundational diagnostic for this research line.

For the complete, up-to-date list with citation counts (including earlier work on experimental syntax, acceptability, and corpora), see Google Scholar.