Why AI Uses Em Dashes
AI models use em dashes constantly. The reason traces to Victorian novels, destroyed books, and a training data arms race. Here is what actually happened.
Most people who spotted the pattern did not know what to call it. They just knew something felt off. Then in April 2025, Rolling Stone named it: the “ChatGPT hyphen.”
Here is what that looks like in practice. A typical AI-written sentence might read: “The report was finished on time — despite the delays — and the client was pleased.” A human writer in 2025 would more likely write: “The report was finished on time, even with the delays, and the client was pleased.” Same meaning. Different feel. The dash version has a rhythm that people now recognise instantly, even if they cannot explain why.
Within weeks of Rolling Stone’s piece, the em dash was everywhere: Reddit threads, campus plagiarism hearings, brand post-mortems. One caption. One deleted post. One very old mark that had somehow become a tell.
The reason AI uses em dashes so often has nothing to do with how machines think. It has to do with which books they read, and what some companies did to get them. To understand that, you need to know where the mark came from, because the history explains everything that followed.
According to a February 2026 episode of 99% Invisible, the em dash traces to Boncompagno da Signa, an Italian scholar who worked in the late 12th and early 13th centuries. He wanted a mark that could hold a thought mid-sentence. A pause that was not quite a stop. He called it the virgula plana. It spread through old manuscripts, then into print. By the 1600s, editions of Shakespeare used long dashes to show when a character was cut off mid-speech. By the Victorian era, the mark was everywhere.
Charlotte Brontë used one every 90 words in Jane Eyre. Herman Melville, one every 129 words in Moby-Dick. Per the same 99% Invisible episode, Emily Dickinson filled roughly 1,800 poems with thousands of dashes. For her, the mark was not just punctuation. It was how she captured the way thought moves fast, then stops, then shifts.
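A "one every 90 words" figure is simple to reproduce for any text you have on hand. The sketch below is a minimal illustration, not the method 99% Invisible or anyone else used: the function name `words_per_em_dash` is made up for this example, words are counted as whitespace-separated tokens, and only the true em dash character (U+2014) is counted, so double hyphens in older plain-text editions would be missed.

```python
import re
from typing import Optional

EM_DASH = "\u2014"  # the em dash character, written as an escape


def words_per_em_dash(text: str) -> Optional[float]:
    """Average number of words per em dash, or None if the text has none.

    Hypothetical helper for illustration only.
    """
    dashes = text.count(EM_DASH)
    if dashes == 0:
        return None
    # Swap dashes for spaces so words joined by an unspaced dash count separately.
    words = re.findall(r"\S+", text.replace(EM_DASH, " "))
    return len(words) / dashes


# The example sentence from earlier in this article: 14 words, 2 em dashes.
sample = (
    "The report was finished on time \u2014 despite the delays \u2014 "
    "and the client was pleased."
)
print(words_per_em_dash(sample))
```

Run on the full text of a novel, the same function gives a rough words-per-dash figure. Results will vary with the edition and with how its dashes were encoded.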
Then the typewriter arrived and made the em dash physically awkward to produce. Writers switched to two hyphens side by side. Word processors later fixed this with autocorrect. But decades of simpler typing habits had already made the em dash rare in everyday writing. Most people stopped seeing it. That is exactly why it looks so strange when AI produces it constantly today.
GPT-3.5, launched in late 2022, was trained mostly on internet text. Blog posts, forums, scraped web pages. It did not use em dashes much. Then the race for better training data began.
An independent researcher named Sean Goedecke published an analysis with a clear finding: books from the late 1800s and early 1900s use about 30 percent more em dashes than modern writing does. His argument was simple. As AI labs moved toward scanning older, higher-quality books, they picked up the punctuation habits of writers from that era. The numbers back it up. GPT-4o, launched in 2024, used about 10 times more em dashes than GPT-3.5, per Goedecke’s analysis. Between those two models, the training data had changed. Old books had entered the picture in a serious way.
Which raises an obvious question: where exactly did all those books come from?
In January 2026, The Washington Post reported on court documents from a copyright case against Anthropic, the company behind the AI assistant Claude. The documents revealed a project the company had kept quiet. They called it “Project Panama.” The goal, written in the company’s own records, was to “destructively scan all the books in the world.” Anthropic bought used books by the tens of thousands. A machine cut off the spines. The pages were scanned into digital files. A former Google Books executive was brought in to run it. One of Anthropic’s founders had written in a 2023 internal note that training on books would teach the model to “write well,” instead of copying what he called “low-quality internet speech.”
Per NPR’s September 2025 reporting, a federal judge found that Anthropic had also downloaded over 7 million digitised books it knew had been pirated, from sites including Library Genesis. The company later settled a copyright case brought by book authors for 1.5 billion dollars, roughly 3,000 dollars per book.
The books went in. The writing style came out. The em dash came with it.
But the internet’s broad accusation gets harder to hold once you look at the actual numbers. In June 2025, Plagiarism Today tested six AI tools with the same prompt. ChatGPT, Microsoft Copilot, and DeepSeek used em dashes throughout. Claude used two. Gemini and Meta AI used none. A March 2026 research paper on arXiv by E. M. Freeburg tested 12 models from five companies. When the models were told to avoid formatting, bullet points and headers disappeared. The em dash stayed in most of them anyway. Meta’s Llama models were the only ones that produced zero em dashes, even without being told to.
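A comparison like this comes down to sending one prompt to several tools and counting the dashes in each reply. The sketch below shows only the counting step, under loud assumptions: the tool names and response strings are invented placeholders, not real model output, and `em_dash_tally` is a hypothetical helper, not the procedure Plagiarism Today or Freeburg used.

```python
from typing import Dict

EM_DASH = "\u2014"  # the em dash character, written as an escape


def em_dash_tally(outputs: Dict[str, str]) -> Dict[str, int]:
    """Count em dashes in each labelled response (hypothetical helper)."""
    return {name: text.count(EM_DASH) for name, text in outputs.items()}


# Placeholder responses, invented for this example. They are NOT real model output.
outputs = {
    "tool_a": "Fast \u2014 but costly \u2014 and fragile.",
    "tool_b": "Fast, but costly, and fragile.",
}

tally = em_dash_tally(outputs)
# List the tools from most to fewest em dashes.
for name, count in sorted(tally.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {count}")
```

Raw counts only mean something when the responses are of similar length. For longer or uneven samples, a per-word rate is the fairer comparison.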
Freeburg’s paper adds another angle. The em dash may be what the paper calls “markdown leaking into prose.” When a model is trained heavily on structured text, it picks up a habit of organising and separating ideas. Tell it to drop the obvious formatting and most of it goes. But the em dash, which sits right at the line between punctuation and structure, tends to survive. Both explanations, the old books theory and the markdown theory, may be true at the same time.
None of this would matter much if the only cost were a deleted brand caption. But real people have been hurt by the confusion. The Loyola Phoenix reported in October 2025 that a student submitted a paper she had written herself. It was flagged as AI-generated. The reason, per the report: em dashes. She had not used an AI. She had simply learned to write in a way that now looked like a machine had done it.
Daphne Ippolito, a senior scientist at Google Brain, told MIT Technology Review that the em dash is not a solid way to detect AI writing. Better signs, she said, include the absence of typos and odd patterns in word choice. The em dash problem is really a different problem: people looking for easy shortcuts instead of actually reading carefully.
On November 14, 2025, OpenAI CEO Sam Altman posted on X to say ChatGPT would now follow user instructions to avoid em dashes. We tested it on May 6, 2026. After adding a simple instruction in ChatGPT’s custom settings telling it not to use em dashes, the mark disappeared across multiple prompts. The fix works, at least for now. But it is not automatic. New users still get em dashes unless they know to ask for something different. The underlying training has not changed.
AI started sounding too literary, not too robotic. Labs spent hundreds of millions of dollars on 19th-century prose to make their models sound smarter. That old prose carried habits that most people stopped using over a century ago. A cluster of em dashes is a reason to look more closely at a piece of writing. It is not proof of anything.
The student whose paper was flagged already knew that. So did the writers who spent all of 2025 defending a punctuation mark they had used for years before any AI model existed.