wordfreq is an open-source project that allows anyone to look up the frequency of word usage across 40 languages. Or it was: project owner Robyn Speer has announced that she is sunsetting the project. It previously used data from Google Books Ngrams, Wikipedia, and, in particular, a web text corpus called OSCAR.
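For readers who haven't used it, wordfreq's Python API is small. The functions below are the library's real, documented entry points; the values in the comments are approximate and depend on which version of the bundled data you have installed.

```python
# pip install wordfreq
from wordfreq import word_frequency, zipf_frequency

# Proportion of English tokens that are "the" (roughly 0.05).
print(word_frequency("the", "en"))

# Zipf scale: log10 of occurrences per billion words, so common
# words score around 6-7 and rarer words closer to 2-3.
print(zipf_frequency("delve", "en"))
```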
Speer’s contention is that OSCAR is poisoned:
The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies.
Sure, there was spam in the wordfreq data sources, but it was manageable and often identifiable. Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere.
As one example, Philip Shapira reports that ChatGPT (OpenAI's popular brand of generative language model circa 2024) is obsessed with the word "delve" in a way that people never have been, and caused its overall frequency to increase by an order of magnitude.
Ironically, Speer could be wrong: perhaps the skew is not so great, and the OSCAR data remains meaningful. The real problem is that we have no way of knowing. There's no tag on AI-generated "slop" that marks it as such, so there is no direct way to gauge how much it skews the data. The degree of the effect can be approximated indirectly, by sampling and the like, but not in a way that would let you de-gunk the overall data.
(You could always try to train an AI to de-gunk the data by removing suspected AI-generated text, but that starts an infinite regress that seems unlikely to lead anywhere good.)
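To make "approximated by sampling" concrete, here is a minimal sketch, with hypothetical names, of the most basic version: label a random sample of documents and put a confidence interval around the AI-generated share. `looks_generated` stands in for whatever judgment you trust, whether a human rater, a stylometric heuristic, or a detector model.

```python
import math
import random

def estimate_slop_fraction(corpus, looks_generated, n=400, z=1.96):
    """Estimate the fraction of AI-generated documents in `corpus`
    by labeling a random sample; returns the point estimate and a
    normal-approximation confidence interval."""
    sample = random.sample(corpus, min(n, len(corpus)))
    p = sum(1 for doc in sample if looks_generated(doc)) / len(sample)
    margin = z * math.sqrt(p * (1 - p) / len(sample))
    return p, (max(0.0, p - margin), min(1.0, p + margin))
```

Even a tight interval here only tells you how much slop the corpus contains, not which documents it lives in, which is exactly the de-gunking problem.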
The “delve” example is one sort of approximation. Shapira’s analysis doesn’t leave much room for doubt:
By the end of March 2024, the percentage of “delve” papers for 2024 was already far higher than that of 2023, signaling a continual increase in delve-ness. So here we have a datapoint indicating some real pressure on “delve” usage from LLMs. What are we to do about it? It is useful as an indicator for identifying specific papers as LLM-addled, but for large-scale analysis, it doesn’t give us a way forward.
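For concreteness, the per-year tally behind a percentage like that could be computed along these lines. This is a hypothetical reconstruction of the shape of such an analysis, not Shapira's actual method, and the `(year, abstract)` input format is assumed.

```python
import re
from collections import Counter

# Match "delve" and its inflections as whole words.
DELVE = re.compile(r"\bdelv(?:e|es|ed|ing)\b", re.IGNORECASE)

def delve_share_by_year(papers):
    """papers: iterable of (year, abstract_text) pairs.
    Returns {year: percent of abstracts using a form of "delve"}."""
    totals, hits = Counter(), Counter()
    for year, text in papers:
        totals[year] += 1
        if DELVE.search(text):
            hits[year] += 1
    return {y: 100.0 * hits[y] / totals[y] for y in sorted(totals)}
```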
Even talking about it—as in this post—only makes the problem worse. I didn’t use ChatGPT to write this post, but from all the “delve”-ing going on, you’d be forgiven for thinking I did.
The sunsetting of wordfreq is one recognition of the unavoidable fact that by building out these AI homunculi, which contribute reams of unthought content to the collective online store of data, we are changing the nature of what it is to be human. It's not just that AIs tend to standardize on what they deem to be the most "appropriate" modes of expression, trained so that the majority tends to rule the day, but also that their unpredictable peculiarities, like the fixation on "delve," feed back into how humans themselves write.
My position on AI was and remains that the current bleeding-edge LLM AIs do nothing that can be thought of as thinking and are no more than massive, advanced versions of Clever Hans, responding to prompts and cues and processing them in mechanical (though stochastic) fashion. Their achievement is in showing just how convincing and lifelike such technology can be, but we shouldn't conflate their usefulness (because they are useful) with actual thinking, or overestimate their capabilities. However much text it produces, there is no way in which ChatGPT knows what a prime number is, knows what a word means, or knows how you feel. It doesn't know or think anything in any traditional sense of those words, nor does it dream or hallucinate or do anything mental.
What AIs do manage, however, is to affect the space of human thinking, and quite drastically. Any large-scale change to the human information environment does it: the printing press did it, radio and television did it, the internet did it. AI’s new wrinkle is that it introduces genuinely novel content into the equation, influenced but not authored by actual human beings. Previous technologies shaped human thought and society by changing how human-generated content was created and distributed. Machines could assist, but for the most part, there was some human intention behind every bit of content.
AI removes that guarantee. John Cage was preoccupied with removing the element of human intention behind artistic creation—it turns out he need only have waited. Where the popularity of “delve” previously reflected some large-scale human social trend, whatever it might have been, now we have to account for a feedback-driven process by which LLMs inject themselves indistinguishably into quantitative analyses of content. Even this very post is influenced by LLMs, echoing their preoccupation with “delve,” so it too is tainted.
And as AI-generated content is irreversibly mixed with human-generated text, it will become impossible, in most large domains, to tease apart what humans are doing with text from what AI does with it.
There are two ways of looking at the inseparable mixing of human and non-human data. The first is that we are losing the ability to know ourselves through statistical analysis, because the underlying data no longer reflect who we are. The second is that this is in fact who we are now, and that LLMs and other coming AI forms will inevitably and inextricably constitute a significant part of how we see ourselves. We may not be cyborgs, but we can only see ourselves as though we were. The Turing Test turns out to have a third answer: it’s not a human or a machine we’re talking to, but both.