AI systems risk deteriorating into nonsense as increasing amounts of internet content are generated by artificial intelligence, researchers warn. Recent advancements in text-generating systems like OpenAI’s ChatGPT have spurred excitement and a surge in AI-generated content on the internet. However, these systems are trained using text from the internet, potentially creating a feedback loop where AI-produced text trains future AI models.
This cycle could rapidly degrade AI tools into producing gibberish, according to a new study, “AI models collapse when trained on recursively generated data,” published in Nature.
Concerns are rising about the “dead internet theory,” which posits that the web is becoming increasingly automated, creating a vicious cycle. The research shows that only a few iterations of generating content and then retraining on it are enough for AI systems to produce nonsensical outputs. In one test, a model fed text about medieval architecture devolved into repetitive, irrelevant lists after just nine generations.
This phenomenon, termed “model collapse,” occurs when AI systems are trained on datasets that include AI-generated content, leading to increasingly corrupted outputs. As AI produces more data and retrains on it, less common data gets omitted, resulting in less diverse outputs. For example, a system trained on images of dog breeds and repeatedly retrained on its own outputs may come to generate only the most common breed, before collapsing into irrelevance altogether.
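The diversity-loss dynamic can be illustrated with a toy simulation: a “model” that is just an empirical distribution over dog breeds, repeatedly retrained on finite samples of its own output. This is a sketch, not the paper’s actual experiment; the breed names and frequencies below are invented for illustration.

```python
import random
from collections import Counter

def retrain_generation(population, sample_size, rng):
    """Sample a finite training set from the current model's output
    distribution, then 'retrain' by adopting the empirical counts.
    Any breed that fails to appear in the sample is lost forever."""
    sample = rng.choices(population, k=sample_size)
    counts = Counter(sample)
    return [breed for breed, n in counts.items() for _ in range(n)]

rng = random.Random(0)
# Hypothetical starting data: one common breed, several rare ones.
population = (["labrador"] * 80 + ["beagle"] * 10 + ["saluki"] * 5
              + ["otterhound"] * 3 + ["azawakh"] * 2)

for gen in range(10):
    population = retrain_generation(population, sample_size=100, rng=rng)
    print(f"generation {gen}: {sorted(set(population))}")
```

Because each generation can only contain breeds that survived the previous sample, the set of breeds never grows; sampling noise steadily eliminates the rare ones first, which mirrors how recursive retraining erases less common data.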
The same effect impacts large language models like ChatGPT and Google’s Gemini. This could render the systems not only useless but also less reflective of the world’s diversity, potentially erasing smaller groups or perspectives.
To sustain the benefits of training on large-scale web data, the issue must be addressed. Companies that scraped the web before AI-generated content proliferated may hold an advantage, since their training data contains more genuine human writing. Potential solutions include watermarking AI-generated content so it can be filtered out of training sets, but implementation challenges remain.
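The watermark-and-filter idea reduces to a simple pipeline step: run every scraped document through a detector and keep only those it does not flag. The sketch below assumes a hypothetical `detect_watermark` function; no real detector or watermarking scheme is implied, and the `[AI]` prefix check is a placeholder for illustration only.

```python
def detect_watermark(text: str) -> bool:
    # Placeholder stand-in for a real detector, which would look for
    # statistical signatures embedded at generation time.
    return text.startswith("[AI]")

def filter_training_corpus(documents):
    """Keep only documents the detector does not flag as AI-generated."""
    return [doc for doc in documents if not detect_watermark(doc)]

corpus = ["[AI] synthetic paragraph", "human-written field notes"]
print(filter_training_corpus(corpus))  # ['human-written field notes']
```

The practical difficulty the article alludes to is that this only works if generators reliably embed watermarks and detectors reliably find them, which is exactly where the implementation challenges lie.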