AI’s Darkest Hour: What Is ‘Model Collapse’ and Should We Be Worried?

In recent discussions among experts, a new concern has emerged regarding the future of artificial intelligence (AI). This concern, known as “model collapse,” is gaining attention as the use of AI continues to expand. Model collapse refers to a situation in which AI systems, which depend heavily on large amounts of data for their learning, begin to degrade in performance because they are increasingly trained on data generated by other AI systems rather than by humans.

Today’s AI systems rely on machine learning, a process that allows a system to learn by identifying patterns in data. The effectiveness of this learning, however, is highly dependent on the quality of the data. The current generation of AI models, such as those developed by OpenAI, Google, Meta, and Nvidia, requires vast amounts of high-quality data to function well. Historically, this data has been sourced from human-generated content found across the internet.

However, there has been a noticeable shift since the widespread adoption of generative AI technologies in 2022. Increasingly, the content uploaded to the internet is created, at least in part, by AI.

When AI systems are trained on data generated by previous AI models, their performance degrades over time. This practice, which experts call “regurgitative training” and often compare to the genetic problems caused by inbreeding, is at the heart of the model collapse concern.

This shift has prompted researchers to ask whether it is possible to train AI systems effectively using only AI-generated data. The appeal of the approach lies in the cost savings and legal simplicity of using AI-made content instead of human-created data. Yet the findings have been worrying: as each new model is trained on data that includes AI-generated content, the diversity and accuracy of its outputs decrease. This feedback loop can lead to a major decline in the quality and diversity of AI outputs. In practical terms, the AI becomes less capable of producing useful, varied, and accurate responses.
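To make that mechanism concrete, here is a minimal, self-contained sketch of the feedback loop. It stands in for a real training pipeline with something deliberately tiny: the “model” at each generation is nothing more than a phrase-frequency table, and all names and numbers are invented for illustration.

```python
# A toy "language model" is just the empirical frequency table of its
# training corpus. Every new generation is trained only on text sampled
# from the previous generation. All names and numbers are illustrative.
import random
from collections import Counter

random.seed(42)

# Generation 0 is "human" data: 1,000 distinct phrases with a long tail
# (phrase i appears roughly in proportion to 1/i, as in natural language).
vocab = list(range(1000))
zipf_weights = [1 / (i + 1) for i in vocab]
corpus = random.choices(vocab, weights=zipf_weights, k=5000)

for generation in range(1, 11):
    # "Train": the model is the empirical distribution of its corpus.
    counts = Counter(corpus)
    phrases, freqs = list(counts), list(counts.values())
    # Produce the next corpus purely from the model's own output.
    corpus = random.choices(phrases, weights=freqs, k=5000)
    print(f"generation {generation:2d}: {len(set(corpus)):4d} distinct phrases")

# Once a rare phrase fails to be sampled it is gone for good: the model
# can never emit it again, so diversity only ever shrinks.
```

The frequency table is a stand-in for a full language model because the effect comes from the sampling-and-refitting loop itself, not from the model class: rare phrases occasionally fail to be sampled, and once absent they can never reappear.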
One of the major challenges in avoiding model collapse is the difficulty of filtering AI-generated content out of the data used to train new models. Tech companies already invest considerable resources in cleaning and filtering the data they scrape from the internet, sometimes discarding as much as 90% of what they collect. As AI-generated content becomes more prevalent, distinguishing it from human-created data will become increasingly difficult, making this task even more resource-intensive.
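As a rough illustration of what such cleaning involves, the sketch below filters documents with two toy heuristics, a length bound and a repetition check. Both rules are invented for this example; production pipelines combine far more signals, and, as the article notes, none of them can reliably tell AI-generated text from human writing.

```python
# Two invented heuristics (a length bound and a repetition check) stand in
# for the kind of rule-based cleaning the article describes. Real pipelines
# combine many more signals, and no heuristic this simple can reliably
# separate AI-generated text from human text.
def repetition_ratio(text: str) -> float:
    """Fraction of words that are repeats; a crude proxy for spammy text."""
    words = text.lower().split()
    if not words:
        return 1.0
    return 1.0 - len(set(words)) / len(words)

def keep_document(text: str) -> bool:
    words = text.split()
    if not 20 <= len(words) <= 10_000:  # drop tiny snippets and megapages
        return False
    if repetition_ratio(text) > 0.5:    # drop highly repetitive pages
        return False
    return True

docs = [
    "buy now " * 50,                            # repetitive spam: dropped
    " ".join(f"word{i}" for i in range(100)),   # varied prose: kept
]
print([keep_document(d) for d in docs])  # [False, True]
```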
Despite these concerns, human-generated data remains indispensable for AI systems: it provides the richness and diversity that AI needs to mimic human understanding and behavior. Relying on AI-generated data alone could leave models unable to reflect the complexity and variety of human culture and language. This is particularly concerning given estimates that the available pool of human-generated text data could be exhausted as early as 2026.
In response to these challenges, companies like OpenAI are rushing to secure exclusive partnerships with major content providers such as Shutterstock, Associated Press, and NewsCorp. These partnerships give them access to large collections of human-generated data that are not publicly available online, and they are seen as essential to preventing model collapse and keeping AI models effective.
However, the threat of a catastrophic model collapse may be somewhat exaggerated. Current research suggests that as long as human and AI data accumulate together, rather than synthetic data replacing human data outright, the likelihood of a total collapse is reduced. Furthermore, a future in which various AI platforms create and publish content, rather than a single dominant model, could help maintain the robustness of AI systems.
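The earlier sketch can be extended to show why accumulation helps. The hypothetical comparison below trains one lineage of toy models on synthetic output alone (“replace”) and another on everything gathered so far, human data included (“accumulate”); as before, every detail is illustrative rather than drawn from the cited research.

```python
# The same toy setup as before, run two ways: "replace" trains each
# generation only on the previous generation's synthetic output, while
# "accumulate" keeps the original human corpus and all synthetic data in
# the training mix. Everything here is illustrative.
import random
from collections import Counter

def next_corpus(corpus, k=5000):
    """Sample a synthetic corpus from the empirical distribution of corpus."""
    counts = Counter(corpus)
    return random.choices(list(counts), weights=list(counts.values()), k=k)

random.seed(42)
vocab = list(range(1000))
zipf_weights = [1 / (i + 1) for i in vocab]
human = random.choices(vocab, weights=zipf_weights, k=5000)

replace, accumulate = human, list(human)
for generation in range(1, 11):
    replace = next_corpus(replace)            # synthetic data only
    accumulate += next_corpus(accumulate)     # human + all synthetic so far
    print(f"generation {generation:2d}: "
          f"replace={len(set(replace)):4d} "
          f"accumulate={len(set(accumulate)):4d} distinct phrases")

# In the replace regime, diversity ratchets downward; in the accumulate
# regime the human data never leaves the mix, so no phrase is ever lost.
```

Because the accumulate regime only ever adds data, the original human phrases remain in the training pool indefinitely, which is the intuition behind the finding that mixed accumulation reduces the risk of total collapse.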
Regulatory measures could also play a crucial role in preventing model collapse. By promoting competition and limiting monopolies in the AI sector, regulators can encourage the development of diverse AI models, reducing the risk of collapse. Additionally, funding public-interest technology development could help maintain the quality of AI systems.
Beyond the technical questions, there are broader societal concerns tied to the rise of AI-generated content. While an abundance of synthetic content might not directly threaten the development of AI, it could undermine the digital public good of the internet. For example, research has shown a marked decrease in user activity on platforms like StackOverflow following the release of AI tools such as ChatGPT, suggesting that AI may already be reducing direct human interaction in online communities. Moreover, the proliferation of AI-generated content is making it increasingly difficult to find information online that isn’t clickbait or stuffed with advertisements. This trend raises questions about the future of the internet as a space for genuine human interaction and knowledge sharing.
One proposed solution is to watermark or label AI-generated content, making it easier to distinguish from human-created material, an idea reflected in recent Australian government legislation and gaining traction among experts. Another concern is the potential loss of socio-cultural diversity as AI-generated content becomes more homogeneous. If AI systems continue to produce outputs that lack cultural and social variety, certain groups risk cultural erasure. Addressing these challenges will require cross-disciplinary research into AI’s social and cultural implications.
While model collapse presents a serious challenge to the future of AI, it is not insurmountable. By prioritizing human-generated data, fostering competition within the AI industry, and addressing the broader societal impacts of AI, companies can ensure that these systems remain effective and beneficial.

For more news and insights, visit AI News on our website.