As more people use AI to create and publish content, a group of researchers from the UK and Canada has raised concerns about the implications of AI-generated content proliferating across the internet. Training AI models on the output of other AI models, they warn, could do irreversible damage to generative AI technology. The researchers found that learning from data produced by other models causes model collapse: a degenerative process in which, over successive generations, models forget the true underlying data distribution, producing distortions and misperceptions of reality.
Model collapse occurs when the data AI models generate contaminates the training sets of subsequent models. Generative models tend to overfit to popular data and to misunderstand or misrepresent rarer data, so they progressively lose the characteristics of minority data. As AI-generated data accumulates, models acquire a distorted perception of reality, produce more errors and repetitions, and ultimately offer less variety in the correct responses and content they generate.
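To make that dynamic concrete, here is a minimal toy simulation, not the researchers' actual experimental setup: the Gaussian "model", the sample sizes, and the 5% truncation rate are illustrative assumptions. Each generation fits a simple model to its training data, then produces the next generation's training data itself, with a built-in bias toward the most probable outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from the true distribution.
data = rng.normal(loc=0.0, scale=1.0, size=50_000)

for gen in range(1, 11):
    # Fit a toy generative model to the current training set
    # (here the "model" is just a Gaussian parameterised by mu, sigma).
    mu, sigma = data.mean(), data.std()

    # The next generation trains only on the previous model's output.
    samples = rng.normal(loc=mu, scale=sigma, size=50_000)

    # Mimic the bias toward popular data: keep the most probable samples
    # and discard the rarest 5%, i.e. the distribution's tails.
    cutoff = np.quantile(np.abs(samples - mu), 0.95)
    data = samples[np.abs(samples - mu) <= cutoff]

    print(f"gen {gen:2d}: sigma={data.std():.3f}, "
          f"tail mass beyond 2.0: {np.mean(np.abs(data) > 2.0):.4f}")
```

Each round the fitted width shrinks by a roughly constant factor, so after ten generations the "rare" events beyond two true standard deviations have all but vanished. That is the tail-forgetting behaviour the researchers describe, reduced to its simplest form.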
However, there are ways to mitigate model collapse with existing transformers and large language models (LLMs). One is to retain a pristine copy of the original, human-generated dataset and periodically retrain the model on it, or refresh the model entirely by training from scratch on that data. Another is to reintroduce new, clean, human-generated datasets into training. That would require a large-scale effort to label AI-generated versus human-generated content, and no such mechanism yet exists at scale. These findings highlight the need for better methodologies to maintain the integrity of generative models over time, and the continuing importance of human-generated content in training AI.
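As a rough illustration of the first idea, the toy simulation above can be extended so that every generation's training set is anchored with a slice of the retained human data. The 15,000-point anchor size is an arbitrary assumption for the sketch, not a figure from the research.

```python
import numpy as np

rng = np.random.default_rng(0)

# A retained, pristine copy of the original human-generated data.
pristine = rng.normal(loc=0.0, scale=1.0, size=50_000)
data = pristine.copy()

for gen in range(1, 11):
    # Fit the toy model and sample the next generation, with the same
    # popularity bias (drop the rarest 5%) as before.
    mu, sigma = data.mean(), data.std()
    samples = rng.normal(loc=mu, scale=sigma, size=50_000)
    cutoff = np.quantile(np.abs(samples - mu), 0.95)
    synthetic = samples[np.abs(samples - mu) <= cutoff]

    # Mitigation: blend a fixed slice of the pristine human data into
    # every generation's training set (15,000 points, chosen arbitrarily).
    anchor = rng.choice(pristine, size=15_000, replace=False)
    data = np.concatenate([synthetic, anchor])

    print(f"gen {gen:2d}: sigma={data.std():.3f}")
```

In this toy setting the human anchor stops the variance from collapsing toward zero, though it stabilises below the true value: mixing in pristine data slows degradation rather than eliminating it, which is consistent with the researchers' emphasis on periodically retraining against the original dataset.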