Large language models (LLMs) are becoming increasingly popular, with companies like OpenAI and Microsoft releasing impressive new natural language processing (NLP) systems. However, recent research by Epoch suggests that we might soon run out of data for training AI models. The team estimated the amount of high-quality data available on the internet and concluded that it will be exhausted soon, likely before 2026. While the sources of low-quality data will be exhausted only decades later, the current trend of endlessly scaling models to improve results may well slow down soon.
Machine learning (ML) models generally perform better as the amount of data they are trained on grows. However, simply feeding more data to a model is not always the best solution, especially in the case of rare events or niche applications, where little data exists in the first place. This suggests that if we want to keep technological development from slowing down, we need to develop other paradigms for building machine learning models that do not depend on ever-larger amounts of data.
Alternative Solutions to Simply Feeding More Data to a Model
Scaling machine learning models presents a significant challenge due to the diminishing returns of increasing model size. As a model’s size continues to grow, its performance improvement becomes marginal, and it becomes harder to optimize and more prone to overfitting. Additionally, larger models require more computational resources and time to train, making them less practical for real-world applications.
One approach to overcoming this problem is to rethink what we consider high-quality and low-quality data. Creating more diversified training datasets could help overcome the limitations without reducing quality. Training the model on the same data more than once could also reduce costs and make more efficient use of the data we already have. However, this approach only postpones the problem: the more times we use the same data to train a model, the more prone the model becomes to overfitting.
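To make the trade-off of data reuse concrete, here is a minimal sketch, assuming a toy linear model and synthetic data (the function name `train_with_reuse`, the dataset sizes, and the learning rate are all illustrative choices, not taken from any particular system). It passes over the same small training set many times and tracks the loss on held-out data, so the gap between the two can be watched as the epochs accumulate.

```python
import numpy as np

def train_with_reuse(x_train, y_train, x_val, y_val, epochs=200, lr=0.01):
    """Full-batch gradient descent on a linear model.

    Every epoch reuses the same training set; the validation loss is
    recorded alongside the training loss so the two can be compared.
    """
    w = np.zeros(x_train.shape[1])
    n = len(y_train)
    train_losses, val_losses = [], []
    for _ in range(epochs):
        residual = x_train @ w - y_train
        w -= lr * (2.0 / n) * (x_train.T @ residual)  # gradient of the mean squared error
        train_losses.append(float(np.mean((x_train @ w - y_train) ** 2)))
        val_losses.append(float(np.mean((x_val @ w - y_val) ** 2)))
    return train_losses, val_losses

# Synthetic data (an assumption for illustration): 15 features, only 3 of
# which matter, and just 20 training examples, so memorizing noise is easy.
rng = np.random.default_rng(0)
true_w = np.zeros(15)
true_w[:3] = [1.0, -2.0, 0.5]
x_train = rng.normal(size=(20, 15))
y_train = x_train @ true_w + rng.normal(scale=0.5, size=20)
x_val = rng.normal(size=(200, 15))
y_val = x_val @ true_w + rng.normal(scale=0.5, size=200)

train_losses, val_losses = train_with_reuse(x_train, y_train, x_val, y_val)
best_epoch = int(np.argmin(val_losses)) + 1  # epoch with the lowest held-out loss

print(f"training loss: {train_losses[0]:.3f} -> {train_losses[-1]:.3f}")
print(f"validation loss was lowest at epoch {best_epoch} of {len(val_losses)}")
```

The training loss keeps falling with every additional pass, which is exactly why it is a poor signal on its own: only the held-out loss shows whether further reuse of the same data is still helping.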
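Another alternative is to use machine learning approaches that differ from today’s dominant methods, such as JEPA (Joint Embedding Predictive Architecture), proposed by Yann LeCun. Rather than reconstructing every detail of its input, a JEPA model learns to predict the representation of one part of the input from the representation of another part, working in an abstract embedding space. This allows it to handle complex, high-dimensional data while ignoring details that cannot be predicted anyway.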
Data augmentation techniques can also be used to modify existing data to create new data, reducing overfitting and improving a model’s performance. Transfer learning involves using a pre-trained model and fine-tuning it to a new task, saving time and resources as the model has already learned valuable features from a large dataset.
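As an illustration of the data augmentation idea, the following is a minimal sketch, assuming image-like arrays with values in [0, 1]. The helper name `augment_images`, the noise level, and the two chosen transformations (a horizontal flip and additive Gaussian noise) are assumptions made for the example; real pipelines typically use a richer set of transformations.

```python
import numpy as np

def augment_images(images, noise_std=0.05, seed=0):
    """Return the originals plus a flipped and a noisy copy of each image.

    images: array of shape (n, height, width) with values in [0, 1].
    """
    rng = np.random.default_rng(seed)
    flipped = images[:, :, ::-1]  # mirror each image left to right
    noise = rng.normal(0.0, noise_std, size=images.shape)
    noisy = np.clip(images + noise, 0.0, 1.0)  # keep pixel values in range
    return np.concatenate([images, flipped, noisy], axis=0)

# Four random 8x8 "images" stand in for a real dataset.
batch = np.random.default_rng(1).random((4, 8, 8))
augmented = augment_images(batch)
print(augmented.shape)  # (12, 8, 8)
```

The dataset triples in size without any new data being collected, but every new example is derived from an existing one, so the augmented set adds variety rather than genuinely new information.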
While we can still use data augmentation and transfer learning today, these methods don’t solve the problem once and for all. More effective methods that do not rely on ever-growing datasets still need to be developed. After all, for a human, it’s enough to observe just a couple of examples to learn something new. Maybe one day, we’ll invent AI that will be able to do that too.