With millions of dollars invested in AI development, it is widely assumed that AI, at its current rate of scaling, will become super-intelligent sooner than expected. After all, since the release of GPT-3, multiple AI labs have emerged with numerous AI products, some successful, some less so. But that momentum has come to a standstill. The reason? Data scarcity.
For AI to become super-intelligent and start solving the mysteries of the universe, it needs trillions of structured and labeled data points. And we are far from that. Recent reporting by The Information on the delay of OpenAI's next-generation flagship LLM, Orion, suggests that Orion shows only small gains over its predecessors and is unreliable at certain tasks, notably coding, where GPT-4 remains the front-runner even though Orion has stronger language skills.
Bloomberg also reported the same, citing two anonymous sources, stating that Orion “fell short” and “is so far not considered to be as big a step up from OpenAI’s existing models as GPT-4 was from GPT-3.5.”
Reuters reports that “researchers at major AI labs have been running into delays and disappointing outcomes in the race to release a large language model that outperforms OpenAI’s GPT-4 model, which is nearly two years old, according to three sources familiar with private matters.”
This lends weight to years of speculation that AI is close to hitting a scaling wall. And we cannot say that we are surprised. Roughly six to seven years ago, data protection laws came into effect in Europe, followed by the U.S. and Asia. These laws affected millions of businesses, making them not only responsible for protecting Personally Identifiable Information (PII) but also liable in the event of data breaches and exposure. This disrupted the global data ecosystem, as collecting, using, and sharing customer data became exponentially more complex, both legally and technically. These privacy laws were among the early contributors to data scarcity, as they made private data largely inaccessible.

But that is not all. Raw human data is generated daily in abundance through social media, IoT devices, credit cards, and more, yet it almost always requires extensive cleaning, labeling, and processing before it is useful for AI/ML model training.
1. Is High-Quality Training Data a Scarce Resource?
Are we running out of data? Has AI scaling hit a wall? These questions have been asked repeatedly over the past few months. While OpenAI’s CEO and co-founder, Sam Altman, remains optimistic, research into available public text data to train LLMs suggests otherwise.
Studies indicate that public, human-generated text data—critical for AI training—may be exhausted between 2026 and 2032, posing a critical bottleneck for traditional AI development approaches.
The 2024 AI Index Report by Stanford University highlights a significant concern: the potential depletion of high-quality language data for AI training by 2024, with low-quality language data possibly exhausted within two decades, and image data fully consumed between the late 2030s and mid-2040s.
As briefly discussed above, to understand data scarcity, specifically in the context of AI development and machine learning, we need to consider two key factors:
a. Enterprises Must Protect Data:
Privacy is a two-fold challenge for enterprises knee-deep in data:
i. Managing Legality
Enterprises must comply with evolving privacy regulations like GDPR, CCPA, and PDPC. These laws vary by region, making compliance complex. Organizations need strong legal frameworks to manage consent, data storage, cross-border transfers, and retention policies. Failure to comply can lead to heavy fines and reputational harm.
ii. Managing Technology
Enterprises must invest in Privacy-Enhancing Technologies (PETs) to protect data. Key solutions include encryption, access controls, data masking, and synthetic data, especially for AI/ML applications.
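To make the technology side concrete, here is a minimal sketch of one common PET pattern, pseudonymization with masking, in Python. It assumes hypothetical customer records with `name`, `email`, and `purchase_amount` fields and a secret salt managed elsewhere; it illustrates the idea of stripping or tokenizing direct identifiers before data is shared, not any particular vendor's product.

```python
import hashlib
import hmac

# Hypothetical customer records; field names are illustrative only.
records = [
    {"name": "Alice Tan", "email": "alice@example.com", "purchase_amount": 120.50},
    {"name": "Bob Lee", "email": "bob@example.com", "purchase_amount": 89.99},
]

# Assumption: in practice this key would live in a secrets manager, not in code.
SECRET_SALT = b"store-me-in-a-vault"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a keyed hash so records can still be
    joined on a stable token without exposing the raw PII."""
    return hmac.new(SECRET_SALT, value.encode(), hashlib.sha256).hexdigest()[:16]

def mask_record(record: dict) -> dict:
    return {
        "customer_token": pseudonymize(record["email"]),  # stable join key, no raw email
        # Direct identifiers (name, email) are dropped entirely.
        "purchase_amount": record["purchase_amount"],  # analytic value is kept
    }

masked = [mask_record(r) for r in records]
print(masked)
```

Masking of this kind preserves enough signal for analytics while reducing re-identification risk, which is exactly the trade-off enterprises face when preparing data for AI/ML work.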
Together, these create substantial challenges for enterprises training AI models, such as:
- Limited Data Access: Compliance with laws like GDPR, CCPA, and PDPC often requires obtaining explicit user consent. Many users opt out, limiting the availability of data for processing and analysis.
- Data Minimization Requirements: Regulations mandate that only the necessary amount of data be collected and processed, reducing the volume of available data.
- Restricted Cross-Border Data Flows: Laws like PDPC impose restrictions on data transfers across regions, making it harder for global enterprises to consolidate data for analysis.
- Retention Periods: Regulations enforce strict timelines for data retention. Enterprises must delete data after a certain period, even if it could be valuable for long-term projects.
- Data Anonymization and Masking: PETs like anonymization and masking transform sensitive data into less detailed or less usable forms, making it harder to extract insights from the data or use it to train AI/ML models.
- Access Controls and Encryption: Strict access controls and encryption policies can limit data sharing across teams or systems, reducing the amount of data available for collaboration and innovation.
- Cost of Compliance Technology: Investments in PETs can strain resources, leading to the prioritization of certain data projects over others, indirectly reducing the usable data pool.
These restrictions hinder operations, and enterprises often struggle to integrate legacy data anonymization systems with modern privacy-enhancing technologies, making data management and protection even more challenging. This is why most AI labs rely on publicly available text data for AI scaling. But that approach has its own problems.
b. There Just Isn’t Enough Good Quality Public Data:
AI needs structured data to reason, i.e., high-fidelity data organized and labeled neatly in rows and columns. While we do have an abundance of data, with trillions of new data points generated daily through social media, online messaging, internet browsing, and so on, a large portion of this public data is noisy, unstructured, or irrelevant for many AI applications—especially scaling AI to general intelligence.
The situation is far worse for specific use cases in fields like healthcare or finance, where biased, incomplete, and unbalanced data leads to ineffective models with poor reasoning and logical capabilities. This lack of high-dimensional, high-depth, and high-variety data helps explain why state-of-the-art LLMs like Orion or Llama have not achieved leaps comparable to those of their predecessors, such as the jump from GPT-3 to GPT-4.
2. What Can Synthetic Data Do?
Epoch AI's report "Can AI Scaling Continue Through 2030?" estimated that AI models will scale up by 10,000x by 2030 and will require training datasets of unprecedented size, potentially exceeding trillions of data points and demanding roughly 80,000x more data than today. Current human-generated datasets fall short of this demand, with only a finite supply of structured, labeled, and high-quality data available.
Synthetic data is a scalable, diverse, and privacy-compliant alternative that bypasses most of these challenges; you could call it a wall-shattering approach to advanced AI training. Synthetic data is generated by machine learning models such as generative adversarial networks (GANs), LLMs, and other deep generative models (DGMs). These generative models are first trained on real data, where they learn its statistical properties, distributions, patterns, and features, and are then used to produce realistic synthetic data.
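The fit-then-sample loop behind these generators can be illustrated with a deliberately simple stand-in. Instead of a GAN or LLM, the toy sketch below fits a multivariate Gaussian to the numeric columns of a fabricated table and then samples new rows that preserve the original means and correlations; the column names, numbers, and NumPy-only approach are assumptions for illustration, and real deep generative models learn far richer structure.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a real tabular dataset with columns: age, income, monthly_spend.
age = rng.normal(40, 10, 1_000)
income = rng.normal(60_000, 15_000, 1_000)
spend = 0.02 * income + rng.normal(0, 200, 1_000)  # spend is correlated with income
real = np.column_stack([age, income, spend])

# "Training": learn the statistical properties of the real data.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# "Generation": sample as many synthetic rows as needed. No synthetic row maps
# back to a real individual, and rare regions can be oversampled on demand.
synthetic = rng.multivariate_normal(mean, cov, size=5_000)

print("real means:     ", np.round(mean, 1))
print("synthetic means:", np.round(synthetic.mean(axis=0), 1))
```

GANs, LLMs, and other DGMs replace the Gaussian with learned neural distributions, but the workflow, fit to real data and then sample new records, is the same.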
Properly generated synthetic data contains no real PII, which makes it secure and largely free of data protection liability. These models also allow users to enhance synthetic data by filling in missing information, removing biases and irregularities, and balancing training datasets. Synthetic data is also scalable: more data points can be generated with the same patterns as the real data, especially for edge cases or rare scenarios. All of this enables data and AI teams working on machine learning to:
- Generate synthetic data to meet the specific needs of an AI model, eliminating dependence on limited or hard-to-access real-world data.
- Quickly create data in large quantities, reducing the time needed for data collection and preprocessing, speeding up AI model training and iterations, and accelerating development timelines.
- Increase the robustness and generalizability of AI models by exposing them to a broader range of inputs.
- Reduce the need for expensive data collection and annotation processes.
- Ensure compliance with data protection regulations like PDPC, GDPR, and HIPAA, making data more accessible for development.
- Facilitate testing under controlled conditions, ensuring AI performs well in targeted use cases.
- Provide data for nascent technologies or industries where real-world datasets are unavailable (e.g., quantum computing, new environmental studies) or bridge gaps in data availability for underrepresented geographic regions or demographics.
3. The Future of AI
Data scarcity grows in step with advances in AI: as model capabilities increase, so does the need for massive amounts of data. In a scenario where structured and labeled data is limited, synthetic data empowers AI labs to build models more efficiently, ethically, and cost-effectively while improving model performance across diverse applications, keeping the future of AI bright.
It is also worth mentioning that increasing data quantity and quality is not the only strategy for overcoming the apparent scaling wall. How we train and run our models is equally critical, and training smarter models is not always about scaling them 100x. For example, AI developers are experimenting with test-time compute, which improves the reasoning capabilities of existing models by letting them spend more inference time exploring different options before choosing the best possible answer.
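A common form of test-time compute is best-of-N sampling: the same model is queried several times and a scoring step keeps the strongest candidate. The sketch below illustrates only that control flow; `generate_candidate` and `score_candidate` are hypothetical placeholders standing in for a model call and a verifier or reward model, not any specific vendor API.

```python
import random

def generate_candidate(prompt: str) -> str:
    """Hypothetical stand-in for one sampled model response."""
    return f"candidate-{random.randint(0, 9)} for: {prompt}"

def score_candidate(prompt: str, answer: str) -> float:
    """Hypothetical verifier/reward model; here it simply scores at random."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    # Spend extra inference-time compute: sample n answers, keep the best-scoring one.
    candidates = [generate_candidate(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: score_candidate(prompt, ans))

print(best_of_n("What is the next prime after 31?"))
```

Raising n trades additional inference-time compute for better answers without requiring any additional training data.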
Scaling AI models is a complex equation built on three key factors: training data, computing power, and model development strategy. To move forward, improving all three factors becomes essential—though easier said than done.