Research & LibraryResearch Articles Library Leaderboard

Better Data Beats More Data

Li, Y., Bubeck, S., Eldan, R., Del Giorno, A., Gunasekar, S., & Lee, Y. T. (2023). Textbooks Are All You Need II: phi-1.5 Technical Report. arXiv:2309.05463.

Read Original Paper

Better Data Beats More Data - Research Breakthrough Illustration

The 2023 Phi-2 paper from Microsoft Research challenged the standard scaling laws of artificial intelligence by prioritizing data quality over volume. While most models were being trained on trillions of tokens of raw web-crawled text, researchers focused on high-signal, textbook-quality data. By curating a mixture of filtered educational content and synthetic reasoning examples, they developed a 2.7-billion parameter model that could match the reasoning capabilities of systems twenty-five times its size. This shift demonstrated that model intelligence is a function of the clarity of the training signal rather than the sheer volume of exposure.

The use of synthetic reasoning data allows for the targeted teaching of specific logical concepts. Researchers employed larger models to generate thousands of short stories and exercises designed to demonstrate common sense, science, and social logic. This approach suggests that synthetic data can be more effective at addressing cognitive gaps than raw human-generated text. The model acts as its own instructor, creating digestible examples for a smaller architecture. This focus on structured knowledge also led to a significant reduction in toxic content, as the model was never exposed to the biases common in raw web data.

A model with 2.7 billion parameters can outperform much larger architectures on reasoning and coding tasks if its training signal is sufficiently dense. This finding suggests that the efficiency frontier of small models is further than previously assumed, making high-level logic accessible on mobile and edge devices. Intelligence in this context is framed as the ability to process high-signal information without the overhead of noise and repetition. This discovery shifts the emphasis of AI research from the brute force of data collection to the surgical curation of knowledge.

The success of Phi-2 indicates that the scaling of models may be limited by data quality rather than computation. If the clarity of information determines reasoning performance, then the next generation of AI may come from a deeper understanding of how to construct the ideal dataset for specific cognitive goals. This moves the field toward a more deliberate and scientific approach to teaching machines. In the future, the content of what a model learns will likely be more important than the total amount of data it has seen.

As models become more efficient at learning from high-quality sources, the cost of building capable AI systems is expected to decrease. This democratization of intelligence allows for the deployment of advanced reasoning in environments with limited power or memory. The principles established by the Phi project provide a blueprint for creating compact, safe, and highly capable systems that do not rely on massive infrastructure. The focus on textbook-quality data marks a significant step in the move toward more efficient and reliable artificial intelligence.

Join the EulerFold community

Track progress and collaborate on roadmaps with students worldwide.

🐢

#Dive Deeper

Phi-2 Blog Post
Microsoft Research • article
Explore Resource
Phi-1.5 Technical Report
arXiv • article
Explore Resource
Phi-2 Model on Hugging Face
Hugging Face • model
Explore Resource

Discussion

Join the discussion

Loading insights...

Join the EulerFold community

#Dive Deeper

Discussion

Join the discussion

Recommended Readings