Yuanzhi Li is a computer scientist and researcher specializing in artificial intelligence and the theoretical foundations of machine learning. He is recognized for his contributions to understanding deep learning and optimization algorithms, and for his work on large language models, including as a key contributor to Microsoft's Phi series of models. He recently joined Meta's Superintelligence team. [3]
Yuanzhi Li attended Princeton University, where he earned a Ph.D. in computer science in 2018. His doctoral dissertation was titled "On the ability of gradient descent to learn neural networks," which investigated the theoretical principles governing the training of neural networks through gradient-based optimization techniques. [1] [3]
Li has built a career as a prolific researcher in machine learning, with an extensive record of publications in top-tier academic venues. His early research focused on the theoretical underpinnings of optimization, reinforcement learning, and matrix factorization. He later worked as a researcher at Microsoft Research, where he co-authored numerous technical reports and research papers. During his tenure, he was a central figure in the development of the Phi series of small language models (SLMs), which gained significant attention for achieving high performance with far fewer parameters than many larger models. He co-authored the technical reports for phi-1.5, Phi-3, and Phi-4, contributing to a line of research on the impact of data quality on model capabilities.
In July 2025, it was reported that Li was recruited by Meta Platforms to join its artificial intelligence research division. The move was part of a broader talent acquisition effort by Meta to enhance its AI research capabilities. A report from the South China Morning Post identified Li as one of several experts joining Meta's Superintelligence Labs. [1] [2] [3] [4] [5]
Li's research covers a broad spectrum of topics within machine learning and theoretical computer science. His work often seeks to answer fundamental questions about the mechanisms, capabilities, and limitations of deep learning models, with a focus on optimization dynamics, generalization, and feature learning. He has published over 200 papers in prominent conferences and journals, including NeurIPS, ICML, ICLR, COLT, FOCS, and STOC. [1] [5]
A significant portion of Li's research is dedicated to the theoretical properties of neural networks. He has co-authored foundational papers on the convergence and behavior of optimization algorithms such as Stochastic Gradient Descent (SGD) and Adam, especially within the context of over-parameterized models that are common in modern deep learning. His work in this area explores concepts like the implicit bias of algorithms, the critical role of initialization and learning rates in determining training outcomes, and the mechanisms that underlie adversarial robustness and self-supervised learning. Key publications that reflect his contributions to deep learning theory include "A Convergence Theory for Deep Learning via Over-Parameterization," "Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning," and "Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks." [1] [5]
In recent years, Li has shifted his focus to the principles and emergent abilities of large language models (LLMs). He was a co-author of the influential 2023 paper "Sparks of Artificial General Intelligence: Early experiments with GPT-4," which analyzed the advanced reasoning and problem-solving capabilities of the model, suggesting early signs of artificial general intelligence. He is also a primary author of the "Physics of Language Models" series of papers, which aims to establish a theoretical framework for understanding how LLMs store knowledge, manipulate information, and perform complex reasoning tasks.
Another of his significant contributions to the field is the paper "LoRA: Low-Rank Adaptation of Large Language Models." This work introduced a parameter-efficient fine-tuning technique that dramatically reduces the computational cost of adapting large pre-trained models to specific downstream tasks. LoRA has since become a widely adopted and standard method in the practical application of LLMs. [1] [5] [6]
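The core idea of LoRA can be sketched as follows. This is a minimal illustration, not the paper's implementation: a frozen pre-trained weight matrix is augmented with a trainable low-rank update, so fine-tuning touches only a small fraction of the parameters. Dimensions, rank, and initialization scales below are illustrative.

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA sketch: a frozen weight W is augmented with a
    trainable low-rank update B @ A, so only r * (d_in + d_out)
    parameters are trained instead of the full d_in * d_out."""

    def __init__(self, d_in, d_out, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))     # frozen pre-trained weight
        self.A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small random init
        self.B = np.zeros((d_out, r))               # trainable, zero init, so the
                                                    # update starts as a no-op
        self.scale = alpha / r

    def __call__(self, x):
        # Effective weight is W + scale * (B @ A), computed without ever
        # materializing the full-rank update matrix.
        return x @ self.W.T + (x @ self.A.T) @ self.B.T * self.scale

layer = LoRALinear(d_in=16, d_out=16, r=4)
x = np.ones((1, 16))
# Before any training, B is zero, so the LoRA path contributes nothing:
assert np.allclose(layer(x), x @ layer.W.T)
```

For a 16-by-16 layer with rank 4, the trainable update has 4 * (16 + 16) = 128 parameters versus 256 in the frozen matrix; at the scale of real LLM layers the savings are far larger, which is what makes the method practical.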
While at Microsoft Research, Li was a key member of the team that developed the Phi family of small language models. He is listed as a co-author on the technical reports for "Textbooks Are All You Need" (which introduced the concepts behind Phi-1), "Textbooks Are All You Need II: phi-1.5 technical report," the "Phi-3 Technical Report," and the "Phi-4 Technical Report." This research demonstrated that models trained on high-quality, "textbook-like" data could achieve performance on reasoning and language understanding benchmarks that was comparable to, or even exceeded, that of much larger models. The work challenged the prevailing view that model capability is primarily a function of scale (i.e., parameter count) and highlighted the critical importance of training data quality and curation.
In addition to his work on deep learning theory and LLMs, Li has made contributions to other areas of machine learning, including reinforcement learning, generative modeling, and convex optimization. His research in these domains includes theoretical analyses of Generative Adversarial Networks (GANs), the development of theory for diffusion models, and the design of efficient bandit algorithms. Representative papers from these areas include "Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions" and "Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning." [1] [5]
On June 6, 2023, the Cognitive Revolution podcast featured a discussion between Nathan Labenz, Ronen Eldan, and Yuanzhi Li of Microsoft Research regarding the Tiny Stories project. Li explained that the project involves a synthetic dataset of approximately 1.5 million children’s stories generated using GPT-4 and GPT-3.5. The dataset employs a restricted vocabulary of around 2,000 simple words and was designed to enable the training of small-scale language models ranging from 1 million to 33 million parameters, representing about 2% of GPT-2’s size.
According to Li, the project provides a framework for examining the development of core language abilities, such as grammar, factual recall, and basic logical operations, within smaller models. He stated that model depth is associated with the complexity of reasoning processes, while model width is linked to memory capacity for factual information. The models’ attention mechanisms were described as exhibiting two main patterns: “distance heads,” which focus on positional relationships between tokens, and “semantic heads,” which prioritize content relevance.
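The depth-versus-width trade-off Li describes can be made concrete with a back-of-envelope parameter count. The approximation below is the standard one for GPT-style decoders; the specific configurations are hypothetical examples chosen to land in the 1M-to-33M range discussed, not configurations from the source.

```python
def approx_transformer_params(depth, width, vocab=2048):
    """Rough parameter count for a GPT-style decoder: each block
    contributes ~12 * width^2 parameters (attention + MLP), plus a
    vocab x width embedding matrix (tied with the output head)."""
    return 12 * depth * width**2 + vocab * width

# Depth (associated with reasoning complexity) and width (associated
# with memory for facts) can trade off within a similar budget:
wide_shallow = approx_transformer_params(depth=2, width=512)
narrow_deep = approx_transformer_params(depth=8, width=256)
print(f"{wide_shallow / 1e6:.1f}M vs {narrow_deep / 1e6:.1f}M parameters")
```

Both hypothetical configurations fall within the 1-million-to-33-million-parameter range of the Tiny Stories models, illustrating how experimenters can vary depth and width independently while holding the overall budget roughly fixed.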
Li also noted that reasoning tasks are relatively uncommon in large-scale natural language datasets and may compete with factual memorization for model capacity. The Tiny Stories dataset, he explained, can be used to apply a form of curriculum learning in which linguistic and reasoning skills are introduced in a structured manner. In terms of interpretability, Li indicated that smaller models tend to allow clearer identification of neuron and attention head functions, whereas larger models distribute functions across more parameters, making them harder to analyze. He compared the practical control of models to horseback riding, where effective use does not require a complete understanding of internal processes.
The discussion outlined how the Tiny Stories framework can be applied to study the behavior, reasoning capabilities, and interpretability of language models under computationally limited conditions. [7]