Xiaohua Zhai (翟晓华) is a computer science researcher known for his work in computer vision, multimodal learning, and large-scale artificial intelligence models. He has contributed to the development of influential models and techniques, including the Vision Transformer (ViT), Big Transfer (BiT), and Sigmoid Loss for Language Image Pre-Training (SigLIP), and is part of Meta Superintelligence Labs. [1] [9]
Zhai attended Peking University, where he earned a Bachelor's degree in Computer Science and Technology from 2005 to 2009. He remained at the same institution for his Ph.D. in Computer Science, which he completed between 2009 and 2014 under the supervision of Yuxin Peng. His early research focused on areas such as cross-media retrieval and heterogeneous metric learning. [2] [3] [1]
After completing his Ph.D., Zhai joined Google in 2015 as a Software Engineer. He transitioned to a research role at Google Brain in 2017 and later moved to Google DeepMind in 2023. At Google DeepMind, he held the position of Senior Staff Research Scientist and Tech Lead Manager, leading a multimodal research group based in Zürich. His team focused on developing multimodal datasets like WebLI, creating open-weight models such as SigLIP and PaliGemma, and researching inclusivity in AI through data balancing and cultural diversity studies. After nearly a decade at Google, Zhai announced in late 2024 that he would be joining OpenAI's Zürich office as a Member of Technical Staff.
In mid-2025, Zhai, along with close collaborators Lucas Beyer and Alexander Kolesnikov, announced their move from OpenAI to join Meta. This move was part of a broader recruitment effort by Meta to build its Meta Superintelligence Labs (MSL), a team dedicated to developing advanced AI capabilities. While the trio's addition to the MSL roster was confirmed, reports noted that their formal onboarding was still pending due to administrative technicalities.
Zhai's research has been influential in the fields of computer vision and vision-language modeling. He has co-authored numerous papers that have introduced foundational models and techniques for training large-scale AI systems. His work often focuses on transfer learning, representation learning, and scaling models efficiently.
Zhai was part of the team that developed "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," the paper that introduced the Vision Transformer (ViT). This work demonstrated that a pure transformer architecture, applied directly to sequences of image patches, could achieve state-of-the-art results in image classification, challenging the dominance of convolutional neural networks (CNNs). He also co-authored "Scaling Vision Transformers," which systematically studied the scaling properties of ViTs and showed how performance could be improved by scaling the model size, dataset size, and training compute. This research provided key insights into how to effectively train very large vision models.
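The core idea of ViT described above — treating an image as a sequence of flattened patches that a standard transformer can consume as tokens — can be illustrated with a minimal NumPy sketch. The function name and shapes here are illustrative, not taken from the paper's codebase; the 16x16 patch size and 224x224 input match the paper's common configuration.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an image of shape (H, W, C) into a sequence of flattened
    patches, as in ViT's "image is worth 16x16 words" setup.
    Illustrative sketch only; real ViT code also applies a learned
    linear projection and adds position embeddings."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Carve the image into a grid of (patch_size x patch_size) tiles...
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    # ...group the grid axes together, then flatten each tile into a vector.
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch_size * patch_size * c)

# A 224x224 RGB image becomes (224/16)^2 = 196 tokens of dimension 768.
tokens = patchify(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768)
```

Each row of the output then plays the same role as a word embedding in an NLP transformer, which is why no convolutional layers are needed.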
A significant portion of Zhai's work centers on pre-training models for general visual representation that can be effectively transferred to various downstream tasks. He was a core contributor to "Big Transfer (BiT): General Visual Representation Learning," which introduced a set of pre-trained models on large datasets (ImageNet-21k and JFT-300M) that achieved high performance on a wide range of vision tasks with minimal fine-tuning. He also co-created the Visual Task Adaptation Benchmark (VTAB), a suite of diverse vision tasks designed to evaluate the generalization capabilities of pre-trained models.
Zhai has made key contributions to multimodal research, particularly in combining vision and language. He co-developed SigLIP, which trains paired image and text encoders using a pairwise sigmoid loss in place of the softmax-based contrastive objective used by earlier CLIP-style models, and his Zürich team released the open-weight vision-language model PaliGemma and the large-scale multimodal dataset WebLI.
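The sigmoid objective at the heart of SigLIP can be sketched as follows. Unlike a softmax contrastive loss, every image-text pair in the batch is scored independently as a binary classification: matched pairs (the diagonal) are positives, all others negatives. This is a simplified NumPy sketch; in the actual method the temperature `t` and bias `b` are learnable parameters, and the values below are only illustrative.

```python
import numpy as np

def siglip_style_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over all image-text pairs in a batch.
    Simplified sketch: t (temperature) and b (bias) are learnable
    in the real method; fixed illustrative values are used here."""
    # L2-normalize both sets of embeddings.
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = t * img @ txt.T + b              # (n, n) pair similarities
    labels = 2 * np.eye(len(img)) - 1         # +1 on diagonal, -1 elsewhere
    # Binary log-loss log(1 + exp(-y * z)), averaged over every pair.
    return np.mean(np.log1p(np.exp(-labels * logits)))
```

Because no batch-wide normalization is required, this formulation decouples the loss from the global batch structure, which is part of what makes it attractive for large-scale training.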
Zhai has also worked on self-supervised and semi-supervised learning methods. He was a co-author of "S4L: Self-Supervised Semi-Supervised Learning," which explored combining self-supervision with traditional supervised learning to improve model performance, especially in low-data regimes. Another notable work, "Knowledge distillation: A good teacher is patient and consistent," investigated how to improve the distillation process by ensuring the teacher model provides consistent and stable guidance to the student model over time.
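The distillation setup discussed in that paper builds on the standard soft-label objective, in which a student network is trained to match the teacher's temperature-softened output distribution. The sketch below shows only this generic loss in NumPy; the paper's "patient and consistent" contribution concerns the training recipe around it (long schedules, and teacher and student seeing the same augmented view), not a different loss function.

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax at temperature T."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence from teacher to student at temperature T,
    scaled by T^2 as is conventional so gradients keep a comparable
    magnitude across temperatures. Generic sketch, not the paper's code."""
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)   # student's softened predictions
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean() * T ** 2
```

When the student's logits exactly match the teacher's, the loss is zero; training drives the student toward the teacher's full output distribution rather than just its argmax labels.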
Throughout his career, Zhai has been an active member of the machine learning research community. He has served as a reviewer for major AI conferences, including CVPR, ICCV, ICML, ICLR, NeurIPS, and AAAI, as well as for academic journals such as JMLR, TPAMI, and TNNLS. He has also co-organized workshops and tutorials at top conferences, such as the CVPR 2022 tutorial "Beyond Convolutional Neural Networks" and the NeurIPS 2021 workshop "ImageNet: past, present, and future." From 2012 to 2013, during his Ph.D. studies, he served as the Chairman of the 14th CCF YOCSEF GS (China Computer Federation, Young Computer Scientists & Engineers Forum, Graduate Students). [1] [2] [3] [4] [5] [6] [7] [8] [9]