Alexander Kolesnikov is an artificial intelligence researcher specializing in computer vision, deep representation learning, and transfer learning. He is noted for his contributions to influential models such as the Vision Transformer (ViT) and his work at major AI labs, including Google, OpenAI, and Meta Superintelligence Labs.
Kolesnikov pursued his doctoral studies at the Institute of Science and Technology (IST) Austria, where he was enrolled as a PhD student from 2013 to 2018. Under the supervision of Christoph H. Lampert, his research focused on computer vision, transfer learning, and deep representation learning, areas that have remained central to his subsequent career. [1] [9]
After completing his PhD in 2018, Kolesnikov joined Google as a researcher, working within its Google Brain and DeepMind divisions for approximately seven years. During this time, he was involved in the development of several significant projects in the field of computer vision. His work at Google included contributions to the Vision Transformer (ViT), MLP-Mixer, and the big_vision open-source codebase, which became a platform for large-scale vision research.
In December 2024, Kolesnikov announced his departure from Google to join OpenAI. He, along with colleagues Xiaohua Zhai and Lucas Beyer, was tasked with establishing a new OpenAI office in Zurich, Switzerland.
His tenure at OpenAI was brief. In June 2025, it was reported that Meta Platforms had hired Kolesnikov, Lucas Beyer, and Xiaohua Zhai from OpenAI's Zurich office. The team was recruited to join Meta's superintelligence development efforts. [1] [3] [9] [10] [11]
Kolesnikov has been a key author and contributor to numerous influential research papers and open-source projects that have advanced the field of computer vision and AI.
Kolesnikov was part of the Google research team that developed the Vision Transformer (ViT), an architecture that applied the Transformer model, originally successful in natural language processing, to computer vision tasks. The ViT model processes images by splitting them into patches and treating them as a sequence, similar to how words are handled in a sentence. This approach demonstrated that a pure Transformer architecture could achieve state-of-the-art results on image classification tasks, challenging the long-standing dominance of convolutional neural networks (CNNs). In October 2020, Kolesnikov announced the public release of pre-trained ViT models and the corresponding code for fine-tuning and inference, which facilitated widespread adoption and further research by the AI community. [4]
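The patch-to-sequence step can be illustrated with a short sketch. The following Python example is a simplified illustration only, not the published implementation: the 16-pixel patch size and 768-dimensional embedding mirror common ViT configurations, and a random projection stands in for the learned patch-embedding weights.

```python
import numpy as np

def image_to_patch_sequence(image, patch_size=16, embed_dim=768, seed=0):
    """Split an image into non-overlapping patches and linearly project each
    flattened patch to an embedding, yielding a token sequence for a Transformer."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0

    # Cut the image into a grid of patches and flatten each patch.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)                  # (gh, gw, p, p, c)
    patches = patches.reshape(-1, patch_size * patch_size * c)  # (num_patches, p*p*c)

    # In ViT this projection is learned; a random matrix stands in for it here.
    rng = np.random.default_rng(seed)
    projection = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
    return patches @ projection

tokens = image_to_patch_sequence(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, each embedded in 768 dims
```

The resulting sequence of patch embeddings is what the Transformer encoder then processes, exactly as it would process a sequence of word embeddings.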
In May 2021, Kolesnikov was involved in the introduction of MLP-Mixer, a novel vision architecture based exclusively on multi-layer perceptrons (MLPs). The model, often referred to as "Mixer," avoids the convolutions and self-attention mechanisms that were standard in leading vision models at the time. Instead, it operates by repeatedly applying MLPs either across spatial locations (mixing information between patches) or across feature channels (mixing the features within each patch). The research demonstrated that complex, specialized architectural components were not strictly necessary to achieve strong performance on vision benchmarks. The code and pre-trained models for MLP-Mixer were also made publicly available. [5]
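A minimal sketch of this token-mixing and channel-mixing alternation is shown below. It is an illustration under simplified assumptions, not the released implementation: the dimensions are arbitrary and the layer normalization used in the actual model is omitted.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity commonly used in Mixer MLPs
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, w2):
    return gelu(x @ w1) @ w2

def mixer_block(tokens, params):
    """One simplified Mixer block: a token-mixing MLP applied across patches,
    then a channel-mixing MLP applied within each patch.
    tokens: array of shape (num_patches, channels)."""
    # Token mixing: transpose so the MLP acts along the patch dimension.
    y = tokens + mlp(tokens.T, params["tok_w1"], params["tok_w2"]).T
    # Channel mixing: the MLP acts along the channel dimension of each patch.
    return y + mlp(y, params["ch_w1"], params["ch_w2"])

rng = np.random.default_rng(0)
num_patches, channels, hidden = 196, 512, 256
params = {
    "tok_w1": rng.normal(scale=0.02, size=(num_patches, hidden)),
    "tok_w2": rng.normal(scale=0.02, size=(hidden, num_patches)),
    "ch_w1": rng.normal(scale=0.02, size=(channels, hidden)),
    "ch_w2": rng.normal(scale=0.02, size=(hidden, channels)),
}
out = mixer_block(rng.normal(size=(num_patches, channels)), params)
print(out.shape)  # (196, 512)
```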
Kolesnikov was a primary developer of big_vision, a Google research codebase designed for large-scale pre-training and transfer learning in computer vision. The repository served as the original development home for models such as ViT, MLP-Mixer, and LiT (Locked-image Tuning). He announced its public release in May 2022, highlighting its utility for research with an emphasis on training large models and evaluating their transfer capabilities across various downstream tasks. The codebase has since been used to develop and release other models, including PaliGemma. [6]
Kolesnikov has contributed to the development of vision-language models (VLMs), which are designed to understand and process information from both images and text. In May 2024, he announced the release of PaliGemma-3B, a VLM based on Google's Gemma architecture. The model was made available through various platforms, including GitHub, Google Colab, Kaggle, Hugging Face, and Vertex AI, to encourage fine-tuning for specific applications. His work in this area also includes contributions to PaLI-3, another line of vision-language models. [7] [1]
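As an illustration of how such a released checkpoint is typically loaded, the sketch below uses the Hugging Face transformers library. The PaliGemmaForConditionalGeneration class and the google/paligemma-3b-mix-224 checkpoint name are assumptions based on publicly documented usage and may differ from what a given transformers version exposes.

```python
# Hedged sketch: assumes a recent transformers release provides
# PaliGemmaForConditionalGeneration and that the checkpoint name below exists.
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image

model_id = "google/paligemma-3b-mix-224"   # assumed checkpoint identifier
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")           # any local RGB image
inputs = processor(text="caption en", images=image, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=30)

output_ids = generated[0][inputs["input_ids"].shape[-1]:]  # drop the prompt tokens
print(processor.decode(output_ids, skip_special_tokens=True))
```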
In 2023, Kolesnikov co-authored research exploring the use of policy gradient methods, a technique from reinforcement learning (RL), to fine-tune computer vision models. The study, titled "Tuning Computer Vision Models With Task Rewards," demonstrated that this approach could directly optimize for complex, non-differentiable metrics such as mean Average Precision (mAP) or Panoptic Quality (PQ). This method led to significant performance improvements on tasks like object detection and panoptic segmentation, offering an alternative to traditional loss-based training. [8] [10]
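The underlying idea can be sketched with a generic REINFORCE-style estimator: sample outputs from the model, score them with the non-differentiable task metric, and weight the log-likelihood gradient by the baseline-corrected reward. The snippet below is a toy illustration of that estimator over a single categorical output, not the method or code from the paper; the reward function is a placeholder standing in for a metric such as mAP.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_gradient(logits, reward_fn, num_samples=8, rng=None):
    """REINFORCE-style estimate of d E[reward] / d logits for a categorical
    output. reward_fn may be non-differentiable (e.g. a task metric computed
    on decoded predictions)."""
    rng = rng or np.random.default_rng(0)
    probs = softmax(logits)
    samples = rng.choice(len(logits), size=num_samples, p=probs)
    rewards = np.array([reward_fn(int(k)) for k in samples])
    baseline = rewards.mean()                 # variance-reduction baseline

    grad = np.zeros_like(logits)
    for k, r in zip(samples, rewards):
        one_hot = np.zeros_like(logits)
        one_hot[k] = 1.0
        grad += (r - baseline) * (one_hot - probs)   # (r - b) * grad log p(k)
    return grad / num_samples

# Toy usage: reward output index 2; gradient ascent pushes probability toward it.
logits = np.zeros(4)
for _ in range(200):
    logits += 0.5 * reinforce_gradient(logits, reward_fn=lambda k: float(k == 2))
print(np.round(softmax(logits), 3))  # probability mass concentrates on index 2
```

In the paper's setting the sampled outputs are full model predictions (e.g. sets of detection boxes) rather than a single categorical choice, but the same estimator allows the reward signal to flow into training without requiring the metric itself to be differentiable.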
In a presentation for the IARAI Research channel on October 4, 2021, Alexander Kolesnikov discussed alternative architectures to Convolutional Neural Networks (CNNs), which have been widely used in computer vision for nearly a decade.
He outlined two models introduced in recent research: the Vision Transformer (ViT) and the MLP-Mixer. The Vision Transformer applies the Transformer framework, originally developed for natural language processing, to image analysis by dividing images into patches. This structure removes the locality constraint inherent to CNNs and enables global attention from the earliest layers.
The MLP-Mixer was presented as a simpler design, based solely on multilayer perceptron (MLP) layers. It alternates between mixing information across image patches and across channels, without using convolution or self-attention mechanisms. Despite its simplified structure, it achieved competitive results in several vision tasks.
According to Kolesnikov, these models suggest that strict locality is not a necessary condition for effective vision architectures. He emphasized the role of large-scale pretraining, the adaptability of models such as ViT and MLP-Mixer, and the potential application of these approaches to tasks beyond image classification. He also noted that ongoing research continues to explore architectural design, regularization strategies, self-supervised learning, and extensions to tasks such as segmentation and detection. [12]