Johan Schalkwyk
Johan Schalkwyk is a computer scientist known for his work in artificial intelligence, particularly in speech technology and large language models. He held senior roles at Google, contributing to areas such as voice search and multimodal AI; served as a strategic advisor at Sense, focusing on AI for the energy transition; and recently joined Meta Superintelligence Labs. [1] [2] [36]
Education
Johan Schalkwyk earned a Master of Engineering (M.Eng.) degree in Robotics, with a focus on reinforcement learning, from the University of Pretoria. He completed the program in 1993 with a GPA of 4.0. [39]
Career
Johan Schalkwyk spent a significant portion of his career at Google, where he was recognized as a Google Fellow in AI. His work at Google spanned several key areas within artificial intelligence and machine learning. As the Speech Area Tech Lead, he directed strategic research investments in speech recognition and synthesis technologies. This leadership contributed to innovations such as the development of the world's first voice-enabled search experience, Google Voice Search, launched in 2008. He also played a role in advancing concepts like on-device processing and the application of neural models across various Google products, including Google Assistant and YouTube, expanding support to over 80 languages. Later, at Google DeepMind, he was involved in the development of multimodal perception and advancements in Large Language Models, including contributions to the Gemini family of models. [1] [2]
In May 2024, Schalkwyk joined Sense, a company specializing in embedded intelligence for homes and the electrical grid, as a Strategic Advisor on Artificial Intelligence. In this role, he focuses on leveraging AI and machine learning to support the global energy transition. His advisory work at Sense aims to develop new tools for utilities and consumers using data and machine learning to manage energy demand, improve efficiency, and enhance grid safety. Sense utilizes machine learning to provide consumers with real-time insights into their home energy usage and offers utilities grid intelligence for tasks such as fault identification, power flow tracking, and planning for electrification. [1] [37] [38]
Meta Superintelligence Labs
In June 2025, Mark Zuckerberg announced the creation of Meta Superintelligence Labs (MSL), a new organization within Meta Platforms focused on developing artificial superintelligence. Johan Schalkwyk was named as one of the key new team members joining the initiative. MSL was established to house various teams working on foundation models, including the Llama family of models, related products, and Fundamental Artificial Intelligence Research projects. The formation of MSL and the recruitment of top AI talent such as Schalkwyk are part of Meta's efforts to compete in the rapidly evolving AI landscape. [40] [41]
Publications
Johan Schalkwyk has co-authored numerous research papers in the field of computer science, focusing on areas such as speech recognition, natural language processing, and machine learning. His publications span several decades and appear in prominent conferences and journals.
Key publications, grouped by period, include:
- 2020–present
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. CoRR abs/2403.05530 (2024) [3]
- Coupling Speech Encoders with Downstream Text Models. CoRR abs/2407.17605 (2024) [4]
- SLM: Bridge the Thin Gap Between Speech and Text Foundation Models. ASRU 2023: 1-8 (2023) [5]
- Lego-Features: Exporting Modular Encoder Features for Streaming and Deliberation ASR. ICASSP 2023: 1-5 (2023) [6]
- Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages. CoRR abs/2303.01037 (2023) [7]
- AudioPaLM: A Large Language Model That Can Speak and Listen. CoRR abs/2306.12925 (2023) [8]
- Gemini: A Family of Highly Capable Multimodal Models. CoRR abs/2312.11805 (2023) [9]
- 2010–2019
- On lattice generation for large vocabulary speech recognition. ASRU 2017: 228-235 (2017) [10]
- Speech Research at Google to Enable Universal Speech Interfaces. New Era for Robust Speech Recognition, Exploiting Deep Learning 2017: 385-399 (2017) [11]
- Long short term memory neural network for keyboard gesture decoding. ICASSP 2015: 2076-2080 (2015) [12]
- Learning acoustic frame labeling for speech recognition with recurrent neural networks. ICASSP 2015: 4280-4284 (2015) [13]
- Voice Query Refinement. INTERSPEECH 2012: 2462-2465 (2012) [14]
- A Filter-Based Algorithm for Efficient Composition of Finite-State Transducers. Int. J. Found. Comput. Sci. 22(8): 1781-1795 (2011) [15]
- Voice search for development. INTERSPEECH 2010: 282-285 (2010) [16]
- On-demand language model interpolation for mobile speech input. INTERSPEECH 2010: 1812-1815 (2010) [17]
- Query language modeling for voice search. SLT 2010: 127-132 (2010) [18]
- Filters for Efficient Composition of Weighted Finite-State Transducers. CIAA 2010: 28-38 (2010) [19]
- 2000–2009
- OpenFst. FSMNLP 2009: 47 (2009) [20]
- Mobile media search. ICASSP 2009: 4897-4900 (2009) [21]
- Language modeling for what-with-where on GOOG-411. INTERSPEECH 2009: 991-994 (2009) [22]
- A generalized composition algorithm for weighted finite-state transducers. INTERSPEECH 2009: 1203-1206 (2009) [23]
- Semantic context effects in the recognition of acoustically unreduced and reduced words. INTERSPEECH 2009: 1867-1870 (2009) [24]
- Deploying GOOG-411: Early lessons in data, measurement, and testing. ICASSP 2008: 5260-5263 (2008) [25]
- OpenFst: A General and Efficient Weighted Finite-State Transducer Library. CIAA 2007: 11-23 (2007) [26]
- Speech recognition with dynamic grammars using finite-state transducers. INTERSPEECH 2003: 1969-1972 (2003) [27]
- 1990–1999
- Universal speech tools: the CSLU toolkit. ICSLP 1998 (1998) [28]
- Experiments with a spoken dialogue system for taking the US census. Speech Commun. 23(3): 243-260 (1997) [29]
- CSLUsh: an extendible research environment. EUROSPEECH 1997: 689-692 (1997) [30]
- Speaker verification with low storage requirements. ICASSP 1996: 693-696 (1996) [31]
- Building 10,000 spoken dialogue systems. ICSLP 1996: 709-712 (1996) [32]
- Speech recognition using syllable-like units. ICSLP 1996: 1117-1120 (1996) [33]
- Detecting an imposter in telephone speech. ICASSP (1) 1994: 169-172 (1994) [34]
- A prototype voice-response questionnaire for the U.S. census. ICSLP 1994: 683-686 (1994) [35]
His work includes contributions to the OpenFst library, a toolkit for constructing and manipulating weighted finite-state transducers, which are widely used in speech and language processing applications. [26] [20]
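The central operation these transducer papers study is composition: chaining two weighted machines so that the output of one feeds the input of the other, with path weights combined in a semiring. The sketch below is purely illustrative and is not the OpenFst API; the tuple encoding of a transducer is invented for this example, and epsilon (empty-symbol) handling, which the composition-filter papers above address, is omitted.

```python
from collections import defaultdict

def compose(t1, t2):
    """Compose two weighted transducers over the tropical semiring
    (weights add along a path). Each transducer is an invented tuple:
    (start_state, {final_state: final_weight},
     [(src, in_sym, out_sym, weight, dst), ...]).
    An arc of the result pairs a t1 arc whose output symbol matches
    a t2 arc's input symbol; no epsilon handling is attempted."""
    s1, f1, arcs1 = t1
    s2, f2, arcs2 = t2
    out1, out2 = defaultdict(list), defaultdict(list)
    for src, i, o, w, dst in arcs1:
        out1[src].append((i, o, w, dst))
    for src, i, o, w, dst in arcs2:
        out2[src].append((i, o, w, dst))

    start = (s1, s2)
    arcs, finals = [], {}
    stack, seen = [start], {start}
    while stack:  # explore only reachable pair states
        q1, q2 = stack.pop()
        if q1 in f1 and q2 in f2:
            finals[(q1, q2)] = f1[q1] + f2[q2]
        for i, o1, w1, d1 in out1[q1]:
            for i2, o2, w2, d2 in out2[q2]:
                if o1 == i2:  # symbols must match to chain the arcs
                    dst = (d1, d2)
                    arcs.append(((q1, q2), i, o2, w1 + w2, dst))
                    if dst not in seen:
                        seen.add(dst)
                        stack.append(dst)
    return start, finals, arcs

# T1 maps "a" -> "b" with weight 1; T2 maps "b" -> "c" with weight 2.
T1 = (0, {1: 0.0}, [(0, "a", "b", 1.0, 1)])
T2 = (0, {1: 0.0}, [(0, "b", "c", 2.0, 1)])
start, finals, arcs = compose(T1, T2)
print(arcs)  # a single arc mapping "a" -> "c" with weight 3.0
```

In OpenFst itself, naive pair-state composition is refined with composition filters (the subject of several papers listed above) so that epsilon transitions do not produce redundant paths.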