Haotian Tang

Haotian Tang is a computer scientist specializing in systems and machine learning (SysML). His work focuses on efficient deep learning, particularly for 3D perception and large-scale foundation models. His career includes academic research at the Massachusetts Institute of Technology (MIT) and industry positions at companies including Waymo, NVIDIA, Google DeepMind, and Meta Superintelligence Labs. ^[1] ^[2]

Education

Tang attended Shanghai Jiao Tong University (SJTU), where he graduated in 2020 with a Bachelor of Engineering degree in Computer Science and Technology. Following his graduation from SJTU, Tang enrolled at the Massachusetts Institute of Technology (MIT). He earned a Master of Science in Electrical Engineering and Computer Science in 2022 and is a Ph.D. candidate in the same department, with an expected graduation in 2025. At MIT, he is a member of the Han Lab, advised by Professor Song Han. ^[1] ^[3]

Career

Tang began his career with a software engineering internship at Agora.io in 2017. While studying at Shanghai Jiao Tong University in 2019, he undertook a research internship at Tencent, where he worked on computer vision and machine learning, and also served as a research assistant in the university's Department of Computer Science. From 2019 to 2020, he worked as a remote research intern under the guidance of MIT Professor Song Han, focusing on efficient 3D deep learning. ^[1]

In 2020, Tang commenced his Ph.D. studies at MIT, where his research centered on systems and machine learning. His work during this period led to multiple publications on topics such as 3D neural networks, hardware efficiency for sparse data, and multi-sensor fusion for autonomous systems. Alongside his academic research, Tang completed several industry internships. In 2022, he interned at OmniML, which was later acquired by NVIDIA. In 2023, he was a research intern at Waymo, where he worked on multimodal behavior prediction. He followed this with an internship at NVIDIA in 2024, where his work focused on developing efficient visual generation models. ^[1]

In early 2025, Tang joined Google DeepMind as a research scientist, contributing to large-scale pretraining for world simulation projects. Later that year, he moved to Meta as a research scientist on the Superintelligence team, where he worked on multimodal foundation models. ^[1]

Research and Publications

Tang's research addresses efficiency and performance challenges in deep learning systems. His work spans the co-design of algorithms and systems for large language models (LLMs), 3D point cloud processing for autonomous driving, and multi-sensor fusion. ^[1]

Efficient Large Language Models

A significant portion of Tang's research has been dedicated to making large language models more efficient for both inference and fine-tuning.

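AWQ (Activation-aware Weight Quantization): Tang was the system co-lead for the AWQ project, which introduced a hardware-friendly, low-bit, weight-only quantization method for LLMs. The method rests on the observation that protecting a small fraction of salient weights can substantially reduce quantization error, and that these weights are best identified by the magnitude of the activations they multiply rather than by the weights themselves. AWQ protects them by scaling the corresponding input channels before quantization instead of keeping them in higher precision, and it requires neither backpropagation nor reconstruction. This work received the Best Paper Award at MLSys 2024. ^[1]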
QServe: As a system design lead, Tang contributed to QServe, an inference engine designed for efficient cloud-based LLM serving. QServe utilizes a W4A8KV4 quantization scheme (4-bit weights, 8-bit activations, 4-bit KV cache) to accelerate inference. The system incorporates techniques like compute-aware weight reordering and fused attention to reduce dequantization overhead and memory bandwidth, reportedly allowing less expensive GPUs to match the throughput of higher-end hardware. ^[1]
LongLoRA: This project presented an efficient method for fine-tuning LLMs to handle longer context windows. During fine-tuning, it replaces full self-attention, whose cost scales quadratically with context length, with shifted sparse attention: attention is computed within local groups of tokens, and for half of the attention heads the groups are shifted by half a group so that information can flow between neighboring groups. This allows models to be extended to much longer context windows with limited computational resources. ^[1]
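To make the AWQ idea above concrete, the following is a minimal pure-Python sketch of activation-aware scaling followed by group-wise 4-bit quantization. It is not the AWQ authors' implementation: the helper names (`quantize_group`, `quantize_weight`, `awq_search`), the asymmetric INT4 scheme, the group size of 4, and the grid search over an exponent `alpha` are illustrative assumptions. Multiplying input channel j of the weight matrix by s_j and dividing the matching activation by s_j leaves the unquantized product unchanged while changing how rounding error is distributed across channels; the search keeps whichever exponent gives the lowest output error on calibration data.

```python
import random

# Illustrative AWQ-style sketch (assumptions: pure-Python lists, asymmetric INT4
# group-wise quantization, simple grid search over alpha; not the official code).


def quantize_group(values, n_bits=4):
    """Asymmetric uniform quantization of one group; returns dequantized floats."""
    q_max = 2 ** n_bits - 1
    lo, hi = min(values), max(values)
    if hi == lo:  # degenerate group: nothing to spread over the integer range
        return list(values)
    scale = (hi - lo) / q_max
    zero = round(-lo / scale)
    return [(min(max(round(v / scale) + zero, 0), q_max) - zero) * scale for v in values]


def quantize_weight(weight, group_size, n_bits=4):
    """Group-wise quantization along the input dimension of an [out][in] matrix."""
    return [
        [q for g in range(0, len(row), group_size)
         for q in quantize_group(row[g:g + group_size], n_bits)]
        for row in weight
    ]


def matvec(weight, x):
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def output_error(weight, q_weight, scales, calib):
    """Mean squared error between W x and Q(W * s) (x / s) over calibration inputs."""
    total, count = 0.0, 0
    for x in calib:
        ref = matvec(weight, x)
        approx = matvec(q_weight, [v / s for v, s in zip(x, scales)])
        total += sum((a - b) ** 2 for a, b in zip(ref, approx))
        count += len(ref)
    return total / count


def awq_search(weight, calib, group_size=4, n_grid=20):
    """Pick alpha so per-channel scales s_j = mean|x_j| ** alpha minimize output error."""
    n_in = len(weight[0])
    act_mag = [sum(abs(x[j]) for x in calib) / len(calib) for j in range(n_in)]
    best = None
    for i in range(n_grid + 1):
        alpha = i / n_grid  # alpha = 0 makes every scale 1 (plain round-to-nearest)
        scales = [max(m, 1e-8) ** alpha for m in act_mag]
        scaled = [[w * s for w, s in zip(row, scales)] for row in weight]
        q_weight = quantize_weight(scaled, group_size)
        err = output_error(weight, q_weight, scales, calib)
        if best is None or err < best[0]:
            best = (err, alpha, scales, q_weight)
    return best


rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
# Input channel 2 carries much larger activations, so its weights are "salient".
calib = [[rng.gauss(0, 1) * (10.0 if j == 2 else 1.0) for j in range(8)] for _ in range(16)]
rtn_err = output_error(W, quantize_weight(W, 4), [1.0] * 8, calib)
awq_err, alpha, _, _ = awq_search(W, calib)
print(f"round-to-nearest MSE {rtn_err:.5f}, activation-aware MSE {awq_err:.5f} at alpha={alpha}")
```

Because `alpha = 0` reproduces plain round-to-nearest exactly, this toy search can never do worse than round-to-nearest on its own calibration set; that is a property of the sketch, not a claim about AWQ's published results.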
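The memory side of the W4A8KV4 scheme described for QServe can be illustrated with simple arithmetic. The sketch below compares weight and KV-cache storage at 16-bit precision against 4-bit weights and a 4-bit KV cache. The `ModelShape` dimensions, batch size, and sequence length are hypothetical round numbers, and per-group scales and zero points are ignored, so the totals are approximations rather than figures from the QServe work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical decoder-only transformer dimensions (illustrative only)."""
    n_params: float
    n_layers: int
    n_kv_heads: int
    head_dim: int


def weight_bytes(shape, weight_bits):
    """Storage for all weights at the given bit width (ignores scales/zero points)."""
    return shape.n_params * weight_bits / 8


def kv_bytes_per_token(shape, kv_bits):
    """One key and one value vector per KV head per layer, hence the factor of 2."""
    return 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * kv_bits / 8


def kv_cache_bytes(shape, kv_bits, batch, seq_len):
    """KV-cache storage for a batch of sequences of equal length."""
    return kv_bytes_per_token(shape, kv_bits) * batch * seq_len


shape = ModelShape(n_params=8e9, n_layers=32, n_kv_heads=8, head_dim=128)
for name, w_bits, kv_bits in [("FP16", 16, 16), ("W4A8KV4", 4, 4)]:
    w_gb = weight_bytes(shape, w_bits) / 1e9
    kv_gb = kv_cache_bytes(shape, kv_bits, batch=64, seq_len=2048) / 1e9
    print(f"{name}: weights {w_gb:.1f} GB, KV cache {kv_gb:.1f} GB")
```

For these assumed dimensions the script prints 16.0 GB of weights and 17.2 GB of KV cache at FP16, versus 4.0 GB and 4.3 GB at W4A8KV4. The 8-bit activations do not appear in these storage totals; they mainly affect the arithmetic of the matrix multiplications.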
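The grouping behind LongLoRA's shifted sparse attention can be sketched by listing which token positions share an attention group. The function names and the toy sizes below are assumptions for illustration; the sketch covers only the token grouping and the resulting count of query-key scores, not causal masking, per-head tensors, or the LoRA adapters themselves.

```python
def s2_attn_groups(seq_len, group_size, shifted):
    """Token-index groups inside which attention is computed for one set of heads.

    Unshifted heads attend within contiguous groups [0, G), [G, 2G), ...
    Shifted heads first roll the sequence left by G // 2, so their groups
    straddle the unshifted group boundaries (the last group wraps around).
    """
    if seq_len % group_size or group_size % 2:
        raise ValueError("seq_len must be a multiple of an even group_size")
    order = list(range(seq_len))
    if shifted:
        half = group_size // 2
        order = order[half:] + order[:half]
    return [order[i:i + group_size] for i in range(0, seq_len, group_size)]


def attention_score_count(seq_len, group_size):
    """Query-key pairs scored by full attention versus grouped attention."""
    full = seq_len * seq_len
    grouped = (seq_len // group_size) * group_size * group_size  # = seq_len * group_size
    return full, grouped


print(s2_attn_groups(8, 4, shifted=False))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(s2_attn_groups(8, 4, shifted=True))   # [[2, 3, 4, 5], [6, 7, 0, 1]]
print(attention_score_count(8192, 2048))    # (67108864, 16777216)
```

Grouped attention scores `seq_len * group_size` pairs instead of `seq_len ** 2`, so the cost grows linearly with context length for a fixed group size; the shifted half of the heads is what lets neighboring groups exchange information.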

3D Deep Learning and Point Clouds

Tang has also worked extensively on optimizing deep learning models for sparse and irregular 3D point cloud data, which is crucial for applications such as autonomous driving and augmented reality.

TorchSparse and TorchSparse++: Tang was a lead author of TorchSparse and its successor, TorchSparse++, high-performance GPU libraries that accelerate the sparse convolution operations common in point cloud processing. TorchSparse++ introduced a sparse kernel generator and a sparse autotuner that select optimized dataflows for both training and inference, achieving significant speedups over existing libraries such as MinkowskiEngine and SpConv. ^[1]
BEVFusion: This work introduced a framework for multi-task, multi-sensor fusion that unifies features from different sensors, such as cameras and LiDAR, into a shared bird's-eye view (BEV) representation. By preserving both geometric and semantic information in this unified space, BEVFusion improved performance on 3D object detection and map segmentation tasks while reducing computational costs. ^[1]
SPVNAS and PVCNN: In his earlier work, Tang co-authored papers on new 3D neural network primitives. Point-Voxel CNN (PVCNN) combined the low memory footprint of point-based representations with the regular memory access and locality of voxel-based convolutions. Building on this, Sparse Point-Voxel Convolution (SPVConv) was paired with a 3D neural architecture search (3D-NAS) framework to automatically discover efficient and accurate network architectures for 3D scene understanding; the resulting SPVNAS models achieved strong results on benchmarks such as SemanticKITTI. ^[1]
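The sparse convolution workloads that libraries such as TorchSparse accelerate can be illustrated with a toy gather-matmul-scatter implementation: build a kernel map of (input, output) index pairs for each kernel offset, then gather inputs, multiply by that offset's weight matrix, and scatter-add into outputs. This is a minimal sketch under stated assumptions (submanifold-style outputs at the input coordinates, Python lists for features, hypothetical helper names `build_kernel_map` and `sparse_conv`); it is not TorchSparse's API or its optimized kernels.

```python
from itertools import product


def build_kernel_map(coords, kernel_size=3):
    """Map each kernel offset to (input_index, output_index) pairs.

    Submanifold-style: output sites are the input coordinates, and output i
    reads input j when coords[j] == coords[i] + offset.
    """
    index = {c: i for i, c in enumerate(coords)}
    r = kernel_size // 2
    kmap = {}
    for offset in product(range(-r, r + 1), repeat=3):
        pairs = []
        for out_i, (x, y, z) in enumerate(coords):
            j = index.get((x + offset[0], y + offset[1], z + offset[2]))
            if j is not None:
                pairs.append((j, out_i))
        if pairs:
            kmap[offset] = pairs
    return kmap


def sparse_conv(coords, feats, weights, kernel_size=3):
    """Gather-matmul-scatter sparse convolution over occupied sites only.

    feats: one C_in-length list per coordinate.
    weights: dict mapping each offset to a C_in x C_out matrix (list of rows).
    """
    c_out = len(next(iter(weights.values()))[0])
    out = [[0.0] * c_out for _ in coords]
    for offset, pairs in build_kernel_map(coords, kernel_size).items():
        w = weights[offset]
        for in_i, out_i in pairs:
            f = feats[in_i]  # gather the input feature
            for co in range(c_out):  # multiply by this offset's weights, scatter-add
                out[out_i][co] += sum(f[ci] * w[ci][co] for ci in range(len(f)))
    return out


coords = [(0, 0, 0), (1, 0, 0), (5, 5, 5)]
feats = [[1.0], [2.0], [3.0]]
ones = {o: [[1.0]] for o in product(range(-1, 2), repeat=3)}
print(sparse_conv(coords, feats, ones))  # [[3.0], [3.0], [3.0]]
```

Only 3 of the 27 kernel offsets produce any pairs here, which shows why sparse convolution does far less work than a dense 3D convolution on mostly empty space; the remaining cost lies in the irregular gather and scatter that specialized GPU libraries target.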
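The shared bird's-eye-view representation used by BEVFusion can be illustrated by pooling 3D points into a 2D grid and concatenating two sensors' BEV maps channel-wise. In this sketch the grid ranges, cell size, toy features, and helper names (`bev_pool`, `fuse_concat`) are assumptions, and camera features are taken as already lifted to 3D. BEVFusion additionally processes the fused map with convolutional layers and implements BEV pooling as an optimized GPU kernel; both are omitted here.

```python
import math


def bev_pool(points, feats, x_range, y_range, cell_size):
    """Sum-pool per-point features into an [nx][ny] bird's-eye-view grid.

    The height (z) coordinate is ignored, which collapses each vertical column
    into one BEV cell; points outside the x/y ranges are dropped.
    """
    nx = int((x_range[1] - x_range[0]) / cell_size)
    ny = int((y_range[1] - y_range[0]) / cell_size)
    channels = len(feats[0])
    grid = [[[0.0] * channels for _ in range(ny)] for _ in range(nx)]
    for (x, y, _z), f in zip(points, feats):
        ix = math.floor((x - x_range[0]) / cell_size)
        iy = math.floor((y - y_range[0]) / cell_size)
        if 0 <= ix < nx and 0 <= iy < ny:
            cell = grid[ix][iy]
            for k, v in enumerate(f):
                cell[k] += v
    return grid


def fuse_concat(bev_a, bev_b):
    """Concatenate the channels of two BEV maps defined on the same grid."""
    return [[a + b for a, b in zip(col_a, col_b)] for col_a, col_b in zip(bev_a, bev_b)]


# Toy camera features already lifted to 3D, plus a toy LiDAR BEV map on the same grid.
cam_points = [(0.5, 0.5, 0.0), (0.5, 0.5, 3.0), (1.5, 0.2, 1.0), (5.0, 5.0, 0.0)]
cam_feats = [[1.0], [2.0], [4.0], [8.0]]
cam_bev = bev_pool(cam_points, cam_feats, (0.0, 2.0), (0.0, 2.0), 1.0)
lidar_bev = [[[1.0], [1.0]], [[1.0], [1.0]]]
print(fuse_concat(cam_bev, lidar_bev))  # [[[3.0, 1.0], [0.0, 1.0]], [[4.0, 1.0], [0.0, 1.0]]]
```

The two points at different heights above cell (0, 0) are merged into one BEV cell, and the out-of-range point is dropped; once both sensors share the grid, fusion reduces to per-cell channel concatenation.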
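The point-voxel idea behind PVCNN can be sketched as two branches: a coarse branch that averages point features into voxels and maps them back to the points, and a fine branch that transforms each point's own feature. This is a simplified illustration, not PVCNN's code: the helper names are hypothetical, the voxel-grid convolutions are omitted, devoxelization uses a nearest-voxel lookup in place of PVCNN's trilinear interpolation, a single scalar `point_weight` stands in for the per-point MLP, and the branches are fused by addition in this sketch.

```python
import math


def voxel_of(point, voxel_size):
    """Integer voxel coordinates containing a 3D point."""
    return tuple(math.floor(c / voxel_size) for c in point)


def voxelize(points, feats, voxel_size):
    """Average the features of all points that fall into the same voxel."""
    sums, counts = {}, {}
    for p, f in zip(points, feats):
        v = voxel_of(p, voxel_size)
        acc = sums.setdefault(v, [0.0] * len(f))
        for k, x in enumerate(f):
            acc[k] += x
        counts[v] = counts.get(v, 0) + 1
    return {v: [x / counts[v] for x in acc] for v, acc in sums.items()}


def devoxelize(points, voxel_feats, voxel_size):
    """Nearest-voxel lookup back to points (a simplification of trilinear interpolation)."""
    return [voxel_feats[voxel_of(p, voxel_size)] for p in points]


def point_voxel_block(points, feats, voxel_size, point_weight=1.0):
    """Coarse voxel branch plus fine per-point branch, fused by addition."""
    voxel_feats = voxelize(points, feats, voxel_size)
    # A real point-voxel block would run 3D convolutions on the voxel grid here.
    coarse = devoxelize(points, voxel_feats, voxel_size)
    fine = [[point_weight * x for x in f] for f in feats]  # stand-in for a shared MLP
    return [[a + b for a, b in zip(c, d)] for c, d in zip(coarse, fine)]


pts = [(0.1, 0.1, 0.1), (0.4, 0.2, 0.3), (1.2, 0.0, 0.0)]
fts = [[1.0], [3.0], [5.0]]
print(point_voxel_block(pts, fts, voxel_size=1.0))  # [[3.0], [5.0], [10.0]]
```

The first two points share a voxel, so each receives the neighborhood average (2.0) plus its own feature; the voxel branch supplies context, while the point branch preserves per-point detail that the coarse grid would blur.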

REFERENCES
