Yansong Tang
I am a tenure-track Assistant Professor at the Tsinghua-Berkeley Shenzhen Institute, Tsinghua University, where I direct the IVG@SZ (Intelligent Vision Group at Shenzhen, the sister group of the IVG in Beijing). Before that, I was a postdoctoral researcher in the Department of Engineering Science at the University of Oxford, working with Prof. Philip H. S. Torr and Prof. Victor Prisacariu. My research interests lie in computer vision; currently, I work on video analytics, vision-language understanding, and 3D reconstruction.
I received my Ph.D. degree with honors from Tsinghua University, advised by Prof. Jie Zhou and Prof. Jiwen Lu, and my B.S. degree in Automation, also from Tsinghua University. I have also spent time at the Visual Computing Group of Microsoft Research Asia (MSRA) and at Prof. Song-Chun Zhu's VCLA lab at the University of California, Los Angeles (UCLA).
I am looking for self-motivated Master's/Ph.D. students and postdocs. If you have strong grades or coding skills, are highly creative, and are interested in joining my group, please do not hesitate to send me your CV and transcripts.
Email / Google Scholar / GitHub
News
2022-04: Gave a talk at MSRA about LAVT.
2022-03: Five papers accepted to CVPR 2022.
Selected Publications
(*Equal Contribution, #Corresponding Author)
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
Zhao Yang*, Jiaqi Wang*, Yansong Tang#, Kai Chen, Hengshuang Zhao, Philip H.S. Torr
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
[arXiv]
[code]
We present an end-to-end hierarchical Transformer-based network for referring segmentation.
BNV-Fusion: Dense 3D Reconstruction using Bi-level Neural Volume Fusion
Kejie Li, Yansong Tang, Victor Adrian Prisacariu, Philip H.S. Torr
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
[arXiv]
[code]
We present Bi-level Neural Volume Fusion, which leverages recent advances in neural implicit representations and neural rendering for dense 3D reconstruction. In order to incrementally integrate new depth maps into a global neural implicit representation, we propose a novel bi-level fusion strategy that considers both efficiency and reconstruction quality by design.
DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
Yongming Rao*, Wenliang Zhao*, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, Jiwen Lu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
[arXiv]
[code]
DenseCLIP is a new framework for dense prediction that implicitly and explicitly leverages the pre-trained knowledge from CLIP.
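A minimal sketch of the core pixel-text matching step, using dummy tensors and illustrative shapes (the full framework additionally refines the text embeddings via context-aware prompting):

    import torch
    import torch.nn.functional as F

    # Dummy per-pixel visual embeddings and per-class text embeddings from a
    # CLIP-style encoder pair; all shapes here are illustrative assumptions.
    pix = F.normalize(torch.randn(1, 512, 32, 32), dim=1)  # (B, C, H, W)
    txt = F.normalize(torch.randn(10, 512), dim=1)         # (K classes, C)

    # Cosine similarity between every pixel and every class description gives
    # dense per-class score maps that can supervise a segmentation head.
    score_maps = torch.einsum("bchw,kc->bkhw", pix, txt)   # (B, K, H, W)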
Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition
Yansong Tang*, Xingyu Liu*, Xumin Yu, Danyang Zhang, Jiwen Lu, and Jie Zhou
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2022
[PDF]
[code]
We devise a temporal-spatial Cubism strategy that guides the network to be aware of the permutation of segments in the temporal domain and of body parts in the spatial domain, thereby improving the generalization ability of the model for cross-dataset action recognition.
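As a rough illustration of the temporal half of this strategy (the function name and segment count below are hypothetical, not the released code):

    import itertools
    import random
    import numpy as np

    def temporal_cubism(sequence, num_segments=3):
        # sequence: (T, J, C) array of T frames, J joints, C coordinates.
        segments = np.array_split(sequence, num_segments, axis=0)
        perms = list(itertools.permutations(range(num_segments)))
        label = random.randrange(len(perms))  # which permutation is applied
        shuffled = np.concatenate([segments[i] for i in perms[label]], axis=0)
        return shuffled, label  # the network is trained to predict `label`

    # The spatial counterpart permutes groups of joints (body parts) along
    # axis=1 instead of temporal segments along axis=0.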
Comprehensive Instructional Video Analysis: The COIN Dataset and Performance Evaluation
Yansong Tang, Jiwen Lu, and Jie Zhou
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
[arXiv]
[Project Page]
Journal version of the COIN dataset.
Breaking Shortcut: Exploring Fully Convolutional Cycle-Consistency for Video Correspondence Learning
Yansong Tang*, Zhenyu Jiang*, Zhenda Xie*, Yue Cao, Zheng Zhang, Philip H. S. Torr, Han Hu
ICCV SRVU workshop, 2021
[arXiv]
[code]
We observe a collapse phenomenon when directly applying a fully convolutional cycle-consistency method to video correspondence learning, study the underlying reason behind it, and propose a spatial transformation approach to address this issue.
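A generic sketch of the round-trip objective (not the released implementation): dense features of frame A are softly matched to frame B and back, and the composite affinity is pushed toward the identity.

    import torch
    import torch.nn.functional as F

    def cycle_consistency_loss(feat_a, feat_b, tau=0.07):
        # feat_a, feat_b: (N, C) L2-normalized features at N spatial locations.
        aff_ab = F.softmax(feat_a @ feat_b.t() / tau, dim=1)  # A -> B
        aff_ba = F.softmax(feat_b @ feat_a.t() / tau, dim=1)  # B -> A
        round_trip = aff_ab @ aff_ba                          # A -> B -> A
        target = torch.arange(feat_a.size(0))                 # identity mapping
        return F.nll_loss(torch.log(round_trip + 1e-8), target)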
Hierarchical Interaction Network for Video Object Segmentation from Referring Expressions
Zhao Yang*, Yansong Tang*, Luca Bertinetto, Hengshuang Zhao, Philip H. S. Torr
British Machine Vision Conference (BMVC), 2021
[arXiv]
[code] (to come)
We present an end-to-end hierarchical interaction network for video object segmentation from referring expressions, which leverages the feature pyramid produced by the visual encoder to generate multiple levels of multi-modal features.
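For intuition only, fusing a sentence embedding with each level of a visual feature pyramid might look as follows (shapes and the fusion operator are illustrative assumptions, not the paper's architecture):

    import torch

    lang = torch.randn(1, 256)                             # sentence embedding
    pyramid = [torch.randn(1, 256, s, s) for s in (64, 32, 16, 8)]

    # Broadcast the language feature over every spatial location at each
    # level, yielding one multi-modal feature map per pyramid level.
    fused = [v * lang[:, :, None, None] for v in pyramid]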
Uncertainty-aware Score Distribution Learning for Action Quality Assessment
Yansong Tang*, Zanlin Ni*, Jiahuan Zhou, Danyang Zhang, Jiwen Lu, Ying Wu, and Jie Zhou
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
Oral Presentation
[arXiv]
[code]
We propose an uncertainty-aware score distribution learning method and extend it to a multi-path model for action quality assessment.
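A hedged sketch of the score-distribution idea (the bin discretization and names are illustrative): the scalar ground-truth score is softened into a Gaussian over discrete score bins, and the predicted distribution is trained with a KL-divergence loss.

    import torch
    import torch.nn.functional as F

    def soft_label(score, bins, sigma=1.0):
        # Turn a scalar judge score into a Gaussian distribution over bins.
        g = torch.exp(-(bins - score) ** 2 / (2 * sigma ** 2))
        return g / g.sum()

    bins = torch.arange(0.0, 101.0)        # e.g., scores quantized to 0..100
    target = soft_label(83.5, bins)        # Gaussian-smoothed ground truth
    logits = torch.randn(1, bins.numel())  # stand-in for the network output
    loss = F.kl_div(F.log_softmax(logits, dim=1), target.unsqueeze(0),
                    reduction="batchmean")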
Graph Interaction Networks for Relation Transfer in Human Activity Videos
Yansong Tang, Yi Wei, Xumin Yu, Jiwen Lu, and Jie Zhou
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2020
[PDF]
[code] (to come)
We propose a graph interaction network (GIN) model for transferring relation knowledge across two graphs in two different scenarios for video analysis: a newly proposed setting for unsupervised skeleton-based action recognition across different datasets, and supervised group activity recognition with multi-modal inputs.
Learning Semantics-Preserving Attention and Contextual Interaction for Group Activity Recognition
Yansong Tang, Jiwen Lu, Zian Wang, Ming Yang, and Jie Zhou
IEEE Transactions on Image Processing (TIP), 2019
[PDF]
[Supp]
We extend our semantics-preserving attention model with a graph convolutional module for group activity recognition.
COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis
Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, Jie Zhou
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019
[arXiv]
[Project Page]
[Annotation Tool]
COIN is one of the largest and most comprehensive instructional video analysis datasets with rich annotations.
Multi-stream Deep Neural Networks for RGB-D Egocentric Action Recognition
Yansong Tang, Zian Wang, Jiwen Lu, Jianjiang Feng, and Jie Zhou
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2019
[PDF]
[Project Page]
[code]
We propose a multi-stream deep neural network, along with the THU-READ dataset, for RGB-D egocentric action recognition.
Mining Semantics-Preserving Attention for Group Activity Recognition
Yansong Tang, Zian Wang, Peiyang Li, Jiwen Lu, Ming Yang, and Jie Zhou
ACM Multimedia (MM), 2018
Oral Presentation
[PDF]
We present a simple yet effective semantics-preserving attention module for group activity recognition.
Deep Progressive Reinforcement Learning for Skeleton-based Action Recognition
Yansong Tang*, Yi Tian*, Jiwen Lu, Peiyang Li, and Jie Zhou
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018
[PDF]
We propose a simple yet effective method to select key frames for skeleton-based action recognition using the REINFORCE algorithm.
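A self-contained toy version of the key idea, with dummy features and a stand-in reward (not the paper's code): a policy scores frames, samples a subset, and the recognition reward weights the REINFORCE gradient.

    import torch

    T, K = 32, 5                                 # clip length, key frames to pick
    feats = torch.randn(T, 256)                  # dummy per-frame features
    policy = torch.nn.Linear(256, 1)             # scores each frame

    logits = policy(feats).squeeze(-1)           # (T,) frame-selection logits
    dist = torch.distributions.Categorical(logits=logits)
    idx = dist.sample((K,))                      # sample K key frames
    reward = torch.rand(())                      # stand-in recognition reward
    loss = -(dist.log_prob(idx).sum() * reward)  # REINFORCE objective
    loss.backward()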
Selected Honors and Awards
Excellent Doctoral Dissertation Award of CAAI, 2021.
Excellent PhD Graduate of Beijing, 2020.
Excellent Doctoral Dissertation of Tsinghua University, 2020.
Zijing Scholar Fellowship for Prospective Researcher, Tsinghua University, 2020.
Academic Services
Area Chair: FG 2023
Conference Reviewer: CVPR, ICCV, ECCV, AAAI, among others
Journal Reviewer: TPAMI, TIP, TMM, TCSVT, among others