Yansong Tang

I am a tenure-track Assistant Professor at Shenzhen International Graduate School, Tsinghua University, where I direct the IVG@SZ (Intelligent Vision Group at Shenzhen, the sister group of the IVG in Beijing). Before that, I was a postdoctoral researcher in the Department of Engineering Science at the University of Oxford, working with Prof. Philip H. S. Torr. My current research interests lie in computer vision and pattern recognition.

I received my B.S. and Ph.D. degrees with honors from Tsinghua University, advised by Prof. Jie Zhou and Prof. Jiwen Lu. I have also spent time at Prof. Song-Chun Zhu's lab at the University of California, Los Angeles (UCLA), and at Microsoft Research Asia (MSRA), hosted by Dr. Han Hu and Dr. Xin Tong, respectively.

I am looking for self-motivated Master's students, Ph.D. students, and postdocs. If you have strong grades or coding skills, are highly creative, and are interested in joining my group, please do not hesitate to send me your CV and transcripts by email after reading this file.

News

  • 2024-07: We are organizing "VENUE", the only full-day tutorial at ECCV 2024: Recent Advances in Video Content Understanding and Generation.
  • 2024-06: We won the Long-form Video Question Answering Challenge of the CVPR 2024 LOVEU Workshop.
  • 2024-06: We organized the "MANGO" Workshop at CVPR 2024: New Trends in Multimodal Human Action Perception, Understanding and Generation. The recorded video will be released soon.
  • 2024-03: Ten papers were accepted to CVPR 2024, including an oral presentation and a highlight poster.
Recent Selected Publications [Full List]

    (*Equal Contribution, #Corresponding Author)

    Language-Aware Vision Transformer for Referring Segmentation
    Zhao Yang*, Jiaqi Wang*, Xubing Ye*, Yansong Tang#, Kai Chen, Hengshuang Zhao, Philip H.S. Torr
    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
    [Paper] [Code] [Conference Version]

    We propose LAVT, a Transformer-based universal referring image and video segmentation (RIS and RVOS) framework that performs language-aware visual encoding in place of cross-modal fusion after feature extraction.

    Comprehensive Instructional Video Analysis: The COIN Dataset and Performance Evaluation
    Yansong Tang, Jiwen Lu, and Jie Zhou
    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
    [arXiv] [Project Page] [Chinese Blog] [Conference Version]

    COIN is currently the largest and most comprehensive instructional video analysis dataset, with rich annotations.

    GeoLRM: Geometry-Aware Large Reconstruction Model for High-Quality 3D Gaussian Generation
    Chubin Zhang, Hongliang Song, Yi Wei, Chen Yu, Jiwen Lu, Yansong Tang#
    Conference on Neural Information Processing Systems (NeurIPS), 2024
    [arXiv] [Code] [Project Page]

    This paper proposes a geometry-aware large reconstruction model for sparse-view reconstruction and 3D generation.

    Q-VLM: Post-training Quantization for Large Vision-Language Models
    Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang#, Jie Zhou, Jiwen Lu
    Conference on Neural Information Processing Systems (NeurIPS), 2024
    [arXiv] [Code]

    This paper achieves efficient inference and memory savings for large vision-language models via post-training quantization.

    ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation
    Guanxing Lu, Shiyi Zhang, Ziwei Wang, Changliu Liu, Jiwen Lu and Yansong Tang
    The European Conference on Computer Vision (ECCV), 2024
    [arXiv] [Code] [Project Page]

    We propose a dynamic Gaussian Splatting method named ManiGaussian for multi-task robotic manipulation, which mines scene dynamics via future scene reconstruction.

    MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model
    Wenxun Dai, Linghao Chen, Jingbo Wang, Jinpeng Liu, Bo Dai, Yansong Tang
    The European Conference on Computer Vision (ECCV), 2024
    [arXiv] [Code] [Project Page]

    We introduce MotionLCM, which extends controllable motion generation to real time.

    Plan, Posture and Go: Towards Open-Vocabulary Text-to-Motion Generation
    Jinpeng Liu*, Wenxun Dai*, Chunyu Wang*, Yiji Cheng, Yansong Tang#, Xin Tong
    The European Conference on Computer Vision (ECCV), 2024
    [arXiv] [Project Page]

    We present a divide-and-conquer framework named PRO-Motion, which consists of three modules: a motion planner, a posture-diffuser, and a go-diffuser.

    Segment and Caption Anything
    Xiaoke Huang, Jianfeng Wang, Yansong Tang#, Zheng Zhang, Han Hu, Jiwen Lu, Lijuan Wang, Zicheng Liu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
    [arXiv] [Code] [Project Page] [Tsinghua Twitter]

    We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions by introducing a lightweight query-based feature mixer.

    Universal Segmentation at Arbitrary Granularity with Language Instruction
    Yong Liu, Cairong Zhang, Yitong Wang, Jiahao Wang, Yujiu Yang, Yansong Tang#
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
    [arXiv] [Code]

    We propose a unified framework to achieve universal segmentation at a wide spectrum of granularities and levels.

    Open-Vocabulary Segmentation with Semantic-Assisted Calibration
    Yong Liu*, Sule Bai*, Guanbin Li, Yitong Wang, Yansong Tang#
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
    [arXiv] [Code]

    We propose an open-vocabulary segmentation (OVS) method that calibrates the in-vocabulary and domain-biased embedding space with the generalized contextual prior of CLIP.

    Towards Accurate Data-free Quantization for Diffusion Models
    Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang#, Jie Zhou, Jiwen Lu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
    Highlight
    [arXiv] [Code]

    We propose an accurate data-free post-training quantization framework for diffusion models (ADP-DM) for efficient image generation.

    Narrative Action Evaluation with Prompt-Guided Multimodal Interaction
    Shiyi Zhang*, Sule Bai*, Guangyi Chen, Lei Chen, Jiwen Lu, Junle Wang, Yansong Tang#
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
    [arXiv] [Code]

    We investigate a new problem called narrative action evaluation (NAE) and propose a prompt-guided multimodal interaction framework.

    DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery
    Yixuan Zhu*, Ao Li*, Yansong Tang#, Wenliang Zhao, Jie Zhou, Jiwen Lu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
    [arXiv] [Code] [Project Page]

    We propose a new method that exploits diffusion priors for human mesh recovery (HMR) in occluded and crowded scenarios.

    FlowIE: Efficient Image Enhancement via Rectified Flow
    Yixuan Zhu, Wenliang Zhao, Ao Li, Yansong Tang, Jie Zhou, Jiwen Lu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
    Oral Presentation
    [arXiv] [Code]

    We propose a unified framework for various efficient image enhancement tasks with generative diffusion priors.

    MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory
    Yinan Liang, Ziwei Wang, Xiuwei Xu, Yansong Tang#, Jie Zhou, Jiwen Lu
    Conference on Neural Information Processing Systems (NeurIPS), 2023
    [arXiv] [Code] [Chinese Blog]

    We propose a hardware-algorithm co-optimization method called MCUFormer to deploy vision transformers on microcontrollers with extremely limited memory, jointly designing the transformer architecture and constructing the inference operator library to fit the memory resource constraint.

    Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
    Guangyi Chen*, Xiao Liu*, Guangrun Wang, Kun Zhang, Philip H.S. Torr, Xiao-Ping Zhang, Yansong Tang#
    IEEE International Conference on Computer Vision (ICCV), 2023
    [arXiv] [Project Page]

    We present Tem-Adapter, a method that improves VQA by leveraging image-based knowledge and introducing temporal and semantic aligners.

    Skip-Plan: Procedure Planning in Instructional Videos via Condensed Action Space Learning
    Zhiheng Li, Wenjia Geng, Muheng Li, Lei Chen, Yansong Tang#, Jiwen Lu, Jie Zhou
    IEEE International Conference on Computer Vision (ICCV), 2023
    [arXiv] [Code]

    We propose Skip-Plan, a condensed action space learning method for procedure planning in instructional videos.

    FLAG3D: A 3D Fitness Activity Dataset with Language Instruction
    Yansong Tang*, Jinpeng Liu*, Aoyang Liu*, Bin Yang, Wenxun Dai, Yongming Rao, Jiwen Lu, Jie Zhou, Xiu Li
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
    [arXiv] [Project Page]

    We present FLAG3D, a large-scale 3D fitness activity dataset with language instruction.

    LOGO: A Long-Form Video Dataset for Group Action Quality Assessment
    Shiyi Zhang, Wenxun Dai, Sujia Wang, Xiangwei Shen, Jiwen Lu, Jie Zhou, Yansong Tang#
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
    [PDF] [Project Page]

    LOGO is a new multi-person long-form video dataset for action quality assessment.

    HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions
    Yongming Rao*, Wenliang Zhao*, Yansong Tang, Jie Zhou, Ser-Nam Lim, Jiwen Lu
    Conference on Neural Information Processing Systems (NeurIPS), 2022
    [arXiv] [Code] [Project Page] [Chinese Blog]

    HorNet is a family of generic vision backbones that perform explicit high-order spatial interactions based on Recursive Gated Convolution.

    OrdinalCLIP: Learning Probabilistic Ordinal Embeddings for Uncertainty-Aware Regression
    Wanhua Li*, Xiaoke Huang*, Zheng Zhu, Yansong Tang, Xiu Li, Jiwen Lu, Jie Zhou
    Conference on Neural Information Processing Systems (NeurIPS), 2022
    [arXiv] [Code] [Project Page] [Chinese Blog]

    We present a language-powered paradigm for ordinal regression.

    LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
    Zhao Yang*, Jiaqi Wang*, Yansong Tang#, Kai Chen, Hengshuang Zhao, Philip H.S. Torr
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
    [arXiv] [Code] [Chinese Blog]

    We present an end-to-end hierarchical Transformer-based network for referring segmentation.

    BNV-Fusion: Dense 3D Reconstruction using Bi-level Neural Volume Fusion
    Kejie Li, Yansong Tang, Victor Adrian Prisacariu, Philip H.S. Torr
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
    [arXiv] [Code] [Chinese Blog]

    We present Bi-level Neural Volume Fusion, which leverages recent advances in neural implicit representations and neural rendering for dense 3D reconstruction. In order to incrementally integrate new depth maps into a global neural implicit representation, we propose a novel bi-level fusion strategy that considers both efficiency and reconstruction quality by design.

    DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
    Yongming Rao*, Wenliang Zhao*, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, Jiwen Lu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
    [arXiv] [Code] [Project Page] [Chinese Blog]

    DenseCLIP is a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.

    Uncertainty-aware Score Distribution Learning for Action Quality Assessment
    Yansong Tang*, Zanlin Ni*, Jiahuan Zhou, Danyang Zhang, Jiwen Lu, Ying Wu, and Jie Zhou
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
    Oral Presentation
    [arXiv] [Code]

    We propose an uncertainty-aware score distribution learning method and extend it to a multi-path model for action quality assessment.

Datasets

  • COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis. [Website] [Download]
  • FLAG3D: A 3D Fitness Activity Dataset with Language Instruction. [Website] [Download]
  • THU-READ: Tsinghua University RGB-D Egocentric Action Dataset. [Website] [Download]
  • LOGO: A Long-Form Video Dataset for Group Action Quality Assessment. [Website] [Download]
Teaching

  • Reinforcement Learning, Spring 2024
  • Deep Learning: Frontier and Interdisciplinary Research, Fall 2023
  • Data Mining: Theory and Algorithms, Fall 2022 (with Prof. Xinlei Chen)
Selected Honors and Awards

  • StarTrack Program by MSRA, 2023.
  • Young Elite Scientist Sponsorship Program by CAST, 2022.
  • Excellent Doctoral Dissertation Award of CAAI, 2021.
  • Excellent PhD Graduate of Beijing, 2020.
  • Excellent Doctoral Dissertation of Tsinghua University, 2020.
  • Zijing Scholar Fellowship for Prospective Researcher, Tsinghua University, 2020.
Group

  • PhD students:
    Zhiheng Li (2021-; with Prof. Jie Zhou)
    Yixuan Zhu (2022-; with Prof. Jie Zhou)
    Yong Liu (2023-)
    Shiyi Zhang (2023-)
    Xin Dong (2023-; with Pengcheng Lab)
    Chubin Zhang (2024-)
    Changyuan Wang (2024-)
    Haipeng Luo (2024-; with Pengcheng Lab)
  • Master's students (with personal homepages):
    Jinpeng Liu (2022-)
    Guanxing Lu (2023-)
    Wenxun Dai (2023-)
    Sule Bai (2023-)
    Haoji Zhang (2024-)
    Ao Li (2024-)
Alumni

    Xiaoke Huang (2022-2024)

Academic Services

  • Associate Editor: JVCI
  • Area Chair: CVPR 2025, FG 2023
  • Conference Reviewer: CVPR, ICCV, ECCV, AAAI, etc.
  • Journal Reviewer: TPAMI, TIP, TMM, TCSVT, etc.

Website Template