Yansong Tang
I am a tenure-track Assistant Professor at the Tsinghua-Berkeley Shenzhen Institute, Tsinghua University, where I direct the IVG@SZ (Intelligent Vision Group at Shenzhen, the sister group of the IVG in Beijing). Before that, I was a postdoctoral researcher in the Department of Engineering Science at the University of Oxford, working with Prof. Philip H. S. Torr and Prof. Victor Prisacariu. My research interests lie in computer vision; currently, I work on video analytics, vision-language understanding, and 3D reconstruction.
I received my Ph.D. degree with honors from Tsinghua University, advised by Prof. Jie Zhou and Prof. Jiwen Lu, and my B.S. degree in Automation, also from Tsinghua University. I have also spent time at the Visual Computing Group of Microsoft Research Asia (MSRA) and at Prof. Song-Chun Zhu's VCLA lab at the University of California, Los Angeles (UCLA).
I am looking for self-motivated Master's/Ph.D. students and postdocs. If you have strong grades or coding skills, are highly creative, and are interested in joining my group, please do not hesitate to send me your CV and transcripts.
Email / Google Scholar / GitHub
News
2022-04: Gave a talk at MSRA about LAVT.
2022-03: Five papers accepted to CVPR 2022.
Selected Publications
(*Equal Contribution, #Corresponding Author)
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
Zhao Yang*, Jiaqi Wang*, Yansong Tang#, Kai Chen, Hengshuang Zhao, Philip H.S. Torr
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
[arXiv]
[code]
We present an end-to-end hierarchical Transformer-based network for referring segmentation.
BNV-Fusion: Dense 3D Reconstruction using Bi-level Neural Volume Fusion
Kejie Li, Yansong Tang, Victor Adrian Prisacariu, Philip H.S. Torr
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
[arXiv]
[code]
We present Bi-level Neural Volume Fusion, which leverages recent advances in neural implicit representations and neural rendering for dense 3D reconstruction. In order to incrementally integrate new depth maps into a global neural implicit representation, we propose a novel bi-level fusion strategy that considers both efficiency and reconstruction quality by design.
DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
Yongming Rao*, Wenliang Zhao*, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, Jiwen Lu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
[arXiv]
[code]
DenseCLIP is a new framework for dense prediction that implicitly and explicitly leverages the pre-trained knowledge from CLIP.
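A minimal sketch of the core pixel-text matching step, using dummy tensors and illustrative shapes (the full framework additionally refines the text embeddings via context-aware prompting):

    import torch
    import torch.nn.functional as F

    # Dummy per-pixel visual embeddings and per-class text embeddings from a
    # CLIP-style encoder pair; all shapes here are illustrative assumptions.
    pix = F.normalize(torch.randn(1, 512, 32, 32), dim=1)  # (B, C, H, W)
    txt = F.normalize(torch.randn(10, 512), dim=1)         # (K classes, C)

    # Cosine similarity between every pixel and every class description gives
    # dense per-class score maps that can supervise a segmentation head.
    score_maps = torch.einsum("bchw,kc->bkhw", pix, txt)   # (B, K, H, W)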
Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition
Yansong Tang*, Xingyu Liu*, Xumin Yu, Danyang Zhang, Jiwen Lu, and Jie Zhou
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2022
[PDF]
[code]
We devise a temporal-spatial Cubism strategy that guides the network to be aware of the permutation of segments in the temporal domain and of body parts in the spatial domain, thereby improving the generalization ability of the model for cross-dataset action recognition.
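As a rough illustration of the temporal half of this strategy (the function name and segment count below are hypothetical, not the released code):

    import itertools
    import random
    import numpy as np

    def temporal_cubism(sequence, num_segments=3):
        # sequence: (T, J, C) array of T frames, J joints, C coordinates.
        segments = np.array_split(sequence, num_segments, axis=0)
        perms = list(itertools.permutations(range(num_segments)))
        label = random.randrange(len(perms))  # which permutation is applied
        shuffled = np.concatenate([segments[i] for i in perms[label]], axis=0)
        return shuffled, label  # the network is trained to predict `label`

    # The spatial counterpart permutes groups of joints (body parts) along
    # axis=1 instead of temporal segments along axis=0.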
Comprehensive Instructional Video Analysis: The COIN Dataset and Performance Evaluation
Yansong Tang, Jiwen Lu, and Jie Zhou
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
[arXiv]
[Project Page]
Journal version of the COIN dataset.
Breaking Shortcut: Exploring Fully Convolutional Cycle-Consistency for Video Correspondence Learning
Yansong Tang*, Zhenyu Jiang*, Zhenda Xie*, Yue Cao, Zheng Zhang, Philip H. S. Torr, Han Hu
ICCV SRVU workshop, 2021
[arXiv]
[code]
We observe a collapse phenomenon when directly applying a fully convolutional cycle-consistency method to video correspondence learning, study the underlying reason behind it, and propose a spatial transformation approach to address this issue.
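A generic sketch of the round-trip objective (not the released implementation): dense features of frame A are softly matched to frame B and back, and the composite affinity is pushed toward the identity.

    import torch
    import torch.nn.functional as F

    def cycle_consistency_loss(feat_a, feat_b, tau=0.07):
        # feat_a, feat_b: (N, C) L2-normalized features at N spatial locations.
        aff_ab = F.softmax(feat_a @ feat_b.t() / tau, dim=1)  # A -> B
        aff_ba = F.softmax(feat_b @ feat_a.t() / tau, dim=1)  # B -> A
        round_trip = aff_ab @ aff_ba                          # A -> B -> A
        target = torch.arange(feat_a.size(0))                 # identity mapping
        return F.nll_loss(torch.log(round_trip + 1e-8), target)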
Hierarchical Interaction Network for Video Object Segmentation from Referring Expressions
Zhao Yang*, Yansong Tang*, Luca Bertinetto, Hengshuang Zhao, Philip H. S. Torr
British Machine Vision Conference (BMVC), 2021
[arXiv]
[code] (to come)
We present an end-to-end hierarchical interaction network for video object segmentation from referring expressions, which leverages the feature pyramid produced by the visual encoder to generate multiple levels of multi-modal features.
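For intuition only, fusing a sentence embedding with each level of a visual feature pyramid might look as follows (shapes and the fusion operator are illustrative assumptions, not the paper's architecture):

    import torch

    lang = torch.randn(1, 256)                             # sentence embedding
    pyramid = [torch.randn(1, 256, s, s) for s in (64, 32, 16, 8)]

    # Broadcast the language feature over every spatial location at each
    # level, yielding one multi-modal feature map per pyramid level.
    fused = [v * lang[:, :, None, None] for v in pyramid]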
Uncertainty-aware Score Distribution Learning for Action Quality Assessment
Yansong Tang*, Zanlin Ni*, Jiahuan Zhou, Danyang Zhang, Jiwen Lu, Ying Wu, and Jie Zhou
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
Oral Presentation
[arXiv]
[code]
We propose an uncertainty-aware score distribution learning method and extend it to a multi-path model for action quality assessment.
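A hedged sketch of the score-distribution idea (the bin discretization and names are illustrative): the scalar ground-truth score is softened into a Gaussian over discrete score bins, and the predicted distribution is trained with a KL-divergence loss.

    import torch
    import torch.nn.functional as F

    def soft_label(score, bins, sigma=1.0):
        # Turn a scalar judge score into a Gaussian distribution over bins.
        g = torch.exp(-(bins - score) ** 2 / (2 * sigma ** 2))
        return g / g.sum()

    bins = torch.arange(0.0, 101.0)        # e.g., scores quantized to 0..100
    target = soft_label(83.5, bins)        # Gaussian-smoothed ground truth
    logits = torch.randn(1, bins.numel())  # stand-in for the network output
    loss = F.kl_div(F.log_softmax(logits, dim=1), target.unsqueeze(0),
                    reduction="batchmean")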
Graph Interaction Networks for Relation Transfer in Human Activity Videos
Yansong Tang, Yi Wei, Xumin Yu, Jiwen Lu, and Jie Zhou
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2020
[PDF]
[code] (to come)
We propose a graph interaction network (GIN) model for transferring relation knowledge across two graphs in two different scenarios for video analysis: a newly proposed setting for unsupervised skeleton-based action recognition across different datasets, and supervised group activity recognition with multi-modal inputs.
Learning Semantics-Preserving Attention and Contextual Interaction for Group Activity Recognition
Yansong Tang, Jiwen Lu, Zian Wang, Ming Yang, and Jie Zhou
IEEE Transactions on Image Processing (TIP), 2019
[PDF]
[Supp]
We extend our semantics-preserving attention model with a graph convolutional module for group activity recognition.
COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis
Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, Jie Zhou
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019
[arXiv]
[Project Page]
[Annotation Tool]
COIN is one of the largest and most comprehensive instructional video analysis datasets with rich annotations.
Multi-stream Deep Neural Networks for RGB-D Egocentric Action Recognition
Yansong Tang, Zian Wang, Jiwen Lu, Jianjiang Feng, and Jie Zhou
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2019
[PDF]
[Project Page]
[code]
We propose a multi-stream deep neural network, along with the THU-READ dataset, for RGB-D egocentric action recognition.
Mining Semantics-Preserving Attention for Group Activity Recognition
Yansong Tang, Zian Wang, Peiyang Li, Jiwen Lu, Ming Yang, and Jie Zhou
ACM Multimedia (MM), 2018
Oral Presentation
[PDF]
We present a simple yet effective semantics-preserving attention module for group activity recognition.
Deep Progressive Reinforcement Learning for Skeleton-based Action Recognition
Yansong Tang*, Yi Tian*, Jiwen Lu, Peiyang Li, and Jie Zhou
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018
[PDF]
We propose a simple yet effective method to select key frames for skeleton-based action recognition using the REINFORCE algorithm.
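A self-contained toy version of the key idea, with dummy features and a stand-in reward (not the paper's code): a policy scores frames, samples a subset, and the recognition reward weights the REINFORCE gradient.

    import torch

    T, K = 32, 5                                 # clip length, key frames to pick
    feats = torch.randn(T, 256)                  # dummy per-frame features
    policy = torch.nn.Linear(256, 1)             # scores each frame

    logits = policy(feats).squeeze(-1)           # (T,) frame-selection logits
    dist = torch.distributions.Categorical(logits=logits)
    idx = dist.sample((K,))                      # sample K key frames
    reward = torch.rand(())                      # stand-in recognition reward
    loss = -(dist.log_prob(idx).sum() * reward)  # REINFORCE objective
    loss.backward()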
Selected Honors and Awards
Excellent Doctoral Dissertation Award of CAAI, 2021.
Excellent PhD Graduate of Beijing, 2020.
Excellent Doctoral Dissertation of Tsinghua University, 2020.
Zijing Scholar Fellowship for Prospective Researcher, Tsinghua University, 2020.
Academic Services
Area Chair: FG 2023
Conference Reviewer: CVPR, ICCV, ECCV, AAAI, among others
Journal Reviewer: TPAMI, TIP, TMM, TCSVT, among others