FLAG3D: A 3D Fitness Activity Dataset with Language Instruction

CVPR 2023

Yansong Tang*, ✝, 1

Jinpeng Liu*, 1

Aoyang Liu*, 1

Bin Yang1

Wenxun Dai1

Yongming Rao2

Jiwen Lu◊, 2

Jie Zhou2

Xiu Li◊, 1

* equal contribution, ✝ project lead, ◊ corresponding authors

1 Shenzhen International Graduate School, 2 Department of Automation, Tsinghua University



With its continuously growing popularity around the world, fitness activity analysis has become an emerging research topic in computer vision. While a variety of new tasks and algorithms have been proposed recently, there is a growing demand for data resources with high-quality data, fine-grained labels, and diverse environments. In this paper, we present FLAG3D, a large-scale 3D fitness activity dataset with language instruction containing 180K sequences of 60 categories. FLAG3D features the following three aspects: 1) accurate and dense 3D human poses captured by an advanced MoCap system to handle complex activities and large movements, 2) detailed and professional language instructions describing how to perform each specific activity, 3) versatile video resources from a high-tech MoCap system, rendering software, and cost-effective smartphones in natural environments. Extensive experiments and in-depth analysis show that FLAG3D offers significant research value for various challenges, such as cross-domain human action recognition, dynamic human mesh recovery, and language-guided human action generation. Our dataset and source code will be publicly available.


1. Overview of FLAG3D Dataset

FLAG3D contains 180K videos of 60 daily fitness activities. Our dataset comprises (a) 3D activity sequences captured by an advanced MoCap system, (b) rendered videos of different people with their SMPL parameters, and (c) real-world videos recorded with cost-effective smartphones in both indoor and outdoor natural environments. FLAG3D also provides a series of detailed and professional sentence-level language instructions for each fitness activity. All figures are best viewed in color.

2. Illustration of the Taxonomy

FLAG3D is systematically organized into three levels: body part, fitness activity, and language instruction. This figure details a concrete example of the “Squat With Alternate Knee Lift” activity, which is mainly driven by the quadriceps femoris muscle of the “Leg” part; the corresponding language instructions are shown on the left.

3. Dataset Gallery

Several examples from the FLAG3D dataset. From left to right: MoCap data, original rendered RGB videos, and rendered RGB videos with SMPL mesh fitting results.

4. Action Classes

In total, there are 60 action classes. The selected activities exercise most parts of the body, including the chest, back, shoulder, arm, neck, abdomen, waist, hip, and leg.


FLAG3D is a 3D fitness activity dataset composed of the following parts:
• Skeleton
• Language
• Video from Nature Scene
• Rendering Video (subset): Due to the data size, we share a subset containing 1,800 videos. If you need more, please email us.
• Raw Data: Raw data from the MoCap software. You can work with it in rendering software such as Unity.
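As a rough illustration of how the Skeleton part of the dataset might be handled downstream, the sketch below represents one MoCap sequence as a NumPy array of shape (T, J, 3) and applies root-joint centering, a common preprocessing step for skeleton-based action recognition. Note that the frame count, joint count, and array layout here are assumptions for illustration, not the official FLAG3D file format.

```python
import numpy as np

# Hypothetical layout (an assumption, not the official FLAG3D format):
# a sequence is an array of shape (T, J, 3) -- T frames, J joints, xyz.
T, J = 120, 24  # e.g. 120 frames, 24 joints (SMPL-style joint count)
rng = np.random.default_rng(0)
seq = rng.standard_normal((T, J, 3)).astype(np.float32)

def center_at_root(seq: np.ndarray, root_joint: int = 0) -> np.ndarray:
    """Subtract the root (pelvis) joint position in every frame,
    making the skeleton translation-invariant."""
    return seq - seq[:, root_joint:root_joint + 1, :]

centered = center_at_root(seq)
print(centered.shape)                     # (120, 24, 3)
print(bool(np.allclose(centered[:, 0], 0.0)))  # True: root sits at origin
```

The same centering works regardless of the actual joint count, since it only indexes the chosen root joint.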

If you are interested, please ask your advisor or a representative of your organization to sign this license agreement and send the electronic scan to liujp22@mails.tsinghua.edu.cn to obtain the dataset. For more information, please refer to our paper.


If you find our project useful, please consider citing us:

@inproceedings{tang2023flag3d,
  title={FLAG3D: A 3D Fitness Activity Dataset with Language Instruction},
  author={Yansong Tang and Jinpeng Liu and Aoyang Liu and Bin Yang and Wenxun Dai and Yongming Rao and Jiwen Lu and Jie Zhou and Xiu Li},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2023}
}