
A dataset for multiview 3D human pose estimation with detailed occlusion labels, powered by UnrealCV

Occlusion-Person Dataset

Overview

This dataset is part of our work AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild, which is published in IJCV. The paper is available at arXiv:2010.13302.


Fig 1. Human models used in Occlusion-Person

Previous benchmarks do not provide occlusion labels for the joints in images, which prevents numerical evaluation on the occluded joints. In addition, the amount of occlusion in those benchmarks is limited. To address these limitations, we construct the synthetic dataset Occlusion-Person. We adopt UnrealCV to render multiview images and depth maps from 3D models.

In particular, thirteen human models wearing different clothes are placed in nine different scenes such as living rooms, bedrooms, and offices. The human models are driven by poses selected from the CMU Motion Capture database. We purposely use objects such as sofas and desks to occlude some body joints. Eight cameras are placed in each scene to render the multiview images and depth maps. We provide the 3D locations of 15 joints as ground truth.

The occlusion label for each joint in an image is obtained by comparing its depth value (read from the depth map) to the depth of the 3D joint in the camera coordinate system. If the difference between the two depth values is smaller than 30 cm, the joint is considered not occluded; otherwise, it is occluded. The table below compares this dataset to existing benchmarks. In particular, about 20% of the body joints in our dataset are occluded.

Dataset            Frames   Cameras   Occluded Joints
Human3.6M          784k     4         -
Total Capture      236k     8         -
Panoptic           36k      31        -
Occlusion-Person   73k      8         20.3%
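
For reference, here is a minimal Python sketch of the occlusion check described above the table. It is illustrative only: the function name, argument names, and the assumption that depths are stored in centimeters are not part of the dataset; only the 30 cm rule comes from the description.

# 30 cm threshold from the description above.
OCCLUSION_THRESHOLD_CM = 30.0

def is_joint_occluded(depth_map, joint_xy, joint_depth):
    """Return True if the joint is occluded in this view.

    depth_map   : (H, W) rendered depth map (assumed to be in centimeters)
    joint_xy    : (2,) projected 2D joint location (x, y) in the image
    joint_depth : depth (z) of the 3D joint in the camera coordinate system
    """
    x, y = int(round(joint_xy[0])), int(round(joint_xy[1]))
    surface_depth = depth_map[y, x]  # depth of the closest visible surface at that pixel
    # If the two depths agree to within 30 cm, the joint itself is the visible surface.
    return abs(joint_depth - surface_depth) >= OCCLUSION_THRESHOLD_CM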


Fig 2. Typical images, ground-truth 2D joint locations, and depth maps. A joint marked with a red "x" is occluded.

Download & Extract

Manually download images.zip

(We now provide a script that downloads the data automatically; see the next section. This subsection can be skipped.)

Manually download all parts from OneDrive into a folder, e.g. ./data. Due to OneDrive's per-file size limit, the archive is split into 53 parts, each about 1 GB.

After all parts are fully downloaded, you should have files like this:

data
├── occlusion_person.zip.001
├── occlusion_person.zip.002
├── ...
├── occlusion_person.zip.053

You can run find ./data -type f | xargs md5sum > downloaded_checksum.txt to generate MD5 checksums for all the files (this may take a long time). Then compare the result against our pre-generated checksum file checksum.txt with diff checksum.txt downloaded_checksum.txt (if the line order differs, sort both files before diffing).
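
If md5sum is not available on your system, a rough Python equivalent of the checksum step could look like the sketch below. The file names simply mirror the layout above; the line order may differ from checksum.txt, so sort both files before running diff.

import hashlib
from pathlib import Path

def md5_of_file(path, chunk_size=1 << 20):
    """Hash the file in 1 MB chunks so the large parts need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Write md5sum-style lines ("<hash>  <path>") for every downloaded part.
with open("downloaded_checksum.txt", "w") as out:
    for part in sorted(Path("./data").glob("occlusion_person.zip.*")):
        out.write(f"{md5_of_file(part)}  ./{part.as_posix()}\n")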

Then extract the split archive with 7z x ./data/occlusion_person.zip.001 (7z picks up the remaining parts automatically). You should then have images.zip in the current directory.

Automatically download images.zip

Please run the script with Python 3:

pip install wget
python download.py

Download annotations

We also provide the train/val annotation files used in our experiments at OneDrive/annot.

Finally, organize the images and annotations into the structure below:

unrealcv
├── images.zip
├── annot
    ├── unrealcv_train.pkl
    ├── unrealcv_validation.pkl
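
As an optional sanity check, a few lines of Python can confirm that the layout above is in place (the paths are exactly those shown in the tree):

from pathlib import Path

root = Path("unrealcv")
expected = [
    root / "images.zip",
    root / "annot" / "unrealcv_train.pkl",
    root / "annot" / "unrealcv_validation.pkl",
]
for path in expected:
    status = "ok" if path.exists() else "MISSING"
    print(f"{status:8s} {path}")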

All done.

Data Description

Each annotation file (.pkl) contains a list of items. Each item is associated with an image via its "image" attribute.

We list all the attributes below and describe their meanings; a short loading sketch follows the list.

  • image: str, e.g. a05_sa01_s14_sce00/00_000000.jpg | the path to the associated image file
  • joints_2d: ndarray (15,2) | 2D ground-truth joint location (x, y) in image frame
  • joints_3d: ndarray (15,3) | 3D ground-truth joint location (x, y, z) in camera frame
  • joints_gt: ndarray (15,3) | 3D ground-truth joint location (x, y, z) in global frame
  • joints_vis: ndarray (15,1) | indicating if the joint is within the image boundary
  • joints_vis_2d: ndarray (15,1) | indicating if the joint is within the image boundary and not occluded
  • center: (2,) | ground-truth bounding box center in image frame
  • scale: (2,) | ground-truth bounding box size in image frame (multiply by 200 to get the size in pixels)
  • box: (4,) | ground-truth bounding box (top-left and bottom-right coordinates, can also be inferred from center and scale)
  • video_id: str, e.g. a05_sa01_s14_sce00
  • image_id: int | video_id and image_id can be used to differentiate frames
  • subject: int
  • action: int
  • subaction: int
  • camera_id: int | 0-7 in this dataset
  • camera: dict | camera extrinsic and intrinsic parameters; note that the definition of T is different. For detailed information, please refer to our released code (TODO)
  • source: str | an alias for the dataset name; identical for all items from the same dataset
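
To illustrate how these fields fit together, here is a minimal Python loading sketch. It assumes each item is a plain dict keyed by the attribute names above; the bounding-box reconstruction simply follows the center/scale description and is not taken from our release code.

import pickle
import numpy as np

# Load the validation annotations: a list of per-image items.
with open("unrealcv/annot/unrealcv_validation.pkl", "rb") as f:
    annotations = pickle.load(f)

item = annotations[0]
print(item["image"], item["video_id"], item["camera_id"])

# joints_vis marks joints inside the image; joints_vis_2d additionally requires them
# to be unoccluded, so the difference counts in-image but occluded joints.
num_occluded = int(item["joints_vis"].sum() - item["joints_vis_2d"].sum())
print(f"{num_occluded} of {item['joints_2d'].shape[0]} joints are occluded")

# Bounding box reconstructed from center and scale (scale is in units of 200 pixels).
center = np.asarray(item["center"], dtype=float)
half_size = np.asarray(item["scale"], dtype=float) * 200.0 / 2.0
top_left, bottom_right = center - half_size, center + half_size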

Citation

If you use this dataset, please consider citing our work.

@article{zhang2020adafuse,
  title={AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild},
  author={Zhe Zhang and Chunyu Wang and Weichao Qiu and Wenhu Qin and Wenjun Zeng},
  journal={IJCV},
  year={2020},
  publisher={Springer},
  pages={1--16},
}