3DCrowdNet_RELEASE

Some questions about processing 2D datasets

Open · yuchen-ji opened this issue 2 years ago · 3 comments

I found that you crop the closest person for 3D datasets if there are multiple persons in one image, but for 2D datasets you may crop all persons in one image. Why do you process these datasets differently? In addition, the cropped image may contain another person. Won't this process bring ambiguity to the network? Thank you very much!

yuchen-ji · Jul 05 '22 07:07

> I found that you crop the closest person for 3D datasets if there are multiple persons in one image, but for 2D datasets you may crop all persons in one image. Why do you process these datasets differently?

The cropping process is the same regardless of the dataset. Even if you crop the closest person in a 3D dataset, other people can still appear in the cropped image. Also, MuCo, which you are referring to, does not originally have multiple people in one image; it synthesizes images of multiple real people into a single image using their depths.

> In addition, the cropped image may contain another person. Won't this process bring ambiguity to the network?

That is the challenge of crowded scenes, which 3DCrowdNet resolves. Please see the paper.

hongsukchoi · Jul 05 '22 14:07
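The reply's point that cropping is dataset-agnostic, yet a crop around one person can still capture neighbors, can be illustrated with a typical top-down crop: take a person's bounding box, enlarge it to a fixed aspect ratio with some padding, and check which other people's boxes fall inside it. This is a minimal sketch under assumptions: the helper names, the (x, y, width, height) box format, the square aspect ratio, and the 1.25 padding factor are hypothetical and are not taken from the 3DCrowdNet code.

```python
from typing import List, Tuple

# Hypothetical box format: (x, y, width, height) in image pixels.
Box = Tuple[float, float, float, float]


def expand_to_aspect(box: Box, aspect: float = 1.0, scale: float = 1.25) -> Box:
    """Grow a person box to a fixed aspect ratio (width / height) and pad it.

    Hypothetical helper: the square aspect and 1.25 padding are assumptions.
    """
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    # Enlarge the shorter side so that w / h == aspect without shrinking the box.
    if w > aspect * h:
        h = w / aspect
    else:
        w = h * aspect
    # Pad the box so limbs near the border are not cut off.
    w, h = w * scale, h * scale
    return (cx - w / 2.0, cy - h / 2.0, w, h)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes (0.0 if disjoint)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    return ix * iy


def others_in_crop(target: int, boxes: List[Box]) -> List[int]:
    """Indices of other people whose boxes intersect the target person's crop."""
    crop = expand_to_aspect(boxes[target])
    return [i for i, b in enumerate(boxes) if i != target and overlap_area(crop, b) > 0.0]


if __name__ == "__main__":
    # Two people standing close together and one far away (made-up boxes).
    people = [(100, 50, 60, 180), (150, 60, 60, 170), (400, 50, 60, 180)]
    print(others_in_crop(0, people))  # the neighbor at index 1 lands in the crop
    print(others_in_crop(2, people))  # the isolated person's crop has no one else
```

The same crop rule is applied to every person regardless of which dataset the image came from; whether a crop contains other people depends only on how close they stand, which is why crowded images yield ambiguous crops even when the target person is chosen carefully.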

Thanks for your reply! I found that many top-down methods train on single-person datasets, which prevents ambiguity during training because the cropped image does not contain another person. At inference, however, the cropped image often contains other people, yet these methods usually regress the right person's SMPL parameters. Does this mean that including other people in the cropped image during training would bring ambiguity to the network and make training difficult? In 3DCrowdNet, the cropped image contains other people even during training, but a robust 2D pose heatmap is added to resolve the ambiguity. Is my understanding correct?

yuchen-ji · Jul 05 '22 14:07
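One common way to turn 2D keypoints into the kind of pose heatmap discussed in the question is to render a Gaussian blob per joint and stack those maps with the image features as extra channels, so the network is told which person in the crop to reconstruct. The sketch below assumes that scheme; the function names, the confidence weighting, the plain nested-list layout, and `sigma=2.0` are hypothetical choices, not taken from the 3DCrowdNet code.

```python
import math
from typing import List, Tuple

# Hypothetical layout: a map is [height][width]; a stack of maps is [channels][height][width].
Map2D = List[List[float]]


def render_heatmaps(joints: List[Tuple[float, float, float]],
                    height: int, width: int, sigma: float = 2.0) -> List[Map2D]:
    """Render one Gaussian heatmap per 2D joint.

    joints: (x, y, confidence) in heatmap pixel coordinates, e.g. from an
    off-the-shelf 2D pose estimator. Scaling each map by the joint's
    confidence is an assumption made for this sketch.
    """
    maps = []
    for x, y, conf in joints:
        hm = [[conf * math.exp(-((c - x) ** 2 + (r - y) ** 2) / (2.0 * sigma ** 2))
               for c in range(width)]
              for r in range(height)]
        maps.append(hm)
    return maps


def concat_channels(features: List[Map2D], heatmaps: List[Map2D]) -> List[Map2D]:
    """Append heatmap channels to an image feature map with the same height and width."""
    if features and heatmaps:
        if (len(features[0]), len(features[0][0])) != (len(heatmaps[0]), len(heatmaps[0][0])):
            raise ValueError("feature map and heatmaps must share height and width")
    return features + heatmaps


if __name__ == "__main__":
    hms = render_heatmaps([(2.0, 1.0, 1.0), (0.0, 3.0, 0.5)], height=4, width=5)
    feats = [[[0.0] * 5 for _ in range(4)] for _ in range(3)]  # 3 dummy feature channels
    print(len(concat_channels(feats, hms)))  # 3 feature channels + 2 joint channels
```

Because each heatmap peaks only at the target person's joints, the stacked input differs between two people who share the same crop, which is one way the ambiguity described in the question can be reduced.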

> Does this mean that if other people are included in the cropped image during training, it will bring ambiguity to the network and make training difficult?

No. I think you are confused about how deep learning works. Given accurate ground truth, a neural network becomes robust to the ambiguity during training. Then, the neural network performs better on those ambiguous inputs at test time.

hongsukchoi · Jul 06 '22 14:07