
Training questions

ekuznetsov139 opened this issue 3 years ago · 1 comment

I'm trying to reproduce BEV training results, and I have a few questions.

  1. The source tree references a "posetrack" dataset. It is not mentioned in the paper, its official web site is dead (as in, the domain name does not even resolve), and last activity in its maintainers' Twitter account was two years ago. For now, I've deleted it from the configs, but is there any other place where I can download it?
  2. The paper says that training was done on 4x V100; was that 16 GB or 32 GB? I am currently training on 2x 1080Ti (12 GB each), and I can only go up to batch size 18 without running out of GPU memory. I assume that you can do 64 because you have twice as many GPUs (so, your 64 is 16 per GPU)? Should I adjust the learning rate to compensate?
  3. The paper describes a two-step training strategy: "We first learn monocular 3D pose and shape estimation for 120 epochs on basic training datasets. Then we add the weak annotations of RH to training samples and train for 120 epochs." Does the v6_train.sh script correspond to the first step or the second step?
  4. How can I tell if it is training correctly? What kind of "Losses" should I see after one or two epochs? How long did it take to train?
  5. Do I understand correctly that the model assumes that all humans are seen at fairly low field-of-view angles (the paper mentions 60 degrees)? I tried the pretrained checkpoint on some wide-angle photos with 90 degree FOV and the results weren't very satisfactory.
  6. Is the "SMIL infant template" the same one that's available for download on the AGORA web site? (I've managed to import that one into Blender and to have a look at it, but I can't figure out how to do the same with your SMPLA_NEUTRAL.pth.) If it is, are you aware of its defects? Most notably, its hands are clenched into fists rather than flat, lips and eyeballs are messed up, and feet are way too small (I'd say they need to be about 50% larger.)

ekuznetsov139 commented Oct 10 '22

Thanks for your interest in BEV. Good questions! @ekuznetsov139

  1. About the posetrack dataset: the dataset publishers have set up a new website where you can apply for PoseTrack: https://github.com/anDoer/PoseTrack21
  2. The model was trained on 4x 16 GB V100s. To train faster with limited GPU memory, we highly recommend fine-tuning a pretrained model, for example starting from the ROMP backbone, which shares the same structure. Based on previous attempts by other developers, the learning rate is fine; they were able to train ROMP well with the current value (see the scaling sketch below if you still want to compensate for a smaller batch).
  3. The v6_train.sh script corresponds to the second step. As said above, please fine-tune the pretrained model, for example from the ROMP backbone; otherwise, training may take weeks (a loading sketch follows below).
  4. You can judge the training state from performance on the validation sets (e.g. 3DPW). Expect about 2 days for fine-tuning, or more than 2 weeks to train from scratch.
  5. Yes, we use a fixed FOV so that all predictions share the same, stable camera model for supervision. This is one of the limitations of BEV. While developing BEV, the limited resources and data we had forced us into many trade-offs like this (see the focal-length sketch below).
  6. Yes, the SMIL infant template is the one obtained from AGORA. I mixed the SMIL data into the SMPL model data to get SMPLA_NEUTRAL.pth (an inspection sketch follows below). Thanks for pointing out the limitations of this design; indeed, the resulting meshes for some babies are not good. We do need a new model that supports all ages, but sorry, we have not been able to dig deeper in this direction.
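On the batch-size question: if you do want to compensate for a smaller effective batch, a common heuristic (not something prescribed here; as noted above, the default learning rate has worked for other developers) is the linear scaling rule. A minimal sketch with hypothetical numbers; BEV's actual base learning rate lives in its config and may differ:

```python
# Hypothetical values: BEV's real base LR is set in the repo config, not shown here.
base_lr = 5e-5          # learning rate tuned for the reference batch size (assumed)
reference_batch = 64    # paper setup: 4 GPUs x 16 samples per GPU
actual_batch = 18       # e.g. what fits on 2x 1080Ti

# Linear scaling rule (Goyal et al., 2017): scale LR proportionally to batch size.
scaled_lr = base_lr * actual_batch / reference_batch
print(f"scaled learning rate: {scaled_lr:.2e}")  # ~1.41e-05 with these numbers
```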
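On fine-tuning from the ROMP backbone: the usual pattern is to load the ROMP checkpoint, keep only the backbone tensors, and load them non-strictly into the new model. A minimal sketch, where the checkpoint file name, its key layout, and `build_bev_model()` are all placeholders/assumptions; the repo's own loading utilities should be preferred:

```python
import torch

# Placeholder file name and key layout; inspect the real checkpoint first.
ckpt = torch.load("ROMP_HRNet32_V1.pkl", map_location="cpu")
state_dict = ckpt["model_state_dict"] if isinstance(ckpt, dict) and "model_state_dict" in ckpt else ckpt

# Keep only backbone tensors so the BEV-specific heads stay freshly initialized.
backbone_weights = {k: v for k, v in state_dict.items() if "backbone" in k}

model = build_bev_model()  # placeholder: construct the BEV model however the repo does it
missing, unexpected = model.load_state_dict(backbone_weights, strict=False)
print(f"loaded {len(backbone_weights)} backbone tensors; "
      f"{len(missing)} missing, {len(unexpected)} unexpected")
```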
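On the fixed-FOV assumption: the pinhole model shows how far a 90-degree photo is from the roughly 60-degree camera the model was supervised with, since the implied focal length shrinks a lot. A minimal sketch; the 512 px input size is an assumption:

```python
import math

def focal_from_fov(fov_deg: float, img_size_px: int = 512) -> float:
    """Focal length in pixels for a pinhole camera with the given FOV across img_size_px."""
    return (img_size_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)

print(focal_from_fov(60))  # ~443.4 px: close to the fixed camera BEV assumes
print(focal_from_fov(90))  # ~256.0 px: a much shorter focal length than assumed
```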
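On inspecting SMPLA_NEUTRAL.pth: since it is a PyTorch file, a quick way to look inside is to load it on CPU and print the keys and shapes. The internal layout assumed below is a guess, so inspect it before trying to export anything to Blender:

```python
import torch

# The internal layout of SMPLA_NEUTRAL.pth is an assumption here; inspect it first.
data = torch.load("SMPLA_NEUTRAL.pth", map_location="cpu")

def describe(obj, prefix=""):
    """Recursively print the keys and tensor/array shapes of a checkpoint dict."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            describe(v, f"{prefix}{k}.")
    elif hasattr(obj, "shape"):
        print(f"{prefix[:-1]}: {tuple(obj.shape)}")
    else:
        print(f"{prefix[:-1]}: {type(obj).__name__}")

describe(data)
# Once you know which entries hold the template vertices and faces (often named
# something like 'v_template' and 'f' in SMPL-family files), you can write them
# out as a simple .obj that Blender can import.
```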

Sorry for the late reply. I was caught up in a deadline. Best, Yu

Arthur151 commented Oct 20 '22