
Cannot reproduce training performance

Open rawalkhirodkar opened this issue 2 years ago • 16 comments

Hi Gyeongsik,

I am working on reproducing the numbers reported in the paper. Train datasets: H36M, MuCo, COCO. Test dataset: 3DPW.

I am using PyTorch 1.8, Python 3.8, and CUDA 10.


I did two runs. Here is the performance of snapshot12.pth (the last checkpoint of the lixel stage) on the 3DPW dataset:

1. Train batch size per GPU = 16, number of GPUs = 4 (the default config)
   MPJPE from lixel mesh: 96.23 mm
   PA MPJPE from lixel mesh: 60.68 mm
2. Train batch size per GPU = 24, number of GPUs = 8 (the bigger batch config)
   MPJPE from lixel mesh: 96.37 mm
   PA MPJPE from lixel mesh: 61.51 mm
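
For reference on the metrics: MPJPE is the mean Euclidean distance between predicted and ground-truth joints after root alignment, while PA MPJPE first rigidly aligns the prediction to the ground truth (Procrustes alignment: rotation, scale, translation) and then measures the same distance. A minimal NumPy sketch of the two metrics, as an illustration rather than the repo's evaluation code:

import numpy as np

def mpjpe(pred, gt):
    # pred, gt: (J, 3) joint coordinates in mm, root-aligned beforehand
    return np.mean(np.linalg.norm(pred - gt, axis=1))

def pa_mpjpe(pred, gt):
    # rigidly align pred to gt (rotation, scale, translation), then measure MPJPE
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, s, Vt = np.linalg.svd(p.T @ g)   # SVD of the cross-covariance (Kabsch/Procrustes)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # avoid an improper rotation (reflection)
        Vt[-1] *= -1
        s[-1] *= -1
        R = Vt.T @ U.T
    scale = s.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return np.mean(np.linalg.norm(aligned - gt, axis=1))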

I also trained the bigger batch config (run 2) through the param stage. Here is the performance of snapshot17.pth and snapshot15.pth (the best checkpoint) on the 3DPW dataset.

snapshot17.pth, param stage
MPJPE from lixel mesh: 95.85 mm
PA MPJPE from lixel mesh: 61.21 mm
MPJPE from param mesh: 98.11 mm
PA MPJPE from param mesh: 61.64 mm
snapshot15.pth, param stage
MPJPE from lixel mesh: 95.65 mm
PA MPJPE from lixel mesh: 60.97 mm
MPJPE from param mesh: 97.22 mm
PA MPJPE from param mesh: 60.82 mm

I am still waiting on the param stage of the default config and will edit this post then. However, the reported lixel MPJPE is 93.2 mm, and it looks unlikely that I will converge to that. Any suggestions? Should I train longer?

Thank you, I would greatly appreciate your help.

rawalkhirodkar avatar Sep 20 '21 14:09 rawalkhirodkar

You don't have to train longer. Could you let me know any modifications you made on top of the released codes?

mks0601 avatar Sep 21 '21 04:09 mks0601

Thank you for the reply. No modifications, the code was used right off the shelf.

I was also able to reproduce the reported results with the shared weights, so the test data setup is correct.

rawalkhirodkar avatar Sep 21 '21 05:09 rawalkhirodkar

Did you train the model with the provided SMPLify-X fits of Human3.6M, MSCOCO, and MuCo?

mks0601 avatar Sep 22 '21 12:09 mks0601

Yes, since I was reproducing the results, I made sure everything else was identical, including the data setup. Here is the param stage training log (the lixel stage log is too big to attach here): param_stage_train.txt

rawalkhirodkar avatar Sep 22 '21 15:09 rawalkhirodkar

That is weird. I tried training this model a few months ago and successfully reproduced the numbers in the paper. The PA MPJPEs of your trained models are too high.

mks0601 avatar Sep 22 '21 17:09 mks0601

Could you check MPJPE and PA MPJPE of all snapshots of the lixel stage? It seems you only checked the last snapshot.
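
Something like the loop below would do it. This is a sketch that assumes the test.py command line described in the README (--gpu, --stage, --test_epoch); adjust the GPU ids and snapshot range to your setup:

import subprocess

# Evaluate every lixel-stage snapshot on the test set configured in main/config.py.
for epoch in range(4, 13):
    subprocess.run(['python', 'test.py', '--gpu', '0-3',
                    '--stage', 'lixel', '--test_epoch', str(epoch)], check=True)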

mks0601 avatar Sep 22 '21 17:09 mks0601

Thank you for the suggestion. Here is the performance of the lixel stage snapshots, from snapshot 12 down to snapshot 4:

snapshot 12
MPJPE from lixel mesh: 96.23 mm, PA MPJPE from lixel mesh: 60.68 mm
MPJPE from param mesh: 476.56 mm, PA MPJPE from param mesh: 312.22 mm
snapshot 11
MPJPE from lixel mesh: 96.32 mm, PA MPJPE from lixel mesh: 61.05 mm
MPJPE from param mesh: 476.17 mm, PA MPJPE from param mesh: 311.95 mm
snapshot 10
MPJPE from lixel mesh: 97.20 mm, PA MPJPE from lixel mesh: 60.99 mm
MPJPE from param mesh: 476.03 mm, PA MPJPE from param mesh: 312.18 mm
snapshot 9
MPJPE from lixel mesh: 99.54 mm, PA MPJPE from lixel mesh: 62.00 mm
MPJPE from param mesh: 475.52 mm, PA MPJPE from param mesh: 313.29 mm
snapshot 8
MPJPE from lixel mesh: 95.19 mm, PA MPJPE from lixel mesh: 59.96 mm
MPJPE from param mesh: 476.22 mm, PA MPJPE from param mesh: 312.58 mm
snapshot 7
MPJPE from lixel mesh: 100.16 mm, PA MPJPE from lixel mesh: 61.76 mm
MPJPE from param mesh: 475.57 mm, PA MPJPE from param mesh: 313.10 mm
snapshot 6
MPJPE from lixel mesh: 98.81 mm, PA MPJPE from lixel mesh: 61.52 mm
MPJPE from param mesh: 476.44 mm, PA MPJPE from param mesh: 312.91 mm
snapshot 5
MPJPE from lixel mesh: 95.52 mm, PA MPJPE from lixel mesh: 60.33 mm
MPJPE from param mesh: 475.54 mm, PA MPJPE from param mesh: 312.57 mm
snapshot 4
MPJPE from lixel mesh: 100.93 mm, PA MPJPE from lixel mesh: 61.85 mm
MPJPE from param mesh: 474.39 mm, PA MPJPE from param mesh: 314.40 mm

rawalkhirodkar avatar Sep 22 '21 19:09 rawalkhirodkar

If this line and this line are the same as the pushed ones, I guess there is no problem in your code. The PA MPJPEs of the snapshots are pretty weird. It seems the modules are not being trained, because the errors do not change. Could you check the results of the default setting, not the bigger batch version?
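
One quick way to check that is to compare a module's weights across two snapshots: if every tensor is identical, the module was never updated. A rough sketch, assuming the snapshots store the model state dict under a 'network' key and DataParallel-style 'module.' prefixes; the file names are placeholders:

import torch

def module_changed(ckpt_a, ckpt_b, prefix):
    # Compare all parameters whose names start with `prefix` between two snapshots.
    sd_a = torch.load(ckpt_a, map_location='cpu')['network']
    sd_b = torch.load(ckpt_b, map_location='cpu')['network']
    keys = [k for k in sd_a if k.startswith(prefix)]
    return any(not torch.equal(sd_a[k], sd_b[k]) for k in keys)

# False means the module did not change at all between these snapshots.
print(module_changed('snapshot4.pth', 'snapshot12.pth', 'module.mesh_net.'))
print(module_changed('snapshot4.pth', 'snapshot12.pth', 'module.param_regressor.'))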

mks0601 avatar Sep 23 '21 00:09 mks0601

The results above are for the default setting, not the bigger batch version. I am using the old get_optimizer, bypassing trainable_modules. This should not make a difference, right?

def get_optimizer(self, model):
    if cfg.stage == 'lixel':
        # lixel stage: train every module except the SMPL parameter regressor
        optimizer = torch.optim.Adam(
            list(model.module.pose_backbone.parameters()) +
            list(model.module.pose_net.parameters()) +
            list(model.module.pose2feat.parameters()) +
            list(model.module.mesh_backbone.parameters()) +
            list(model.module.mesh_net.parameters()), lr=cfg.lr)
        print('The parameters of pose_backbone, pose_net, pose2feat, mesh_backbone, and mesh_net are added to the optimizer.')
    else:
        # param stage: train only the SMPL parameter regressor
        optimizer = torch.optim.Adam(model.module.param_regressor.parameters(), lr=cfg.lr)
        print('The parameters of param_regressor are added to the optimizer.')
    return optimizer

rawalkhirodkar avatar Sep 23 '21 00:09 rawalkhirodkar

I don't think changing those lines to the newer ones would make a difference, but could you try? If you still cannot reproduce the results, well... I can't come up with any new solutions.

mks0601 avatar Sep 23 '21 11:09 mks0601

Thank you for the suggestion. I did a fresh clone. The current code in this repo throws the error below (one of the reasons I switched to the older version):

  File "train.py", line 83, in <module>
    main()
  File "train.py", line 40, in main
    trainer._make_model()
  File "/Desktop/ochmr/lixel_original/main/../common/base.py", line 129, in _make_model
    optimizer = self.get_optimizer(model)
  File "/Desktop/ochmr/lixel_original/main/../common/base.py", line 55, in get_optimizer
    optimizer = torch.optim.Adam(total_params, lr=cfg.lr)
  File "/anaconda3/envs/lixel/lib/python3.8/site-packages/torch/optim/adam.py", line 48, in __init__
    super(Adam, self).__init__(params, defaults)
  File "/anaconda3/envs/lixel/lib/python3.8/site-packages/torch/optim/optimizer.py", line 55, in __init__
    self.add_param_group(param_group)
  File "/anaconda3/envs/lixel/lib/python3.8/site-packages/torch/optim/optimizer.py", line 255, in add_param_group
    raise TypeError("optimizer can only optimize Tensors, "
TypeError: optimizer can only optimize Tensors, but one of the params is Module.parameters
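
For context, the message means the optimizer was handed the bound method Module.parameters (or a list of generators) instead of the flattened parameter tensors. A minimal sketch of code that avoids the error, with placeholder modules rather than the repo's actual fix:

import itertools
import torch
import torch.nn as nn

# Placeholder stand-ins for the repo's trainable_modules list and learning rate.
trainable_modules = [nn.Linear(4, 4), nn.Linear(4, 2)]
lr = 1e-4

# Wrong: a list of bound methods / generators raises
# "optimizer can only optimize Tensors, but one of the params is Module.parameters".
# bad_params = [m.parameters for m in trainable_modules]

# Right: flatten every module's parameter tensors into a single iterable.
total_params = itertools.chain.from_iterable(m.parameters() for m in trainable_modules)
optimizer = torch.optim.Adam(total_params, lr=lr)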

rawalkhirodkar avatar Sep 25 '21 02:09 rawalkhirodkar

Sorry, I changed common/base.py. It should work now.

mks0601 avatar Sep 25 '21 02:09 mks0601

I am working on reproducing the numbers reported in the paper. Train datasets: H36M, MuCo, COCO. Test dataset: 3DPW. I am using PyTorch 1.6, Python 3.7, and CUDA 10.

Here is the performance of snapshot12.pth on the 3DPW dataset (the last checkpoint of the lixel stage):
MPJPE from lixel mesh: 99.11 mm
PA MPJPE from lixel mesh: 58.80 mm
MPJPE from param mesh: nan mm
PA MPJPE from param mesh: nan mm

zhLawliet avatar Oct 22 '21 03:10 zhLawliet

I am working on reproducing the result on 3DPW. Train datasets: H36M, COCO. Test dataset: 3DPW.
lr_dec_epoch = [10,12]
end_epoch = 13
lr = 1e-4

The performance is as follows and does not reach the numbers in the paper:
MPJPE from lixel mesh: 99.05 mm
PA MPJPE from lixel mesh: 62.68 mm

Should the training settings stay the same even if I use more data such as MuCo, or should I use a different training setting?
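
For reference, those settings map onto main/config.py roughly as below. This is a sketch reconstructed from the values quoted above; the dataset name strings are assumptions about how that file is usually filled in:

# main/config.py (relevant excerpt, reconstructed from the settings quoted above)
trainset_3d = ['Human36M']      # add 'MuCo' here to match the paper's 3D training data
trainset_2d = ['MSCOCO']
testset = 'PW3D'

lr = 1e-4
lr_dec_epoch = [10, 12]         # epochs at which the learning rate is decayed
end_epoch = 13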

Cakin-Kwong avatar Mar 04 '22 05:03 Cakin-Kwong

You don't have to train longer. Could you let me know any modifications you made on top of the released codes?

Hi @mks0601, I thought conventional training usually runs for more than 70 or 100 epochs. Why does the code run for only 10 epochs, which is much less? Thanks.

GloryyrolG avatar Jan 14 '23 05:01 GloryyrolG

We found that longer training is not necessary.

mks0601 avatar Jan 14 '23 08:01 mks0601