liangwx

Tsinghua University Beijing, China XiaoliangWang

Results: 14 comments of liangwx

Evaluation codes

Waiting for the Evaluation code too! @vchoutas

Will the training code be released?

Waiting for the training code too! @vchoutas

how to get and use the pretrained models and reproduce the results of paper

How should pretrain_hrnet.pkl be used? Do its weights contain only the backbone parameters, with no parameters for the 2D pose head?
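One way to answer the backbone-versus-head question empirically is to list the parameter names stored in the checkpoint and group them by their top-level prefix. The sketch below is only an illustration: `summarize_prefixes` is a hypothetical helper, and the example key names are made up, not taken from pretrain_hrnet.pkl.

```python
from collections import Counter


def summarize_prefixes(keys, depth=1):
    """Count parameter names by their first `depth` dot-separated components."""
    return dict(Counter(".".join(k.split(".")[:depth]) for k in keys))


# Hypothetical parameter names, for illustration only.
example_keys = [
    "backbone.conv1.weight",
    "backbone.conv1.bias",
    "backbone.layer1.0.weight",
    "head.final_layer.weight",
]

summary = summarize_prefixes(example_keys)
for prefix, count in sorted(summary.items()):
    print(f"{prefix}: {count} tensors")
```

Assuming the file was written with `torch.save`, the real key list would come from `torch.load(path, map_location="cpu")`; if that returns a dict of tensors, pass its keys to the helper. If no head-like prefix shows up in the summary, the file most likely holds backbone weights only.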

how to get and use the pretrained models and reproduce the results of paper

How did you obtain pretrain_hrnet.pkl? I got sh scripts/V1_train.sh running, and it did load pretrain_hrnet.pkl automatically. Roughly how many epochs does it take to reach results comparable to ROMP_HRNet32_V1.pkl?

how to get and use the pretrained models and reproduce the results of paper

The log produced by sh scripts/V1_train.sh is: [V1_hrnet_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g2,3.log](https://github.com/Arthur151/ROMP/files/8116261/V1_hrnet_h36m.mpiinf.coco.mpii.lsp.muco.crowdpose_g2.3.log)

The best evaluation result is at line 6166 (ln6166) of that log:

['Evaluation'] on local_rank 0

| DS/EM | MPJPE | PA_MPJPE |
|---|---|---|
| pw3d_vibe | 90.99 | 52.50 |

By comparison, evaluating /ROMP_HRNet32_V1.pkl with eval_3dpw_test.yml gives the following logs: [eval_3dpw_test.log](https://github.com/Arthur151/ROMP/files/8115767/eval_3dpw_test.log) [eval_3dpw_test.yml.log](https://github.com/Arthur151/ROMP/files/8115862/eval_3dpw_test.yml.log) ['Evaluation'] on...

how to get and use the pretrained models and reproduce the results of paper

Q3: Are the backbone parameters read by the default load_pretrain_params (pretrain_hrnet.pkl) really different from the backbone parameters read through model_path (pretrain_hrnet.pkl)? Below is the log after I changed the settings to fine_tune=True and model_path=/path/to/pretrain_hrnet.pkl: [myv2_train_pretrained_hrnet_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log](https://github.com/Arthur151/ROMP/files/8117164/myv2_train_pretrained_hrnet_h36m.mpiinf.coco.mpii.lsp.muco.crowdpose_g0.1.log) The loss is clearly larger and decreases more slowly. Because the validation value starts high and also drops slowly (it is still above 60), no evaluation has appeared so far.
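When two loading paths give different training behaviour for the same file, a quick check is to compare the parameter names the model expects with the names the checkpoint actually contains. The sketch below is a generic diagnostic, not ROMP's loading code: `compare_keys` and the example names are hypothetical, and stripping a leading `module.` is an assumption based on how PyTorch's `DataParallel` wrappers name parameters.

```python
def compare_keys(model_keys, ckpt_keys, strip_prefixes=("module.",)):
    """Report which parameter names match after removing wrapper prefixes."""

    def normalize(name):
        # Drop the first matching wrapper prefix, if any.
        for prefix in strip_prefixes:
            if name.startswith(prefix):
                return name[len(prefix):]
        return name

    model = {normalize(k) for k in model_keys}
    ckpt = {normalize(k) for k in ckpt_keys}
    return {
        "matched": sorted(model & ckpt),
        "missing_in_ckpt": sorted(model - ckpt),
        "unused_in_ckpt": sorted(ckpt - model),
    }


# Hypothetical names, for illustration only.
report = compare_keys(
    model_keys=["module.backbone.conv1.weight", "module.head.out.weight"],
    ckpt_keys=["backbone.conv1.weight", "backbone.bn1.weight"],
)
print(report)
```

If `matched` is empty or much smaller than expected, the weights were probably skipped silently during loading, which would look like training from scratch in the loss curve.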

how to get and use the pretrained models and reproduce the results of paper

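Thank you for your prompt and warm reply!

Q0: Is the stepwise drop in the loss caused by the heatmap and the param map improving each other? Would it be convenient for you to share your tensorboard events file so we can discuss this further? Also, you mentioned that "the training density of the kernel is relatively low"; what exactly does "kernel" refer to here?

Q1: Comparing my two logs above, pretrain_hrnet.pkl (1) lowers the initial loss at step 50 of epoch 0 by 1092.66 (2558.13 - 1465.47), (2) speeds up the loss decrease, so that the first evaluation appears at least about 37 epochs earlier, possibly more, and (3) keeps the loss from hovering between 200 and 300 for a long time without decreasing further. This shows that pretrain_hrnet.pkl is very important in ROMP training, both for the convergence process and for the upper bound on performance. But how much training time does pretrain_hrnet.pkl (pretrained on 2D pose estimation) itself cost? Is it worth following this training order: first obtain pretrain_hrnet.pkl, then train a well-fitted ROMP?

Q2: With pretrain_hrnet.pkl plus ROMP heads trained from scratch, a fairly good PMPJPE of 52.50 is reached after 2-3 epochs. Does this suggest that "the backbone providing correct features" contributes more than the "heads" to ROMP's overall PMPJPE performance?

Q3: In the log you posted [here](https://github.com/Arthur151/ROMP/issues/121#issuecomment-997549601), training also starts from scratch, yet an evaluation appears within epochs 0~1 and a fairly low PMPJPE of 52.97 is reached at epochs 9-10. Are you using an enhanced version of the training code? Both the initial loss and its rate of decrease are clearly better than in my training run where loading the backbone parameters failed. Which aspects did you focus on optimizing? With your enhanced training code, would it be possible to drop the dependence on pretrain_hrnet.pkl? When can the enhanced training code be released?

Q4: I only realized in the last few days that training ROMP takes a very long time (how long does training from scratch take with the non-enhanced training code, and how long with the enhanced one?) and places high demands on the number and quality of GPUs. Where do you mainly get your GPU resources? If GPU resources are scarce on my side, can you recommend any lower-cost ways to obtain them? Thank you!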

how to get and use the pretrained models and reproduce the results of paper

Have you tried training with batch_size=128? How were the results?

PA_MPJPE calculation failed! svd_cuda: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 55).

[my_V1_train_from_scratch_hrnet_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1,2,3.log](https://github.com/Arthur151/ROMP/files/8161185/my_V1_train_from_scratch_hrnet_h36m.mpiinf.coco.mpii.lsp.muco.crowdpose_g0.1.2.3.log) [hrnet_cm64_my_V1_train_from_scratch_hrnet.yml.log](https://github.com/Arthur151/ROMP/files/8161197/hrnet_cm64_my_V1_train_from_scratch_hrnet.yml.log) [my_v1.yml.log](https://github.com/Arthur151/ROMP/files/8161204/my_v1.yml.log) My training from scratch also failed. I changed only two settings: adjust_lr_factor: 1 and epoch: 200. What caused the failure? What ideas and steps could help avoid it? Does this happen often? Is this the reason you provide a pretrained backbone (one already trained on 2D pose datasets)?

PA_MPJPE calculation failed! svd_cuda: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 55).

I tried reloading intermediate checkpoints to continue training. After about 10 attempts, two problems still occur frequently: the nan problem mentioned earlier, and the "PA_MPJPE calculation failed" problem. Is there a better way to adjust things so that training can continue normally?
[V1_hrnet_continue_train_from_epoch_3_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_2250nan.log](https://github.com/Arthur151/ROMP/files/8206797/V1_hrnet_continue_train_from_epoch_3_h36m.mpiinf.coco.mpii.lsp.muco.crowdpose_g0.1.log_2250nan.log)
[V1_hrnet_continue_train_from_epoch_3_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_2400nan.log](https://github.com/Arthur151/ROMP/files/8206799/V1_hrnet_continue_train_from_epoch_3_h36m.mpiinf.coco.mpii.lsp.muco.crowdpose_g0.1.log_2400nan.log)
[V1_hrnet_continue_train_from_epoch_3_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_2950nan.log](https://github.com/Arthur151/ROMP/files/8206807/V1_hrnet_continue_train_from_epoch_3_h36m.mpiinf.coco.mpii.lsp.muco.crowdpose_g0.1.log_2950nan.log)
[V1_hrnet_continue_train_from_epoch4_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_750nan.log](https://github.com/Arthur151/ROMP/files/8206831/V1_hrnet_continue_train_from_epoch4_h36m.mpiinf.coco.mpii.lsp.muco.crowdpose_g0.1.log_750nan.log)
[V1_hrnet_continue_train_from_epoch4_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_250nan_vscode.log](https://github.com/Arthur151/ROMP/files/8206832/V1_hrnet_continue_train_from_epoch4_h36m.mpiinf.coco.mpii.lsp.muco.crowdpose_g0.1.log_250nan_vscode.log)
[V1_hrnet_continue_train_from_epoch4_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_after650_PA_MPJPE_failed.log](https://github.com/Arthur151/ROMP/files/8206833/V1_hrnet_continue_train_from_epoch4_h36m.mpiinf.coco.mpii.lsp.muco.crowdpose_g0.1.log_after650_PA_MPJPE_failed.log)
[V1_hrnet_continue_train_from_epoch4_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_after350_PA_MPJPE_failed.log](https://github.com/Arthur151/ROMP/files/8206834/V1_hrnet_continue_train_from_epoch4_h36m.mpiinf.coco.mpii.lsp.muco.crowdpose_g0.1.log_after350_PA_MPJPE_failed.log)
[V1_hrnet_continue_train_from_epoch4_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0.log_350nan.log](https://github.com/Arthur151/ROMP/files/8206835/V1_hrnet_continue_train_from_epoch4_h36m.mpiinf.coco.mpii.lsp.muco.crowdpose_g0.log_350nan.log)
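For context on why the nan problem and the PA_MPJPE failure may be linked: PA-MPJPE first aligns the predicted joints to the ground truth with a similarity transform (Procrustes alignment), and that alignment needs an SVD. If the predictions contain nan or inf, or collapse to a single point, the SVD input is ill-conditioned. The sketch below is a minimal NumPy version with guards for those cases, written under that assumption. It is not ROMP's evaluation code, and `pa_mpjpe` and its `eps` threshold are hypothetical.

```python
import numpy as np


def pa_mpjpe(pred, gt, eps=1e-12):
    """Procrustes-aligned mean per-joint error for (J, 3) arrays.

    Returns None instead of raising when the input cannot be aligned.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    # Guard 1: nan/inf predictions make the SVD input meaningless.
    if not (np.isfinite(pred).all() and np.isfinite(gt).all()):
        return None
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    var_p = float((p ** 2).sum())
    # Guard 2: all predicted joints at (nearly) the same point.
    if var_p < eps:
        return None
    try:
        U, S, Vt = np.linalg.svd(p.T @ g)  # 3x3 cross-covariance
    except np.linalg.LinAlgError:
        return None
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.array([1.0, 1.0, d])
    R = (Vt.T * D) @ U.T
    scale = float((S * D).sum()) / var_p
    aligned = scale * p @ R.T + mu_g
    return float(np.linalg.norm(aligned - gt, axis=1).mean())


# Illustration with random joints: a scaled, rotated, shifted copy aligns back.
rng = np.random.default_rng(0)
gt_joints = rng.normal(size=(14, 3))
c, s = np.cos(0.3), np.sin(0.3)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
pred_joints = 2.5 * gt_joints @ Rz.T + np.array([1.0, -2.0, 3.0])
print(pa_mpjpe(pred_joints, gt_joints))
```

A `None` return here would point at the predictions rather than the metric, so checking the network outputs and loss for nan before evaluation is a reasonable first step.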