
PA_MPJPE calculation failed! svd_cuda: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 55).

Open jiheeyang opened this issue 3 years ago • 14 comments

Hi, I have difficulty training on the 6 datasets (mpiinf, coco, mpii, lsp, muco, crowdpose). The training code runs successfully for a while (no more than a few epochs) and then this error appears in the training log file. Can you give me a solution for this?

Epoch 6

In epoch 6, the losses are still printed, but the "INFO:root:Evaluation on pw3d" results are NaN.


Epoch 7

In epoch 7, the log shows "PA_MPJPE calculation failed! svd_cuda: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 55)".


jiheeyang avatar Jan 02 '22 12:01 jiheeyang

Sorry about that! It seems that the training has not converged; the loss is very large. When the PA_MPJPE calculation fails, it means that the training has failed completely.
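For context, PA_MPJPE first Procrustes-aligns the predicted joints to the ground truth via an SVD, so once the network outputs NaN or degenerate joint positions, svd_cuda receives an ill-conditioned matrix and raises exactly this error. A minimal sketch of the alignment step (not the exact ROMP implementation; reflection handling is omitted for brevity):

import torch

def pa_mpjpe(pred, gt):
    # pred, gt: (num_joints, 3) joint positions for one sample.
    mu_p, mu_g = pred.mean(0, keepdim=True), gt.mean(0, keepdim=True)
    X, Y = pred - mu_p, gt - mu_g
    # SVD of the 3x3 cross-covariance: this is the svd_cuda call that fails
    # when X contains NaN or has (near-)repeated singular values.
    U, S, V = torch.svd(Y.t() @ X)
    R = U @ V.t()                      # optimal rotation
    scale = S.sum() / (X ** 2).sum()   # optimal scale
    aligned = scale * X @ R.t() + mu_g
    return (aligned - gt).norm(dim=-1).mean()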

Could you please share the configuration .yml file you used for training, especially the batch size you set? Did you start training from the pretrained model?

Arthur151 avatar Jan 04 '22 02:01 Arthur151

This is the configuration .yml file for training. I changed GPUS, datasets, and sample_prob in configs/v1.yml.

ARGS:
 tab: 'V1_hrnet' 
 dataset: 'mpiinf,coco,mpii,lsp,muco,crowdpose'
 GPUS: 0,1,
 distributed_training: False
 model_version: 1
 pretrain: 'imagenet'
 match_preds_to_gts_for_supervision: True

 master_batch_size: -1
 val_batch_size: 16
 batch_size: 64
 nw: 4
 nw_eval: 2
 lr: 0.00005

 fine_tune: False
 fix_backbone_training_scratch: False
 eval: False
 supervise_global_rot: False

 model_return_loss: False
 collision_aware_centermap: True
 collision_factor: 0.2
 homogenize_pose_space: True
 shuffle_crop_mode: True
 shuffle_crop_ratio_2d: 0.1
 shuffle_crop_ratio_3d: 0.4

 merge_smpl_camera_head: False
 head_block_num: 2

 backbone: 'hrnet'
 centermap_size: 64
 centermap_conf_thresh: 0.2

 model_path: None

loss_weight:
  MPJPE: 200.
  PAMPJPE: 360.
  P_KP2D: 400.
  Pose: 80.
  Shape: 6.
  Prior: 1.6
  CenterMap: 160.

sample_prob:
 h36m: 0.0
 mpiinf: 0.16
 coco: 0.2
 lsp: 0.16
 mpii: 0.2
 muco: 0.14
 crowdpose: 0.14

jiheeyang avatar Jan 04 '22 06:01 jiheeyang

I strongly recommend adjusting the sample_prob. The sampling rate of each dataset should take its number of samples into account. Please reduce the sampling rates of lsp and mpii, which contain fewer samples, and of crowdpose and coco, which contain weak annotations. The early stage of training still needs accurate 3D pose datasets, which is why I developed shuffle_crop_ratio_3d.
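As a rough illustration of taking dataset size into account when setting sample_prob (the counts below are placeholders, not the real dataset sizes):

# Hypothetical sample counts per dataset; replace with the real numbers.
dataset_sizes = {'mpiinf': 96000, 'coco': 64000, 'mpii': 25000,
                 'lsp': 2000, 'muco': 400000, 'crowdpose': 20000}
total = sum(dataset_sizes.values())
# Size-proportional probabilities as a starting point, before further
# down-weighting the weakly annotated 2D sets such as coco and crowdpose.
sample_prob = {name: round(count / total, 3) for name, count in dataset_sizes.items()}
print(sample_prob)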

Arthur151 avatar Jan 04 '22 08:01 Arthur151

Thank you. I reduced sample_prob for lsp and mpii, so more epochs are processed, but the same error still occurs. I also modified part of dataset/image_base.py: I changed 2 to 1 because the following error occurred:

File "/hdd1/YJH/romp_pytorch/ROMP/romp/lib/dataset/image_base.py", line 498, in test_dataset
    img_bsname = os.path.basename(r['imgpath'][inds])
IndexError: list index out of range


Is the error related to this change?

jiheeyang avatar Jan 06 '22 02:01 jiheeyang

Please note that test_dataset is only executed when you want to test the data loading of a specific dataset; it is not executed during formal usage such as training, testing, or evaluation. The batch size defined here determines the length of the list.
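In other words, each batch's r['imgpath'] only holds batch_size entries, so looping past that length raises the IndexError above. A paraphrased sketch of the debugging loop (not the exact ROMP code, names approximate):

import os
from torch.utils.data import DataLoader

def test_dataset(dataset, batch_size=2):
    loader = DataLoader(dataset, batch_size=batch_size)
    for r in loader:
        # r['imgpath'] is a list with batch_size entries, so a hard-coded
        # range(2) combined with batch_size=1 walks past the end and raises
        # "IndexError: list index out of range".
        for inds in range(batch_size):
            print(os.path.basename(r['imgpath'][inds]))
        break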

Arthur151 avatar Jan 06 '22 02:01 Arthur151

my_V1_train_from_scratch_hrnet_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1,2,3.log hrnet_cm64_my_V1_train_from_scratch_hrnet.yml.log my_v1.yml.log

My training from scratch also failed. I only changed two settings, adjust_lr_factor: 1 and epoch: 200. What is the cause of the failure? What ideas or steps could help avoid this situation? Does it happen often? Is that why you provide a pretrained backbone (already trained on 2D pose datasets)?

liangwx avatar Mar 01 '22 11:03 liangwx

Judging from the log, an abnormal loss caused the gradients to explode. On my side this only happened earlier, when I was building the pretrained model and testing training; reloading an intermediate checkpoint and continuing training fixed it. The problem appears in the early stage of training, and I have not studied the exact cause in detail. But when training from the pretrained model, which skips the basic feature-building stage, this problem does not occur.
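For reference, a generic PyTorch resume-from-checkpoint pattern; the model, file name, and checkpoint keys are illustrative, not ROMP's exact checkpoint format:

import torch
import torch.nn as nn

model = nn.Linear(10, 10)   # stand-in for the real network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

# Reload the last checkpoint saved before the loss exploded and continue training.
ckpt = torch.load('last_good_checkpoint.pkl', map_location='cpu')
model.load_state_dict(ckpt['model_state_dict'], strict=False)
optimizer.load_state_dict(ckpt['optimizer_state_dict'])
start_epoch = ckpt.get('epoch', 0) + 1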

Arthur151 avatar Mar 01 '22 13:03 Arthur151

It looks like there is something systematic about this problem. If it were just an occasional NaN, it should not reappear so quickly after reloading an intermediate checkpoint and continuing training.

liangwx avatar Mar 08 '22 14:03 liangwx

Yes, your log also suggests that something is wrong. These are all fine-tuning runs from train-from-scratch checkpoints, right? With the pretrained model this problem does not occur. In fact, when I trained from scratch I also trained 2D pose heatmaps and identity maps, and when the 2D pose information was learned at the same time this problem did not appear. If this is really troublesome for you, you could try starting training from HigherHRNet's HRNet-32 pretraining, for example this one; their model was also trained on 2D pose. Ruling out other factors, 2D pose features appear to be crucial for feature building, so starting from the HigherHRNet pretraining should avoid this problem. I am really sorry about this bug. When I open-sourced the code I only verified that training from the pretrained model worked; training from scratch takes too long and I was rushing a deadline, so I did not test it. The only difference from my original training is the 2D pose pretraining, and I will rerun the experiments to verify this issue as soon as possible!
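For example, a generic way to initialize the backbone from such an external HRNet-32 checkpoint, copying only the keys that match (the function and key names are illustrative, not ROMP's actual loader):

import torch

def load_partial_backbone(backbone, ckpt_path):
    # Load an external HRNet-32 checkpoint (e.g., from HigherHRNet) and copy
    # only the weights whose names and shapes match the current backbone.
    pretrained = torch.load(ckpt_path, map_location='cpu')
    if 'state_dict' in pretrained:          # some releases wrap the weights
        pretrained = pretrained['state_dict']
    own = backbone.state_dict()
    matched = {k: v for k, v in pretrained.items()
               if k in own and v.shape == own[k].shape}
    own.update(matched)
    backbone.load_state_dict(own)
    print(f'loaded {len(matched)}/{len(own)} backbone tensors')
    return backbone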

Arthur151 avatar Mar 08 '22 14:03 Arthur151

I see. One more question: after NaN appears for the first time, the NaN loss should produce NaN gradients and therefore abnormal parameter values, so why do some of the following steps still produce normal, non-NaN loss values?

liangwx avatar Mar 09 '22 03:03 liangwx

I guess that it might be this line.

Arthur151 avatar Mar 09 '22 03:03 Arthur151

I don't quite understand: isn't nan/(nan/1000.) still nan?

liangwx avatar Mar 09 '22 04:03 liangwx

Yes, you are right. Maybe we can add something like this to avoid gradient collapse:

# Replace NaN losses with zero tensors so the sum remains a differentiable tensor.
loss_list = [torch.zeros_like(value) if torch.isnan(value).item() else value for value in loss_dict.values()]
# Rescale any loss above loss_thresh so a single exploding term cannot dominate the gradient.
loss = sum([value if value.item() < args().loss_thresh else value / (value.item() / args().loss_thresh) for value in loss_list])
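Another safeguard worth combining with this, assuming the usual model/optimizer training loop around the loss above, is clipping the global gradient norm after backward() so that one bad batch cannot blow up the weights:

loss.backward()
# Cap the total gradient norm before the optimizer step; max_norm=1.0 is an
# arbitrary example value to tune, not a ROMP default.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()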

Arthur151 avatar Mar 09 '22 04:03 Arthur151