PF-AFN

Training problem

Open hyyuan123 opened this issue 2 years ago • 2 comments

I have not changed the original code. After increasing the amount of training data, training with the PBAFN_e2e script stalls at epoch 78: the process neither makes progress nor reports an error. What could be the reason for this?

hyyuan123 avatar Mar 16 '23 03:03 hyyuan123
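
The thread does not identify the cause of the stall. One way to narrow down a silent hang like this is to find out where the process is stuck. The sketch below uses Python's standard `faulthandler` module to dump the stack of every thread whenever an epoch runs longer than a timeout. `train_one_epoch`, `run`, and the 600-second default are hypothetical placeholders, not part of the PF-AFN code.

```python
import faulthandler
import sys
import time


def train_one_epoch(epoch):
    """Hypothetical stand-in for the body of the real training loop."""
    time.sleep(0.01)
    return epoch


def run(num_epochs, watchdog_seconds=600.0, log_file=sys.stderr):
    """Run the epochs with a watchdog that reports where a hang occurs."""
    completed = []
    for epoch in range(num_epochs):
        # Arm the watchdog: if this epoch is still running after
        # watchdog_seconds, the tracebacks of all threads are written to
        # log_file (and again every watchdog_seconds while it stays stuck).
        faulthandler.dump_traceback_later(
            watchdog_seconds, repeat=True, file=log_file
        )
        try:
            completed.append(train_one_epoch(epoch))
        finally:
            # Disarm the watchdog once the epoch finishes normally.
            faulthandler.cancel_dump_traceback_later()
    return completed


if __name__ == "__main__":
    print(run(3, watchdog_seconds=60.0))
```

If the dump shows the main thread waiting inside a data-loader worker or a distributed collective call, that points at the data pipeline or process synchronization rather than the model itself.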

Hello @hyyuan123 I am planning to train the same model using a larger dataset. I am wondering if the problem with training still exists. If not, how did you resolve the issue?

MosbehBarhoumiRAI avatar Apr 24 '23 09:04 MosbehBarhoumiRAI

I encountered a problem during training at the 50th epoch. Can you help me take a look and see what might have caused the issue?

    Traceback (most recent call last):
      File "/home/xulei/CodePAF/PF-AFN-main/PF-AFN_train/train_PBAFN_stage1.py", line 196, in <module>
        warp_model.module.update_learning_rate(optimizer_warp)
      File "/usr/local/anaconda3/envs/xulei/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
        raise AttributeError("'{}' object has no attribute '{}'".format(
    AttributeError: 'AFWM' object has no attribute 'module'
    update learning rate: 0.000050 -> 0.000049
    Traceback (most recent call last):
      File "/home/xulei/CodePAF/PF-AFN-main/PF-AFN_train/train_PBAFN_stage1.py", line 196, in <module>
        warp_model.module.update_learning_rate(optimizer_warp)
      File "/usr/local/anaconda3/envs/xulei/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
        raise AttributeError("'{}' object has no attribute '{}'".format(
    AttributeError: 'AFWM' object has no attribute 'module'
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2184173) of binary: /usr/local/anaconda3/envs/xulei/bin/python

xxxxl888 avatar Jun 11 '23 13:06 xxxxl888
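
The `AttributeError` above says that `warp_model` is a bare `AFWM` instance at line 196, while the script expects a wrapper that exposes the real model as `.module` (as `torch.nn.DataParallel` and `torch.nn.parallel.DistributedDataParallel` do). Epoch 50 is presumably just the first epoch at which the script reaches the `update_learning_rate` call. A minimal sketch of one possible workaround, assuming the model was simply never wrapped in this setup: unwrap only when a wrapper is present. The two classes are hypothetical stand-ins so the example runs without PyTorch; only `unwrap_model` would be used in practice.

```python
def unwrap_model(model):
    """Return model.module if the model is wrapped, otherwise the model itself."""
    return getattr(model, "module", model)


class StandInAFWM:
    """Hypothetical stand-in for the AFWM warp model."""

    def __init__(self):
        self.updates = 0

    def update_learning_rate(self, optimizer=None):
        # The real method adjusts the optimizer; here we only count calls.
        self.updates += 1


class StandInWrapper:
    """Hypothetical stand-in for a DataParallel-style wrapper."""

    def __init__(self, module):
        self.module = module


if __name__ == "__main__":
    bare = StandInAFWM()
    wrapped = StandInWrapper(StandInAFWM())
    # The same call works for both the bare and the wrapped model.
    unwrap_model(bare).update_learning_rate()
    unwrap_model(wrapped).update_learning_rate()
    print(bare.updates, wrapped.module.updates)
```

In the training script this would correspond to replacing `warp_model.module.update_learning_rate(optimizer_warp)` with `unwrap_model(warp_model).update_learning_rate(optimizer_warp)`. The helper assumes the model does not itself define a submodule named `module`.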