Without changing the original code, after increasing the amount of training data, training the PBAFN_e2e stage stalls at epoch 78: the network neither makes progress nor reports any error. What could be the reason for this?
Hello @hyyuan123, I am planning to train the same model on a larger dataset. I am wondering whether this training problem still exists. If not, how did you resolve the issue?
I encountered a problem during training at the 50th epoch. Could you help me take a look and see what might have caused it?

```
Traceback (most recent call last):
  File "/home/xulei/CodePAF/PF-AFN-main/PF-AFN_train/train_PBAFN_stage1.py", line 196, in <module>
    warp_model.module.update_learning_rate(optimizer_warp)
  File "/usr/local/anaconda3/envs/xulei/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'AFWM' object has no attribute 'module'
update learning rate: 0.000050 -> 0.000049
Traceback (most recent call last):
  File "/home/xulei/CodePAF/PF-AFN-main/PF-AFN_train/train_PBAFN_stage1.py", line 196, in <module>
    warp_model.module.update_learning_rate(optimizer_warp)
  File "/usr/local/anaconda3/envs/xulei/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'AFWM' object has no attribute 'module'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2184173) of binary: /usr/local/anaconda3/envs/xulei/bin/python
```
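For context, this `AttributeError` usually means the script accessed `warp_model.module` while `warp_model` was a bare `AFWM` module, not wrapped in `nn.DataParallel` / `DistributedDataParallel` (the wrappers expose the underlying model via a `.module` attribute). A minimal sketch of the usual guard, using stand-in classes instead of the real PF-AFN / PyTorch ones (the `unwrap` helper name is hypothetical, not part of the repo):

```python
class _Wrapper:
    """Stand-in for nn.DataParallel: exposes the wrapped model as `.module`."""
    def __init__(self, module):
        self.module = module

class _Model:
    """Stand-in for AFWM, which defines its own update_learning_rate."""
    def update_learning_rate(self, optimizer=None):
        return "lr updated"

def unwrap(model):
    # Works for both a wrapped and an unwrapped model: only the wrapper
    # classes define `.module`, so a bare model is returned unchanged.
    return model.module if hasattr(model, "module") else model

bare = _Model()
wrapped = _Wrapper(bare)

# Either way we reach the same underlying model, avoiding the AttributeError.
assert unwrap(bare) is bare
assert unwrap(wrapped) is bare
print(unwrap(wrapped).update_learning_rate())
```

In the training script, replacing the direct `warp_model.module.update_learning_rate(optimizer_warp)` call with such a guard should make the code work whether or not distributed training is enabled; this is a sketch of the common pattern, not the repository's own fix.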