BoxeR
matrix contains invalid numeric entries
When the training process reaches the 33rd epoch, the following error is reported (xxxxxxx denotes a folder):
-- Process 4 terminated with the following error:
Traceback (most recent call last):
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/tools/run.py", line 41, in distributed_main
    main(configuration, init_distributed=True)
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/tools/run.py", line 31, in main
    trainer.train()
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/e2edet/trainer/base_trainer.py", line 218, in train
    train_epoch(0, self)
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/e2edet/trainer/engine.py", line 171, in train_epoch
    output, _ = _forward("train", batch, model, trainer)
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/e2edet/trainer/engine.py", line 208, in _forward
    output = model(sample, target)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/e2edet/model/base_model.py", line 140, in call
    loss_dict = self.losses(model_output, target)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/e2edet/criterion/losses.py", line 496, in forward
    indices = self.matcher(enc_outputs, bin_targets)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/e2edet/module/matcher.py", line 136, in forward
    linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/e2edet/module/matcher.py", line 136, in
Hi,
I haven't faced this problem with the codebase. Can you try to resume the training with your checkpoint and see whether it happens again?
It is strange that the error does not occur until the 33rd epoch.
If you use fp16, values can explode (overflow to inf or NaN), but that should not happen again when you resume training.
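To narrow this down, it may help to check the cost matrix for NaN or inf entries before it is passed to `linear_sum_assignment`, since that is the call that fails in the traceback. Below is a minimal sketch of such a check, not code from BoxeR: the helper name `sanitize_cost_matrix` and the `large_value` parameter are hypothetical, and replacing bad entries with a large finite cost is only one possible workaround (it hides the underlying instability rather than fixing it).

```python
import numpy as np


def sanitize_cost_matrix(cost, large_value=1e6):
    """Hypothetical helper: count and replace non-finite entries in a cost matrix.

    Returns the cleaned matrix and the number of entries that were NaN or inf.
    """
    cost = np.asarray(cost, dtype=np.float64)
    bad = ~np.isfinite(cost)  # True where the entry is NaN, +inf, or -inf
    num_bad = int(bad.sum())
    if num_bad:
        # Assumption: a large finite cost makes these pairs very unlikely
        # to be chosen by the assignment step.
        cost = np.where(bad, large_value, cost)
    return cost, num_bad


if __name__ == "__main__":
    example = np.array([[1.0, np.nan], [np.inf, 2.0]])
    cleaned, num_bad = sanitize_cost_matrix(example)
    print(f"replaced {num_bad} non-finite entries")
    print(cleaned)
```

In the matcher, the cleaned array would be passed on in place of `c[i]`. Logging `num_bad` together with the epoch and iteration could show whether the non-finite values first appear in the model outputs or in the cost computation.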