
matrix contains invalid numeric entries

Open mountain111 opened this issue 2 years ago • 3 comments

When the training process reaches the 33rd epoch, the following error is reported (xxxxxxx stands for a folder name):

-- Process 4 terminated with the following error:
Traceback (most recent call last):
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/tools/run.py", line 41, in distributed_main
    main(configuration, init_distributed=True)
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/tools/run.py", line 31, in main
    trainer.train()
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/e2edet/trainer/base_trainer.py", line 218, in train
    train_epoch(0, self)
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/e2edet/trainer/engine.py", line 171, in train_epoch
    output, _ = _forward("train", batch, model, trainer)
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/e2edet/trainer/engine.py", line 208, in _forward
    output = model(sample, target)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/e2edet/model/base_model.py", line 140, in __call__
    loss_dict = self.losses(model_output, target)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/e2edet/criterion/losses.py", line 496, in forward
    indices = self.matcher(enc_outputs, bin_targets)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/e2edet/module/matcher.py", line 136, in forward
    linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))
  File "/home/ma-user/xxxxxxx/user-job-dir/BoxeR/e2edet/module/matcher.py", line 136, in <listcomp>
    linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/scipy/optimize/_lsap.py", line 93, in linear_sum_assignment
    raise ValueError("matrix contains invalid numeric entries")
ValueError: matrix contains invalid numeric entries
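
For anyone landing here with the same message: SciPy's linear_sum_assignment raises this ValueError when the cost matrix contains NaN, so the matcher's cost matrix had most likely already gone bad before the call. Below is a minimal sketch that reproduces the error on a toy matrix and counts the offending entries. The helper names describe_invalid and safe_assignment are hypothetical and not part of BoxeR or SciPy; the idea would be to run a similar check on each cost block before it is passed to the solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def describe_invalid(cost):
    """Count NaN and infinite entries in a cost matrix (hypothetical helper)."""
    cost = np.asarray(cost, dtype=np.float64)
    return {
        "nan": int(np.isnan(cost).sum()),
        "posinf": int(np.isposinf(cost).sum()),
        "neginf": int(np.isneginf(cost).sum()),
    }


def safe_assignment(cost):
    """Hypothetical wrapper: fail early with a descriptive message."""
    report = describe_invalid(cost)
    # NaN is the case shown here; -inf is treated the same way as an assumption.
    if report["nan"] or report["neginf"]:
        raise ValueError(f"cost matrix has invalid entries: {report}")
    return linear_sum_assignment(cost)


# A valid toy cost matrix and a copy with one NaN injected.
good = np.array([[4.0, 1.0], [2.0, 3.0]])
bad = good.copy()
bad[0, 1] = np.nan

rows, cols = safe_assignment(good)

# Calling SciPy directly on the NaN matrix reproduces the reported error.
try:
    linear_sum_assignment(bad)
    scipy_error = None
except ValueError as exc:
    scipy_error = str(exc)
```

Logging the report together with the epoch and batch index should help narrow down which predictions or targets first produce a non-finite cost.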

mountain111 avatar May 18 '22 10:05 mountain111

Hi,

I haven't faced this problem with the codebase. Can you try to resume the training with your checkpoint and see whether it happens again?

kienduynguyen avatar May 18 '22 11:05 kienduynguyen

It is strange that the error does not occur until the 33rd epoch.

mountain111 avatar May 19 '22 01:05 mountain111

If you use fp16, the values can explode (overflow), but it should not happen again when you resume from the checkpoint.
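
As a rough illustration of the fp16 point (a NumPy sketch, not code from BoxeR): half precision tops out at 65504, so a value that overflows becomes inf, and a later operation such as inf - inf turns it into NaN. A NaN that reaches the matcher's cost matrix would then trigger the error above. The numbers are arbitrary and chosen only to force an overflow.

```python
import numpy as np

# Largest finite value representable in IEEE half precision.
fp16_max = float(np.finfo(np.float16).max)

x = np.float16(60000.0)

# Silence the overflow/invalid warnings NumPy would otherwise emit.
with np.errstate(over="ignore", invalid="ignore"):
    doubled = x + x            # 120000 does not fit in fp16, result is inf
    diff = doubled - doubled   # inf - inf is undefined, result is NaN
```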

kienduynguyen avatar May 19 '22 06:05 kienduynguyen