BoxInstSeg
ValueError: matrix contains invalid numeric entries
Hello @LiWentomng, I tried to reproduce your paper box2mask, but I ran into the following problem, and the model had a very large loss at the beginning of training. How can I solve it?
2023-01-07 12:27:40,129 - mmdet - INFO - Iter [50/368750] lr: 5.000e-06, eta: 3 days, 14:44:25, time: 0.847, data_time: 0.050, memory: 6779, loss_cls: 9.3236, loss_project: 6.2381, loss_levelset: 0.0710, d0.loss_cls: 9.0557, d0.loss_project: 5.5436, d0.loss_levelset: 0.0670, d1.loss_cls: 9.3925, d1.loss_project: 5.5199, d1.loss_levelset: 0.0640, d2.loss_cls: 9.1847, d2.loss_project: 5.7577, d2.loss_levelset: 0.0549, d3.loss_cls: 9.3142, d3.loss_project: 5.8749, d3.loss_levelset: 0.0656, d4.loss_cls: 9.4000, d4.loss_project: 5.8713, d4.loss_levelset: 0.0596, d5.loss_cls: 9.0998, d5.loss_project: 6.2049, d5.loss_levelset: 0.0682, d6.loss_cls: 9.1544, d6.loss_project: 6.1733, d6.loss_levelset: 0.0779, d7.loss_cls: 9.0938, d7.loss_project: 6.3329, d7.loss_levelset: 0.0836, d8.loss_cls: 8.7211, d8.loss_project: 6.4827, d8.loss_levelset: 0.0856, loss: 152.4366, grad_norm: 307.3523
Traceback (most recent call last):
File "./tools/train.py", line 242, in
Hello @zhaoyangwei123, the large loss is normal for box2mask. I have uploaded my training log file for COCO (r-101); you can refer to it.
I didn't encounter the above problem. It seems to be a problem with the assigner. Are you training on COCO or on your own dataset? I have tested the code and configs, and they work normally for COCO and VOC.
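The message in the issue title matches the one SciPy's `linear_sum_assignment` raises when its cost matrix contains NaN or infinite entries, which would fit the suggestion that the assigner is involved. As a minimal sketch for locating such entries before the assignment step (the helper name `find_non_finite` and the sample matrix are hypothetical, not part of BoxInstSeg):

```python
import math


def find_non_finite(cost):
    """Return (row, col) positions of NaN or infinite entries in a 2D cost matrix."""
    return [
        (i, j)
        for i, row in enumerate(cost)
        for j, value in enumerate(row)
        if not math.isfinite(value)
    ]


# Hypothetical 2x3 cost matrix (predictions x ground-truth boxes) with one NaN.
cost = [
    [0.2, 1.5, 0.7],
    [float("nan"), 0.3, 0.9],
]
bad_entries = find_non_finite(cost)
print(bad_entries)
```

A check like this only shows where the invalid values are; the cause (for example, a loss term that becomes nan) still has to be traced upstream.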
@LiWentomng I am training on COCO with 8 NVIDIA RTX 2080 Ti GPUs, so I changed the image size from (1024, 1024) to (800, 800) and set batch=1 and num_workers=0. I don't know whether changing these parameters is the cause.
@zhaoyangwei123 I suggest you first try VOC on the RTX 2080 Ti GPUs. VOC needs less GPU memory and less training time. The VOC link with COCO-format annotations is here.
I guess that batch_size=1 may cause this problem. I will check it.
@zhaoyangwei123 I have fixed this issue. When batch_size=1, the loss values became nan. You can try the current code. Please note that when batch_size=1, the learning rate lr, the training step, and max_iters (50e by default) need to be changed proportionally.
Any further questions can be discussed here.
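As a rough sketch of what "changed proportionally" could mean, the snippet below applies the common linear scaling rule: the learning rate shrinks with the total batch size while the step milestones and max_iters grow by the same factor. The rule itself, the function name `scale_schedule`, and all reference values are assumptions for illustration; they are not taken from the BoxInstSeg configs.

```python
def scale_schedule(ref_batch, ref_lr, ref_steps, ref_max_iters, new_batch):
    """Rescale lr, step milestones and max_iters for a new total batch size.

    Assumes the linear scaling rule: lr is proportional to the batch size,
    and the number of iterations is inversely proportional to it, so the
    number of samples seen during training stays the same.
    """
    factor = new_batch / ref_batch
    return {
        "lr": ref_lr * factor,
        "step": [round(s / factor) for s in ref_steps],
        "max_iters": round(ref_max_iters / factor),
    }


# Hypothetical reference schedule: total batch 16 (8 GPUs x 2 images).
# New setup: total batch 8 (8 GPUs x 1 image).
scaled = scale_schedule(
    ref_batch=16,
    ref_lr=1e-4,
    ref_steps=[100000],
    ref_max_iters=120000,
    new_batch=8,
)
print(scaled)
```

Here "batch" means the total batch size, that is, the number of GPUs multiplied by samples_per_gpu.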
@LiWentomng Thank you very much for your answer, but when I run your new code I get the following error:
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
    return obj_cls(**args)
  File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/pipelines/transforms.py", line 767, in __init__
    assert crop_size[0] > 0 and crop_size[1] > 0
TypeError: '>' not supported between instances of 'tuple' and 'int'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
    return obj_cls(**args)
  File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/custom.py", line 129, in __init__
    self.pipeline = Compose(pipeline)
  File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/pipelines/compose.py", line 23, in __init__
    transform = build_from_cfg(transform, PIPELINES)
  File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 72, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
TypeError: RandomCrop: '>' not supported between instances of 'tuple' and 'int'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tools/train.py", line 242, in
I verified boxlevelset and boxinst, and both work fine, so I think there may be an error in the box2mask code you uploaded.
@zhaoyangwei123 When did this error appear: at the start of training or during it?
I have tested the code and config with 800x800 and bs=1, and training works fine.
Judging from the reported error, is the image size in your config written in the right format, as image_size = (800, 800)?
Can you share your config?
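One way this TypeError can arise, offered as an assumption rather than a diagnosis of this particular config, is that crop_size reaches RandomCrop as a tuple nested inside another tuple, for example because of a stray trailing comma. The sketch below copies only the assertion quoted in the traceback; `check_crop_size` is a hypothetical stand-in, not the mmdet implementation.

```python
def check_crop_size(crop_size):
    """Illustrative copy of the assertion shown in the traceback."""
    assert crop_size[0] > 0 and crop_size[1] > 0
    return crop_size


image_size = (800, 800)
check_crop_size(image_size)  # passes: both entries are positive ints

# A nested tuple, e.g. produced by writing `image_size = (800, 800),`
# with a trailing comma, or by wrapping image_size in another tuple.
nested = (image_size,)
try:
    check_crop_size(nested)
except TypeError as err:
    message = str(err)
    print(message)
```

Printing `type(image_size)` and `image_size[0]` from the loaded config is a quick way to confirm whether this is what happens.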
@LiWentomng Hello, the error appears at the beginning of training. My config is: image_size = (1024, 1024), samples_per_gpu=1, workers_per_gpu=0, lr=0.00005. The rest of the configuration is unchanged. Because errors were reported on multiple GPUs, I decided to solve the problem on a single GPU first. On a single 2080 Ti, the image size can be left unchanged. I traced the error to line 767 of transforms.py