mdetr
Bbox assertion error when using ENB models (eval & pretrain as well): assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
Hi, I am using your requirements file and the same library versions, but I am getting this bbox assertion error only with the ENB models (ENB3 & ENB5); everything is fine with the ResNet backbone.
It seems that the bbox predictions are all NaN. I found the same error reported for DETR, but no clear solution to it (I tried different learning rates and batch sizes).
In eval it appears right away, but in pretraining it is very random and shows up at different iterations.
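For reference, the assert in the title is the degenerate-box check at the top of generalized_box_iou (util/box_ops.py in DETR/MDETR, if I'm reading the code right), and NaN coordinates trip it because any comparison with NaN is False:

```python
import torch

# Why NaN predictions trip this assert: any comparison involving NaN is False,
# so (boxes[:, 2:] >= boxes[:, :2]).all() fails even though no box is actually
# "inverted" (x2 < x1 or y2 < y1).
boxes = torch.tensor([[0.1, 0.1, 0.5, 0.5],
                      [float("nan")] * 4])
print((boxes[:, 2:] >= boxes[:, :2]).all())  # tensor(False) -> AssertionError
```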
Epoch: [0] [ 940/78534] eta: 8:58:29 lr: 0.000100 lr_backbone: 0.000010 lr_text_encoder: 0.000001 loss: 84.7997 (98.7713) loss_bbox: 1.9616 (3.2480) loss_bbox_0: 2.0860 (3.2589) loss_bbox_1: 2.0562 (3.2474) loss_bbox_2: 1.9565 (3.2655) loss_bbox_3: 2.0650 (3.2771) loss_bbox_4: 1.9537 (3.2593) loss_ce: 10.5923 (11.1107) loss_ce_0: 10.5793 (11.1826) loss_ce_1: 10.5315 (11.0932) loss_ce_2: 10.4743 (11.1262) loss_ce_3: 10.5987 (11.1445) loss_ce_4: 10.5877 (11.0967) loss_giou: 1.7100 (2.0738) loss_giou_0: 1.8241 (2.0809) loss_giou_1: 1.7486 (2.0892) loss_giou_2: 1.7185 (2.0759) loss_giou_3: 1.8257 (2.0803) loss_giou_4: 1.6419 (2.0610) cardinality_error_unscaled: 4.8750 (5.8658) cardinality_error_0_unscaled: 4.8750 (7.5942) cardinality_error_1_unscaled: 4.8750 (6.0007) cardinality_error_2_unscaled: 4.8750 (6.0588) cardinality_error_3_unscaled: 4.8750 (5.8966) cardinality_error_4_unscaled: 4.8750 (5.8688) loss_bbox_unscaled: 0.3923 (0.6496) loss_bbox_0_unscaled: 0.4172 (0.6518) loss_bbox_1_unscaled: 0.4112 (0.6495) loss_bbox_2_unscaled: 0.3913 (0.6531) loss_bbox_3_unscaled: 0.4130 (0.6554) loss_bbox_4_unscaled: 0.3907 (0.6519) loss_ce_unscaled: 10.5923 (11.1107) loss_ce_0_unscaled: 10.5793 (11.1826) loss_ce_1_unscaled: 10.5315 (11.0932) loss_ce_2_unscaled: 10.4743 (11.1262) loss_ce_3_unscaled: 10.5987 (11.1445) loss_ce_4_unscaled: 10.5877 (11.0967) loss_giou_unscaled: 0.8550 (1.0369) loss_giou_0_unscaled: 0.9121 (1.0405) loss_giou_1_unscaled: 0.8743 (1.0446) loss_giou_2_unscaled: 0.8593 (1.0379) loss_giou_3_unscaled: 0.9129 (1.0401) loss_giou_4_unscaled: 0.8209 (1.0305) time: 0.4399 data: 0.0077 max mem: 10505
Traceback (most recent call last):
File "main.py", line 646, in
I have the same issue. Any progress?
I get this error as well.
This error most likely indicates divergence in your model. You can check by adding asserts like
assert not boxes1.isnan().any().item(), "nan in boxes1"
in the generalized_box_iou function, which is the one currently failing.
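As a minimal sketch of what that check could look like (the helper name and exact placement are just a suggestion, not code from the repo):

```python
import torch

def assert_finite_boxes(boxes, name):
    """Fail with a readable message as soon as the predicted boxes contain
    NaN/inf, instead of the later, less informative degenerate-box assert."""
    assert not boxes.isnan().any().item(), f"nan in {name}"
    assert torch.isfinite(boxes).all().item(), f"inf in {name}"

# At the top of generalized_box_iou in util/box_ops.py:
#     assert_finite_boxes(boxes1, "boxes1")
#     assert_finite_boxes(boxes2, "boxes2")
```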
If this is established, there could be several reasons this is happening, in order of likelihood:
- Total batch size is not big enough. We trained with a total batch size of 64 (batch size 2 per GPU on 4 nodes of 8 GPUs). Training will probably be stable at batch size 32, and maybe at 16, but anything lower than that is unlikely to work.
- Mixed precision is tricky to get working. If you tried some sort of AMP autocast or FP16 training, that could be the reason for the divergence (see the sketch after this list for a quick way to catch it).
- Corruption in your training data that somehow triggers a gradient explosion.
- In some rare cases, a weight decay that is too high can create NaNs as well, although that is unlikely in this case because the error happens early in training.
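For example, a quick guard like this will surface the divergence before the assert fires (a sketch only; loss_dict follows DETR-style training loops, so adapt the names to your setup):

```python
import math
import torch

def non_finite_losses(loss_dict):
    """Return the names of loss terms that are NaN/inf -- an early sign of divergence."""
    bad = []
    for name, value in loss_dict.items():
        value = value.item() if torch.is_tensor(value) else float(value)
        if not math.isfinite(value):
            bad.append(name)
    return bad

# Inside the training loop, before backward():
#     bad = non_finite_losses(loss_dict)
#     if bad:
#         print(f"non-finite losses at this step: {bad}")
#         # skip the step / dump the batch here instead of crashing in the matcher
```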
Hope this helps
I got the error when running inference with the released checkpoint of MDETR-EB3 on RefCOCO.
same problem
I got the same problem when using the pre-trained MDETR-EB5 from torch.hub during inference. Does anyone know how to fix this?
Regarding the EB3 inference error above: have you checked the image features coming out of the EB3 backbone? In my case they are very large, and sometimes NaN.
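If it helps, here is a quick way to find where the non-finite values first appear (a debugging sketch; it assumes a regular torch.nn.Module backbone, and the helper name is made up):

```python
import torch

def attach_nan_hooks(model):
    """Register forward hooks that print the modules whose outputs contain
    NaN/inf values (useful for tracking down diverging features)."""
    def make_hook(name):
        def hook(module, inputs, output):
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for o in outputs:
                if torch.is_tensor(o) and not torch.isfinite(o).all():
                    print(f"non-finite output in {name} ({type(module).__name__})")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Example: attach_nan_hooks(model) (or just the visual backbone) before running
# inference, then check which layer is the first to report non-finite outputs.
```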
Hi, I'm trying FP16 and running evaluation on the Flickr30k val set. Its output is also giving NaN, though only for a few samples. What is this divergence issue you're speaking of? @alcinos