
Bbox assertion error when using ENB models (eval & pretrain as well): assert (boxes1[:, 2:] >= boxes1[:, :2]).all()

[Open] OanaIgnat opened this issue 2 years ago • 8 comments

Hi, I am using your requirements file with the same library versions, but I am getting this bbox assertion error only when using the ENB models (ENB3 & ENB5); everything is fine when using the ResNet backbone.

It seems that the bbox predictions are all NaN. I have seen this error reported for DETR as well, but with no clear solution (I tried different learning rates and different batch sizes).

In eval it appears right away, but in pretraining mode it is quite random and occurs at different iterations.

Epoch: [0] [ 940/78534] eta: 8:58:29 lr: 0.000100 lr_backbone: 0.000010 lr_text_encoder: 0.000001 loss: 84.7997 (98.7713) loss_bbox: 1.9616 (3.2480) loss_bbox_0: 2.0860 (3.2589) loss_bbox_1: 2.0562 (3.2474) loss_bbox_2: 1.9565 (3.2655) loss_bbox_3: 2.0650 (3.2771) loss_bbox_4: 1.9537 (3.2593) loss_ce: 10.5923 (11.1107) loss_ce_0: 10.5793 (11.1826) loss_ce_1: 10.5315 (11.0932) loss_ce_2: 10.4743 (11.1262) loss_ce_3: 10.5987 (11.1445) loss_ce_4: 10.5877 (11.0967) loss_giou: 1.7100 (2.0738) loss_giou_0: 1.8241 (2.0809) loss_giou_1: 1.7486 (2.0892) loss_giou_2: 1.7185 (2.0759) loss_giou_3: 1.8257 (2.0803) loss_giou_4: 1.6419 (2.0610) cardinality_error_unscaled: 4.8750 (5.8658) cardinality_error_0_unscaled: 4.8750 (7.5942) cardinality_error_1_unscaled: 4.8750 (6.0007) cardinality_error_2_unscaled: 4.8750 (6.0588) cardinality_error_3_unscaled: 4.8750 (5.8966) cardinality_error_4_unscaled: 4.8750 (5.8688) loss_bbox_unscaled: 0.3923 (0.6496) loss_bbox_0_unscaled: 0.4172 (0.6518) loss_bbox_1_unscaled: 0.4112 (0.6495) loss_bbox_2_unscaled: 0.3913 (0.6531) loss_bbox_3_unscaled: 0.4130 (0.6554) loss_bbox_4_unscaled: 0.3907 (0.6519) loss_ce_unscaled: 10.5923 (11.1107) loss_ce_0_unscaled: 10.5793 (11.1826) loss_ce_1_unscaled: 10.5315 (11.0932) loss_ce_2_unscaled: 10.4743 (11.1262) loss_ce_3_unscaled: 10.5987 (11.1445) loss_ce_4_unscaled: 10.5877 (11.0967) loss_giou_unscaled: 0.8550 (1.0369) loss_giou_0_unscaled: 0.9121 (1.0405) loss_giou_1_unscaled: 0.8743 (1.0446) loss_giou_2_unscaled: 0.8593 (1.0379) loss_giou_3_unscaled: 0.9129 (1.0401) loss_giou_4_unscaled: 0.8209 (1.0305) time: 0.4399 data: 0.0077 max mem: 10505

Traceback (most recent call last):
  File "main.py", line 646, in <module>
    main(args)
  File "main.py", line 549, in main
    train_stats = train_one_epoch(
  File "/home/ubuntu/efs/users/oignat/internship/mdetr/engine.py", line 72, in train_one_epoch
    loss_dict.update(criterion(outputs, targets, positive_map))
  File "/home/ubuntu/better_glip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/efs/users/oignat/internship/mdetr/models/mdetr.py", line 666, in forward
    indices = self.matcher(outputs_without_aux, targets, positive_map)
  File "/home/ubuntu/better_glip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/better_glip/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/efs/users/oignat/internship/mdetr/models/matcher.py", line 75, in forward
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
  File "/home/ubuntu/efs/users/oignat/internship/mdetr/util/box_ops.py", line 51, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()

OanaIgnat avatar Sep 23 '22 23:09 OanaIgnat

I have the same issue. Any progress?

Ngheissari avatar Nov 05 '22 02:11 Ngheissari

I get this error as well.

shikunyu8 avatar Jan 31 '23 17:01 shikunyu8

This error most likely indicates divergence in your model. You can check by adding asserts like:

assert not boxes1.isnan().any().item(), "nan in boxes1"

in the generalized_box_iou function (util/box_ops.py), which is currently failing.
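For concreteness, here is a minimal sketch of such a check, written as a standalone helper that could be called on boxes1 and boxes2 at the top of generalized_box_iou. The helper name and messages are hypothetical, not part of the MDETR code:

```python
import torch

def assert_valid_boxes(boxes: torch.Tensor, name: str) -> None:
    """Diagnostic checks for xyxy boxes; call before computing GIoU."""
    # A NaN here means the model already diverged upstream,
    # e.g. the bbox head is emitting NaN predictions.
    assert not boxes.isnan().any().item(), f"nan in {name}"
    assert boxes.isfinite().all().item(), f"inf in {name}"
    # The check that is currently failing: valid boxes need x1 >= x0 and y1 >= y0.
    assert (boxes[:, 2:] >= boxes[:, :2]).all(), f"degenerate box in {name}"
```

If the NaN assert fires first, the problem is upstream of the matcher (in the predicted boxes), not in the box format itself.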

If this is confirmed, there are several possible reasons this is happening, in order of likelihood:

  1. The total batch size is not big enough. We trained with a total batch size of 64 (batch size 2 per GPU on 4 nodes of 8 GPUs each). Training will possibly still be stable at batch size 32, and maybe at 16, but anything lower than that is unlikely to work.
  2. Mixed precision is tricky to get working. If you tried some sort of AMP autocast or FP16 training, that could be the reason for the divergence.
  3. Corruption in your training data that somehow triggers a gradient explosion.
  4. In some rare cases, a weight decay that is too high can create NaNs as well, although that is unlikely here because the failure happens early in training.
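
To localize the first diverging iteration during training, one option is a guard like the following. This is a hypothetical sketch for a generic PyTorch training loop, not MDETR's engine.py; it skips the update when the loss is non-finite and clips gradients:

```python
import math

import torch

def safe_step(loss: torch.Tensor, model: torch.nn.Module,
              optimizer: torch.optim.Optimizer, max_norm: float = 0.1) -> bool:
    """Backprop and step only when the loss is finite; returns True if a step was taken."""
    optimizer.zero_grad()
    if not math.isfinite(loss.item()):
        print(f"Skipping update: non-finite loss {loss.item()}")
        return False
    loss.backward()
    # Clipping the gradient norm often helps against occasional explosions.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return True
```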

Hope this helps

alcinos avatar Jan 31 '23 19:01 alcinos

I got the error when running inference with the released checkpoint of MDETR-EB3 on RefCOCO.

shikunyu8 avatar Feb 01 '23 00:02 shikunyu8

same problem

conan1024hao avatar May 20 '23 08:05 conan1024hao

I got the same problem when using the pretrained MDETR-EB5 from torch.hub during inference. Does anyone know how to fix it?

Franklin905 avatar Dec 23 '23 02:12 Franklin905

> I got the error when running inference with the released checkpoint of MDETR-EB3 on RefCOCO.

Hi, have you checked the image features from EB3? In my case the features are very large, and sometimes NaN.
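
For anyone who wants to run the same check, here is a hypothetical helper that hooks a module and reports NaN or very large outputs. The module path model.backbone and the 1e4 threshold are placeholders, not taken from the MDETR code:

```python
import torch

def add_activation_check(module: torch.nn.Module, name: str):
    """Register a forward hook that reports NaNs or very large activations."""
    def hook(mod, inputs, output):
        outs = output if isinstance(output, (tuple, list)) else (output,)
        for i, t in enumerate(outs):
            if torch.is_tensor(t) and t.is_floating_point():
                if t.isnan().any():
                    print(f"{name} output[{i}] contains NaN")
                elif t.abs().max().item() > 1e4:
                    print(f"{name} output[{i}] max |value| = {t.abs().max().item():.1f}")
    return module.register_forward_hook(hook)

# Usage (placeholder names): handle = add_activation_check(model.backbone, "EB3 backbone")
```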

lclszsdnr avatar Jan 06 '24 12:01 lclszsdnr

> This error most likely indicates divergence in your model. [...] Mixed precision is tricky to get working. If you tried some sort of AMP autocast or FP16 training, that could be the reason for the divergence. [...]

Hi, I'm trying FP16 and running evaluation on the Flickr30k val set. Its output is also giving NaN, though only for a few samples. What is this divergence issue you're speaking of? @alcinos

BatmanofZuhandArrgh avatar Apr 01 '24 18:04 BatmanofZuhandArrgh