mdetr
Bbox assertion error when using ENB models (eval & pretrain as well): assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
Hi, I am using your requirements file and the same library versions, but I am getting this bbox assertion error only with the ENB models (ENB3 & ENB5); everything is fine with the ResNet backbone.
It seems that the bbox predictions are all NaN. I found the same error reported for DETR, but no clear solution to it (I tried different learning rates and batch sizes).
In eval it appears right away, but in pretraining it is very random and shows up at different iterations.
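For reference, the assert in the title is the degenerate-box check at the top of generalized_box_iou (util/box_ops.py in DETR/MDETR, if I'm reading the code right), and NaN coordinates trip it because any comparison with NaN is False:

```python
import torch

# Why NaN predictions trip this assert: any comparison involving NaN is False,
# so (boxes[:, 2:] >= boxes[:, :2]).all() fails even though no box is actually
# "inverted" (x2 < x1 or y2 < y1).
boxes = torch.tensor([[0.1, 0.1, 0.5, 0.5],
                      [float("nan")] * 4])
print((boxes[:, 2:] >= boxes[:, :2]).all())  # tensor(False) -> AssertionError
```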
Epoch: [0] [ 940/78534] eta: 8:58:29 lr: 0.000100 lr_backbone: 0.000010 lr_text_encoder: 0.000001 loss: 84.7997 (98.7713) loss_bbox: 1.9616 (3.2480) loss_bbox_0: 2.0860 (3.2589) loss_bbox_1: 2.0562 (3.2474) loss_bbox_2: 1.9565 (3.2655) loss_bbox_3: 2.0650 (3.2771) loss_bbox_4: 1.9537 (3.2593) loss_ce: 10.5923 (11.1107) loss_ce_0: 10.5793 (11.1826) loss_ce_1: 10.5315 (11.0932) loss_ce_2: 10.4743 (11.1262) loss_ce_3: 10.5987 (11.1445) loss_ce_4: 10.5877 (11.0967) loss_giou: 1.7100 (2.0738) loss_giou_0: 1.8241 (2.0809) loss_giou_1: 1.7486 (2.0892) loss_giou_2: 1.7185 (2.0759) loss_giou_3: 1.8257 (2.0803) loss_giou_4: 1.6419 (2.0610) cardinality_error_unscaled: 4.8750 (5.8658) cardinality_error_0_unscaled: 4.8750 (7.5942) cardinality_error_1_unscaled: 4.8750 (6.0007) cardinality_error_2_unscaled: 4.8750 (6.0588) cardinality_error_3_unscaled: 4.8750 (5.8966) cardinality_error_4_unscaled: 4.8750 (5.8688) loss_bbox_unscaled: 0.3923 (0.6496) loss_bbox_0_unscaled: 0.4172 (0.6518) loss_bbox_1_unscaled: 0.4112 (0.6495) loss_bbox_2_unscaled: 0.3913 (0.6531) loss_bbox_3_unscaled: 0.4130 (0.6554) loss_bbox_4_unscaled: 0.3907 (0.6519) loss_ce_unscaled: 10.5923 (11.1107) loss_ce_0_unscaled: 10.5793 (11.1826) loss_ce_1_unscaled: 10.5315 (11.0932) loss_ce_2_unscaled: 10.4743 (11.1262) loss_ce_3_unscaled: 10.5987 (11.1445) loss_ce_4_unscaled: 10.5877 (11.0967) loss_giou_unscaled: 0.8550 (1.0369) loss_giou_0_unscaled: 0.9121 (1.0405) loss_giou_1_unscaled: 0.8743 (1.0446) loss_giou_2_unscaled: 0.8593 (1.0379) loss_giou_3_unscaled: 0.9129 (1.0401) loss_giou_4_unscaled: 0.8209 (1.0305) time: 0.4399 data: 0.0077 max mem: 10505
Traceback (most recent call last):
File "main.py", line 646, in
I have the same issue. Any progress?
I get this error as well.
This error most likely indicates divergence in your model. You can check by adding asserts like
assert not boxes1.isnan().any().item(), "nan in boxes1"
in the generalized_box_iou function, which is the one currently failing.
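As a minimal sketch of what that check could look like (the helper name and exact placement are just a suggestion, not code from the repo):

```python
import torch

def assert_finite_boxes(boxes, name):
    """Fail with a readable message as soon as the predicted boxes contain
    NaN/inf, instead of the later, less informative degenerate-box assert."""
    assert not boxes.isnan().any().item(), f"nan in {name}"
    assert torch.isfinite(boxes).all().item(), f"inf in {name}"

# At the top of generalized_box_iou in util/box_ops.py:
#     assert_finite_boxes(boxes1, "boxes1")
#     assert_finite_boxes(boxes2, "boxes2")
```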
If this is established, there could be several reasons this is happening, in order of likelihood:
- Total batch size is not big enough. We trained with a total batch size of 64 (batch size 2 per GPU on 4 nodes of 8 GPUs). Training will probably be stable at batch size 32, and maybe at 16, but anything lower than that is unlikely to work.
- Mixed precision is tricky to get working. If you tried some sort of AMP autocast or FP16 training, that could be the reason for the divergence (see the sketch after this list for a quick way to catch it).
- Corruption in your training data that somehow triggers a gradient explosion.
- In some rare cases, a weight decay that is too high can create NaNs as well, although that is unlikely in this case because the error happens early in training.
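For example, a quick guard like this will surface the divergence before the assert fires (a sketch only; loss_dict follows DETR-style training loops, so adapt the names to your setup):

```python
import math
import torch

def non_finite_losses(loss_dict):
    """Return the names of loss terms that are NaN/inf -- an early sign of divergence."""
    bad = []
    for name, value in loss_dict.items():
        value = value.item() if torch.is_tensor(value) else float(value)
        if not math.isfinite(value):
            bad.append(name)
    return bad

# Inside the training loop, before backward():
#     bad = non_finite_losses(loss_dict)
#     if bad:
#         print(f"non-finite losses at this step: {bad}")
#         # skip the step / dump the batch here instead of crashing in the matcher
```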
Hope this helps
I got the error when running inference with the released checkpoint of MDETR-EB3 on RefCOCO.
same problem
I got the same problem when using the pre-trained MDETR-EB5 from torch.hub during inference. Does anyone know how to fix this?
Regarding the EB3 inference error above: have you checked the image features coming out of the EB3 backbone? In my case they are very large, and sometimes NaN.
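If it helps, here is a quick way to find where the non-finite values first appear (a debugging sketch; it assumes a regular torch.nn.Module backbone, and the helper name is made up):

```python
import torch

def attach_nan_hooks(model):
    """Register forward hooks that print the modules whose outputs contain
    NaN/inf values (useful for tracking down diverging features)."""
    def make_hook(name):
        def hook(module, inputs, output):
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for o in outputs:
                if torch.is_tensor(o) and not torch.isfinite(o).all():
                    print(f"non-finite output in {name} ({type(module).__name__})")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Example: attach_nan_hooks(model) (or just the visual backbone) before running
# inference, then check which layer is the first to report non-finite outputs.
```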
Hi, I'm trying FP16 and running evaluation on the Flickr30k val set. Its output is also giving NaN, though only for a few samples. What is this divergence issue you're speaking of? @alcinos