Training Error assert (boxes1[:, 2:] >= boxes1[:, :2]).all()

Open Kyfafyd opened this issue 3 years ago • 2 comments

Instructions To Reproduce the 🐛 Bug:

what changes you made (git diff) or what code you wrote

No changes.

what exact command you run: python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path ../data/COCO2017 --output_dir output/conddetr_r50_epoch50
what you observed (including full logs):

| distributed init (rank 2): env://
| distributed init (rank 0): env://
| distributed init (rank 4): env://
| distributed init (rank 3): env://
| distributed init (rank 5): env://
| distributed init (rank 1): env://
| distributed init (rank 7): env://
| distributed init (rank 6): env://
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
git:
  sha: N/A, status: clean, branch: N/A

fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Namespace(aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, cls_loss_coef=2, coco_panoptic_path=None, coco_path='../data/COCO2017', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1, enc_layers=6, epochs=50, eval=False, focal_alpha=0.25, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=40, mask_loss_coef=1, masks=False, nheads=8, num_queries=300, num_workers=2, output_dir='output/conddetr_r50_epoch50', position_embedding='sine', pre_norm=False, rank=0, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=2, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=8)
number of params: 43196001
loading annotations into memory...
Done (t=20.78s)
creating index...
index created!
loading annotations into memory...
Done (t=0.56s)
creating index...
index created!
Start training
Epoch: [0]  [   0/7393]  eta: 7:05:21  lr: 0.000100  class_error: 85.57  loss: 45.1821 (45.1821)  loss_bbox: 3.7751 (3.7751)  loss_bbox_0: 3.7823 (3.7823)  loss_bbox_1: 3.7808 (3.7808)  loss_bbox_2: 3.7756 (3.7756)  loss_bbox_3: 3.7911 (3.7911)  loss_bbox_4: 3.7856 (3.7856)  loss_ce: 1.9574 (1.9574)  loss_ce_0: 2.0151 (2.0151)  loss_ce_1: 2.0196 (2.0196)  loss_ce_2: 2.1484 (2.1484)  loss_ce_3: 2.0683 (2.0683)  loss_ce_4: 2.0683 (2.0683)  loss_giou: 1.7011 (1.7011)  loss_giou_0: 1.7000 (1.7000)  loss_giou_1: 1.7040 (1.7040)  loss_giou_2: 1.7059 (1.7059)  loss_giou_3: 1.7022 (1.7022)  loss_giou_4: 1.7012 (1.7012)  cardinality_error_unscaled: 293.1250 (293.1250)  cardinality_error_0_unscaled: 293.1250 (293.1250)  cardinality_error_1_unscaled: 293.1250 (293.1250)  cardinality_error_2_unscaled: 281.9375 (281.9375)  cardinality_error_3_unscaled: 293.1250 (293.1250)  cardinality_error_4_unscaled: 293.1250 (293.1250)  class_error_unscaled: 85.5712 (85.5712)  loss_bbox_unscaled: 0.7550 (0.7550)  loss_bbox_0_unscaled: 0.7565 (0.7565)  loss_bbox_1_unscaled: 0.7562 (0.7562)  loss_bbox_2_unscaled: 0.7551 (0.7551)  loss_bbox_3_unscaled: 0.7582 (0.7582)  loss_bbox_4_unscaled: 0.7571 (0.7571)  loss_ce_unscaled: 0.9787 (0.9787)  loss_ce_0_unscaled: 1.0076 (1.0076)  loss_ce_1_unscaled: 1.0098 (1.0098)  loss_ce_2_unscaled: 1.0742 (1.0742)  loss_ce_3_unscaled: 1.0341 (1.0341)  loss_ce_4_unscaled: 1.0342 (1.0342)  loss_giou_unscaled: 0.8506 (0.8506)  loss_giou_0_unscaled: 0.8500 (0.8500)  loss_giou_1_unscaled: 0.8520 (0.8520)  loss_giou_2_unscaled: 0.8530 (0.8530)  loss_giou_3_unscaled: 0.8511 (0.8511)  loss_giou_4_unscaled: 0.8506 (0.8506)  time: 3.4521  data: 0.4687  max mem: 2932
Epoch: [0]  [ 100/7393]  eta: 1:17:39  lr: 0.000100  class_error: 85.74  loss: 28.2629 (33.7855)  loss_bbox: 1.5517 (2.3437)  loss_bbox_0: 1.5566 (2.3695)  loss_bbox_1: 1.5482 (2.3519)  loss_bbox_2: 1.5535 (2.3396)  loss_bbox_3: 1.5641 (2.3476)  loss_bbox_4: 1.5637 (2.3431)  loss_ce: 1.5467 (1.6584)  loss_ce_0: 1.5650 (1.6414)  loss_ce_1: 1.5443 (1.6461)  loss_ce_2: 1.5557 (1.6477)  loss_ce_3: 1.5392 (1.6545)  loss_ce_4: 1.5541 (1.6667)  loss_giou: 1.5534 (1.6289)  loss_giou_0: 1.5514 (1.6296)  loss_giou_1: 1.5541 (1.6292)  loss_giou_2: 1.5695 (1.6291)  loss_giou_3: 1.5526 (1.6289)  loss_giou_4: 1.5519 (1.6296)  cardinality_error_unscaled: 293.1875 (293.2420)  cardinality_error_0_unscaled: 293.1875 (293.2420)  cardinality_error_1_unscaled: 293.1875 (293.2420)  cardinality_error_2_unscaled: 293.1875 (293.1312)  cardinality_error_3_unscaled: 293.1875 (293.2420)  cardinality_error_4_unscaled: 293.1875 (293.1658)  class_error_unscaled: 75.6680 (75.4478)  loss_bbox_unscaled: 0.3103 (0.4687)  loss_bbox_0_unscaled: 0.3113 (0.4739)  loss_bbox_1_unscaled: 0.3096 (0.4704)  loss_bbox_2_unscaled: 0.3107 (0.4679)  loss_bbox_3_unscaled: 0.3128 (0.4695)  loss_bbox_4_unscaled: 0.3127 (0.4686)  loss_ce_unscaled: 0.7733 (0.8292)  loss_ce_0_unscaled: 0.7825 (0.8207)  loss_ce_1_unscaled: 0.7722 (0.8231)  loss_ce_2_unscaled: 0.7779 (0.8239)  loss_ce_3_unscaled: 0.7696 (0.8272)  loss_ce_4_unscaled: 0.7770 (0.8334)  loss_giou_unscaled: 0.7767 (0.8145)  loss_giou_0_unscaled: 0.7757 (0.8148)  loss_giou_1_unscaled: 0.7771 (0.8146)  loss_giou_2_unscaled: 0.7847 (0.8146)  loss_giou_3_unscaled: 0.7763 (0.8144)  loss_giou_4_unscaled: 0.7760 (0.8148)  time: 0.6098  data: 0.0105  max mem: 4353
Traceback (most recent call last):
  File "main.py", line 258, in <module>
    main(args)
  File "main.py", line 206, in main
    train_stats = train_one_epoch(
  File "/research/d4/gds/zwang21/ConditionalDETR/engine.py", line 41, in train_one_epoch
    loss_dict = criterion(outputs, targets)
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/research/d4/gds/zwang21/ConditionalDETR/models/conditional_detr.py", line 254, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/research/d4/gds/zwang21/ConditionalDETR/models/matcher.py", line 79, in forward
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
  File "/research/d4/gds/zwang21/ConditionalDETR/util/box_ops.py", line 59, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError
Traceback (most recent call last):
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/research/d4/gds/zwang21/anaconda3/bin/python', '-u', 'main.py', '--coco_path', '../data/COCO2017', '--output_dir', 'output/conddetr_r50_epoch50']' returned non-zero exit status 1.
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Killing subprocess 29668
Killing subprocess 29669
Killing subprocess 29670
Killing subprocess 29671
Killing subprocess 29672
Killing subprocess 29673
Killing subprocess 29674
Killing subprocess 29675

Please simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.

Expected behavior:

If there is no obvious error in "what you observed" provided above, please tell us the expected behavior.

Environment:

Provide your environment information using the following command:

Collecting environment information...
PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.9.2009 (Core) (x86_64)
GCC version: (GCC) 11.2.0
Clang version: Could not collect
CMake version: version 2.8.12.2

Python version: 3.8 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.22.2
[pip3] numpydoc==1.1.0
[pip3] pytorch-ignite==0.2.0
[pip3] pytorch-metric-learning==0.9.99
[pip3] torch==1.8.0
[pip3] torchaudio==0.8.0a0+a751e1d
[pip3] torchfile==0.1.0
[pip3] torchsampler==0.1.1
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.9.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.2.89              hfd86e86_1  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.2.0           h06a4308_296  
[conda] mkl-service               2.3.0            py38h27cfd23_1  
[conda] mkl_fft                   1.3.0            py38h42c9631_2  
[conda] mkl_random                1.2.1            py38ha9443f7_2  
[conda] numpy                     1.22.2                   pypi_0    pypi
[conda] numpydoc                  1.1.0              pyhd3eb1b0_1  
[conda] pytorch                   1.8.0           py3.8_cuda10.2_cudnn7.6.5_0    pytorch
[conda] pytorch-ignite            0.2.0                    pypi_0    pypi
[conda] pytorch-metric-learning   0.9.99                   pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torch                     1.10.0                   pypi_0    pypi
[conda] torchaudio                0.8.0                      py38    pytorch
[conda] torchfile                 0.1.0                    pypi_0    pypi
[conda] torchsampler              0.1.1                    pypi_0    pypi
[conda] torchsummary              1.5.1                    pypi_0    pypi
[conda] torchvision               0.9.0                py38_cu102    pytorch

Mar 24 '22 08:03 Kyfafyd

Sorry, but we have never encountered this error. It indicates that the predicted boxes have a negative width or height, which should not happen: the predicted (cx, cy, w, h) are passed through a sigmoid, so all w and h should be in the range [0, 1].
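As a rough illustration of the reasoning above, the sketch below re-implements the box conversion and the failing check on plain Python floats, so it runs without PyTorch. `box_cxcywh_to_xyxy` mirrors the name in the traceback, but this body is an assumption about what it does; `sigmoid` and `boxes_are_valid` are hypothetical helpers written for this example. It shows that sigmoid outputs always pass the check, and that a NaN in a prediction fails it, because any comparison with NaN is false. Whether NaNs are what happened in this run is not established by the log.

```python
import math


def sigmoid(x):
    # Logistic function; the result is never negative.
    return 1.0 / (1.0 + math.exp(-x))


def box_cxcywh_to_xyxy(box):
    # Assumed conversion: (cx, cy, w, h) -> (x0, y0, x1, y1).
    cx, cy, w, h = box
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def boxes_are_valid(boxes_xyxy):
    # Same condition as the failing assert: x1 >= x0 and y1 >= y0 for every box.
    return all(x1 >= x0 and y1 >= y0 for x0, y0, x1, y1 in boxes_xyxy)


# Arbitrary raw outputs; after the sigmoid, w and h cannot be negative.
logits = [(-3.0, 0.5, 2.0, -1.0), (4.0, -2.0, 0.0, 6.0)]
preds = [tuple(sigmoid(v) for v in row) for row in logits]
sigmoid_ok = boxes_are_valid([box_cxcywh_to_xyxy(b) for b in preds])

# A single NaN width makes both x0 and x1 NaN, and NaN >= NaN is False.
nan_pred = (0.5, 0.5, math.nan, 0.2)
nan_ok = boxes_are_valid([box_cxcywh_to_xyxy(nan_pred)])

print(sigmoid_ok, nan_ok)
```

In the training code, one possible way to test for this would be to check `torch.isfinite(out_bbox).all()` just before the `generalized_box_iou` call in `matcher.py`. That is a debugging suggestion, not something confirmed by this thread.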

Apr 01 '22 02:04 DeppMeng