CenterNet mixed-precision training cannot work well with specific cuDNN versions

Open hyingho opened this issue 1 year ago • 0 comments

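CenterNet mixed-precision training does not work correctly with specific cuDNN versions: either the loss becomes NaN within the first few iterations, or gradient computation fails with a RuntimeError.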

How to reproduce

  • Revision: 859d9d4a
  • Branch: master
  • GPUs: V100
  • Use nnabla/nnabla-ext-cuda-multi-gpu:py310-cuda110-mpi3.1.6-v1.34.0 as the base image and install the necessary packages. (see https://github.com/sony/nnabla-examples/blob/master/object-detection/centernet/requirements.txt)
  • The CUDA/cuDNN versions in the above image are cuda=11.0.3, CUDNN_VERSION=8.0.5.39
  • Run the following command:
python src/main.py ctdet --config_file=cfg/resnet_18_coco_mp.yaml --data_dir path_to_coco_dataset

Error messages

2023-03-02 06:18:26,839 [nnabla][INFO]: Using DataIterator
2023-03-02 06:18:26,865 [nnabla][INFO]: Creating model...
2023-03-02 06:18:26,865 [nnabla][INFO]: {'hm': 80, 'wh': 2, 'reg': 2}
2023-03-02 06:18:26,865 [nnabla][INFO]: batch size per gpu: 24
[Train] epoch:0/140||loss: -0.0000, hm_loss:245.3517, wh_loss: 28.8467, off_loss: 28.8467, lr:1.00e-04, scale:4.00e+00:   0%|  
[Train] epoch:0/140||loss: -0.0000, hm_loss:245.3517, wh_loss: 28.8467, off_loss: 28.8467, lr:1.00e-04, scale:4.00e+00:   0%|  
[Train] epoch:0/140||loss:299.5544, hm_loss:296.1249, wh_loss: 29.4914, off_loss: 29.4914, lr:1.00e-04, scale:4.00e+00:   0%|  
[Train] epoch:0/140||loss:299.5544, hm_loss:296.1249, wh_loss: 29.4914, off_loss: 29.4914, lr:1.00e-04, scale:4.00e+00:   0%|  
[Train] epoch:0/140||loss:     nan, hm_loss:     nan, wh_loss: 30.1704, off_loss: 30.1704, lr:1.00e-04, scale:4.00e+00:   0%|  
[Train] epoch:0/140||loss:     nan, hm_loss:     nan, wh_loss: 30.1704, off_loss: 30.1704, lr:1.00e-04, scale:4.00e+00:   0%|  
[Train] epoch:0/140||loss:     nan, hm_loss:     nan, wh_loss: 21.1151, off_loss: 21.1151, lr:1.00e-04, scale:4.00e+00:   0%|  
[Train] epoch:0/140||loss:     nan, hm_loss:     nan, wh_loss: 21.1151, off_loss: 21.1151, lr:1.00e-04, scale:4.00e+00:   0%|  
[Train] epoch:0/140||loss:     nan, hm_loss:     nan, wh_loss: 24.2714, off_loss: 24.2714, lr:1.00e-04, scale:4.00e+00:   0%|  
[Train] epoch:0/140||loss:     nan, hm_loss:     nan, wh_loss: 24.2714, off_loss: 24.2714, lr:1.00e-04, scale:4.00e+00:   0%|  
[Train] epoch:0/140||loss:     nan, hm_loss:     nan, wh_loss: 21.7357, off_loss: 21.7357, lr:1.00e-04, scale:4.00e+00:   0%|  
[Train] epoch:0/140||loss:     nan, hm_loss:     nan, wh_loss: 21.7357, off_loss: 21.7357, lr:1.00e-04, scale:4.00e+00:   0%|          | 6/4929 [00:06<1:33:43,  1.14s/it]^C

or

2023-03-02 05:47:38,953 [nnabla][INFO]: Using DataIterator
2023-03-02 05:47:38,959 [nnabla][INFO]: Creating model...
2023-03-02 05:47:38,959 [nnabla][INFO]: {'hm': 80, 'reg': 2, 'wh': 2}
2023-03-02 05:47:38,964 [nnabla][INFO]: batch size per gpu: 32
^M  0%|          | 0/3697 [00:00<?, ?it/s]^M  0%|          | 0/3697 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "nnabla-examples/object-detection/centernet/src/main.py", line 147, in <module>
    main(opt)
  File "nnabla-examples/object-detection/centernet/src/main.py", line 112, in main
    _ = trainer.update(epoch)
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 191, in update
    total_loss, hm_loss, wh_loss, off_loss = self.compute_gradient(
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
    return self.compute_gradient(data)
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
    return self.compute_gradient(data)
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
    return self.compute_gradient(data)
  [Previous line repeated 7 more times]
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 175, in compute_gradient
    raise RuntimeError(
RuntimeError: Something went wrong with gradient calculations.
--------------------------------------------------------------------------
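
The second traceback, with compute_gradient calling itself repeatedly before raising, is consistent with a dynamic loss-scaling retry: when the scaled gradients contain inf/NaN, the loss scale is reduced and the step is retried, and the trainer gives up after a bounded number of attempts. The following is only a minimal sketch of that general technique, not the actual code in ctdet.py. The function name, the `grad_fn` callback, the retry limit of 10, and the halving factor are all assumptions made for illustration.

```python
import math


def compute_gradient_with_retry(grad_fn, scale, max_retries=10, backoff=0.5):
    """Retry a gradient step with a smaller loss scale on overflow.

    grad_fn(scale) is a hypothetical callback that returns a list of
    gradient values computed from the loss multiplied by `scale`.
    Returns the unscaled gradients and the loss scale that worked.
    """
    for _ in range(max_retries + 1):
        grads = grad_fn(scale)
        if all(math.isfinite(g) for g in grads):
            # Undo the loss scaling before handing gradients to the solver.
            return [g / scale for g in grads], scale
        # Overflow (inf) or NaN detected: shrink the scale and try again.
        scale *= backoff
    # If every attempt still yields inf/NaN, scaling is not the cause.
    raise RuntimeError("Something went wrong with gradient calculations.")
```

If the gradients stay non-finite no matter how small the scale becomes, as happens when the NaN comes from the underlying kernels rather than from FP16 overflow, a loop like this exhausts its retries and raises, which matches the error shown above.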

How to solve

Using an image with a newer cuDNN version (8.4.0.27 instead of 8.0.5.39) solved this issue.

  • Use nnabla/nnabla-ext-cuda-multi-gpu:py310-cuda116-mpi3.1.6-v1.34.0 as the base image and install the necessary packages. (see https://github.com/sony/nnabla-examples/blob/master/object-detection/centernet/requirements.txt)
  • The CUDA/cuDNN versions in the above image are cuda=11.6.0, CUDNN_VERSION=8.4.0.27
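
To check which of the two cases a given container falls into, the cuDNN version can be compared against the two versions reported in this issue. The sketch below assumes the image exports the CUDNN_VERSION environment variable (as the values quoted above suggest). The helper names and the result labels are hypothetical, and versions between 8.0.5.39 and 8.4.0.27 were not tested in this report, so they are labeled "untested" rather than guessed. Treating versions newer than 8.4.0.27 as fine is likewise an assumption.

```python
import os

# Versions taken from this issue report.
REPORTED_FAILING = (8, 0, 5, 39)   # NaN loss / gradient error
REPORTED_WORKING = (8, 4, 0, 27)   # training runs normally


def parse_version(text):
    """Turn a dotted version string such as '8.0.5.39' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def classify_cudnn(version_text):
    """Classify a cuDNN version relative to the two reported data points."""
    version = parse_version(version_text)
    if version == REPORTED_FAILING:
        return "reported-failing"
    if version >= REPORTED_WORKING:
        # Assumption: versions newer than the working one behave the same.
        return "at-or-above-reported-working"
    return "untested"


if __name__ == "__main__":
    env_value = os.environ.get("CUDNN_VERSION")
    if env_value is None:
        print("CUDNN_VERSION is not set; cannot classify")
    else:
        print(env_value, "->", classify_cudnn(env_value))
```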

hyingho, Mar 09 '23 07:03