nnabla-examples
CenterNet mixed-precision training cannot work well with specific cuDNN versions
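CenterNet mixed-precision training fails with certain cuDNN versions: either the loss becomes NaN within the first few iterations, or gradient computation aborts with a RuntimeError.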
How to reproduce
- Revision: 859d9d4a
- Branch: master
- GPUs: V100
- Use nnabla/nnabla-ext-cuda-multi-gpu:py310-cuda110-mpi3.1.6-v1.34.0 as the base image and install the necessary packages (see https://github.com/sony/nnabla-examples/blob/master/object-detection/centernet/requirements.txt).
- The CUDA/cuDNN version of the above image is cuda=11.0.3, CUDNN_VERSION=8.0.5.39.
- Run the following command:
python src/main.py ctdet --config_file=cfg/resnet_18_coco_mp.yaml --data_dir path_to_coco_dataset
Error messages
2023-03-02 06:18:26,839 [nnabla][INFO]: Using DataIterator
2023-03-02 06:18:26,865 [nnabla][INFO]: Creating model...
2023-03-02 06:18:26,865 [nnabla][INFO]: {'hm': 80, 'wh': 2, 'reg': 2}
2023-03-02 06:18:26,865 [nnabla][INFO]: batch size per gpu: 24
[Train] epoch:0/140||loss: -0.0000, hm_loss:245.3517, wh_loss: 28.8467, off_loss: 28.8467, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: -0.0000, hm_loss:245.3517, wh_loss: 28.8467, off_loss: 28.8467, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss:299.5544, hm_loss:296.1249, wh_loss: 29.4914, off_loss: 29.4914, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss:299.5544, hm_loss:296.1249, wh_loss: 29.4914, off_loss: 29.4914, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 30.1704, off_loss: 30.1704, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 30.1704, off_loss: 30.1704, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 21.1151, off_loss: 21.1151, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 21.1151, off_loss: 21.1151, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 24.2714, off_loss: 24.2714, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 24.2714, off_loss: 24.2714, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 21.7357, off_loss: 21.7357, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 21.7357, off_loss: 21.7357, lr:1.00e-04, scale:4.00e+00: 0%| | 6/4929 [00:06<1:33:43, 1.14s/it]^C
or
2023-03-02 05:47:38,953 [nnabla][INFO]: Using DataIterator
2023-03-02 05:47:38,959 [nnabla][INFO]: Creating model...
2023-03-02 05:47:38,959 [nnabla][INFO]: {'hm': 80, 'reg': 2, 'wh': 2}
2023-03-02 05:47:38,964 [nnabla][INFO]: batch size per gpu: 32
^M 0%| | 0/3697 [00:00<?, ?it/s]^M 0%| | 0/3697 [00:04<?, ?it/s]
Traceback (most recent call last):
File "nnabla-examples/object-detection/centernet/src/main.py", line 147, in <module>
main(opt)
File "nnabla-examples/object-detection/centernet/src/main.py", line 112, in main
_ = trainer.update(epoch)
File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 191, in update
total_loss, hm_loss, wh_loss, off_loss = self.compute_gradient(
File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
return self.compute_gradient(data)
File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
return self.compute_gradient(data)
File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
return self.compute_gradient(data)
[Previous line repeated 7 more times]
File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 175, in compute_gradient
raise RuntimeError(
RuntimeError: Something went wrong with gradient calculations.
--------------------------------------------------------------------------
How to solve
Using a newer cuDNN version solved this issue.
- Use nnabla/nnabla-ext-cuda-multi-gpu:py310-cuda116-mpi3.1.6-v1.34.0 as the base image and install the necessary packages (see https://github.com/sony/nnabla-examples/blob/master/object-detection/centernet/requirements.txt).
- The CUDA/cuDNN version of the above image is cuda=11.6.0, CUDNN_VERSION=8.4.0.27.
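
To catch this before a long training run, a small pre-flight check could compare the container's cuDNN version against the two versions reported here. The following is only a sketch under assumptions: it assumes the image exposes the version through a CUDNN_VERSION environment variable (as the values quoted above suggest), and the helper names parse_version and classify_cudnn are hypothetical. Only 8.0.5.39 (failing) and 8.4.0.27 (working) were actually tested in this report, so versions in between are reported as untested rather than guessed, and treating anything older or newer than the two tested versions the same way is itself an assumption.

```python
import os


def parse_version(text):
    """Turn a dotted version string such as '8.0.5.39' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# The only two data points from this report. The exact version where the
# behaviour changes is unknown.
KNOWN_BAD = parse_version("8.0.5.39")
KNOWN_GOOD = parse_version("8.4.0.27")


def classify_cudnn(version_text):
    """Classify a cuDNN version relative to the versions in this report.

    Assumption: versions at or above the working one behave like it, and
    versions at or below the failing one behave like it. Neither was verified.
    """
    version = parse_version(version_text)
    if version >= KNOWN_GOOD:
        return "ok"
    if version <= KNOWN_BAD:
        return "affected"
    return "untested"


if __name__ == "__main__":
    # CUDNN_VERSION is assumed to be set by the base image; it may be absent.
    env_value = os.environ.get("CUDNN_VERSION")
    if env_value is None:
        print("CUDNN_VERSION is not set; cannot check the cuDNN version.")
    else:
        print(f"cuDNN {env_value}: {classify_cudnn(env_value)}")
```

Running this inside the container before starting src/main.py gives an early hint as to whether the mixed-precision config is likely to hit the NaN loss described above.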