
darknet crashes when calculating mAP% at iteration #1000

Open stephanecharette opened this issue 2 years ago • 8 comments

User "cmorzy" reported today that they're still seeing the error/crash when Darknet reaches iteration #1000. A copy of the dataset, .names, and .cfg is available.

The exact message they're seeing is:

* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* A fatal error has been detected.  Darknet will now exit.
* Error location: ./src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #546
* Error message:  cuDNN current error: status=3, CUDNN_STATUS_BAD_PARAM
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
backtrace (13 entries):
1/13: ./darknet(log_backtrace+0x38) [0x560b3fb79128]
2/13: ./darknet(darknet_fatal_error+0x19d) [0x560b3fb7936d]
3/13: ./darknet(cudnn_check_error_extended+0x83) [0x560b3fb7bf83]
4/13: ./darknet(forward_convolutional_layer_gpu+0x2c5) [0x560b3fc56985]
5/13: ./darknet(forward_network_gpu+0xe1) [0x560b3fc6af81]
6/13: ./darknet(network_predict_gpu+0x140) [0x560b3fc6d800]
7/13: ./darknet(validate_detector_map+0xa49) [0x560b3fc02f29]
8/13: ./darknet(train_detector+0x1ce0) [0x560b3fc05f70]
9/13: ./darknet(run_detector+0x9f6) [0x560b3fc09996]
10/13: ./darknet(main+0x4b3) [0x560b3fb308b3]
11/13: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6ed5bd7d90]
12/13: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f6ed5bd7e40]
13/13: ./darknet(_start+0x25) [0x560b3fb32b25]
Segmentation fault (core dumped)

stephanecharette avatar Jul 17 '23 17:07 stephanecharette

This is a continuation of https://github.com/AlexeyAB/darknet/issues/8669

stephanecharette avatar Jul 17 '23 17:07 stephanecharette

Using:

libcudnn8=8.5.0.96-1+cuda11.7 libcudnn8-dev=8.5.0.96-1+cuda11.7

But the crash was also reproduced using 8.9.3.28-1+cuda11.8.
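
For reference, the installed cuDNN packages can be listed on an apt-based system with:

# list the libcudnn packages currently installed via apt/dpkg
dpkg -l | grep -i cudnn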

chrislytras avatar Jul 17 '23 17:07 chrislytras

Me too:

Ubuntu 22.04.3

libcudnn8=8.9.4.25-1+cuda12.2 libcudnn8-dev=8.9.4.25-1+cuda12.2

... -> next mAP calculation will be at iteration #1000
Tensor Cores are disabled until iteration #3000.
1000: loss=4.558, avg loss=4.317, rate=0.001000, 103.801 milliseconds, 32000 images, time remaining=30 hours

calculating mAP (mean average precision)...
Detection layer #30 is type 28 (yolo)
Detection layer #37 is type 28 (yolo)
using 4 threads to load 420 validation images for mAP% calculations
processing #0 (0%)
cuDNN status error in /home/user/src/darknet/src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #554

* A fatal error has been detected.  Darknet will now exit.
* Errno 2: No such file or directory
* Error location: /home/user/src/darknet/src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #554
* Error message:  cuDNN current error: status=3, CUDNN_STATUS_BAD_PARAM
* Version v2.0-4-g7d84f744 built on Sep 8 2023 09:13:21

backtrace (13 entries):
1/13: darknet(_Z13log_backtracev+0x38) [0x55b121550ce8]
2/13: darknet(darknet_fatal_error+0x1bd) [0x55b121550f4d]
3/13: darknet(cudnn_check_error_extended+0x83) [0x55b1214982b3]
4/13: darknet(forward_convolutional_layer_gpu+0x2d5) [0x55b12148bce5]
5/13: darknet(forward_network_gpu+0xe1) [0x55b12152b9d1]
6/13: darknet(network_predict_gpu+0x140) [0x55b12152e660]
7/13: darknet(validate_detector_map+0xa06) [0x55b1214afa56]
8/13: darknet(train_detector+0x1475) [0x55b1214b2185]
9/13: darknet(_Z12run_detectoriPPc+0xa85) [0x55b1214b60f5]
10/13: darknet(main+0x4a1) [0x55b1214454e1]
11/13: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6dd2e29d90]
12/13: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f6dd2e29e40]
13/13: darknet(_start+0x25) [0x55b121447ef5]
Command exited with non-zero status 1

sinyb avatar Sep 08 '23 10:09 sinyb

You probably know this, but it usually works if you set subdivisions to 64. That just leaves a lot of wasted memory on the card and quadruples training time. Thanks for trying to fix this; it has been the biggest pain in the ass with darknet for the last two years. I gave up and wrote bash scripts to stop training, run mAP, post the results online, and start training again. It would be nice to get in-training mAP working well.
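
Roughly this kind of thing, just to illustrate the idea; the data/cfg/backup paths are placeholders, the real scripts also post the results somewhere, and this simplified version just evaluates the latest checkpoint in a loop while training runs separately:

#!/usr/bin/env bash
# Sketch: run the standalone "darknet detector map" command against the most
# recent checkpoint instead of relying on the in-training mAP calculation.
# All paths and filenames below are placeholders.
DATA=obj.data
CFG=yolov4-tiny.cfg
BACKUP=backup            # darknet writes *_last.weights here during training

while true; do
    ./darknet detector map "$DATA" "$CFG" "$BACKUP"/*_last.weights \
        2>&1 | tee -a map_history.log
    sleep 3600           # re-check once an hour
done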

kdill00 avatar Sep 18 '23 10:09 kdill00

Just to note, I have tried and experienced this on CUDA 11.4 through 12.2 over the last year and a half, with all kinds of datasets. A smaller training resolution and higher subdivisions will let it work most of the time, but as in the previous post, that increases training time too much.

kdill00 avatar Sep 18 '23 10:09 kdill00

> Just to note, I have tried and experienced this on CUDA 11.4 through 12.2 over the last year and a half, with all kinds of datasets. A smaller training resolution and higher subdivisions will let it work most of the time, but as in the previous post, that increases training time too much.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8_8.4.1.50-1+cuda11.6_amd64.deb
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8-dev_8.4.1.50-1+cuda11.6_amd64.deb
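
Then install the downloaded packages (this assumes an Ubuntu/apt setup):

# Install the downgraded cuDNN runtime and dev packages fetched above.
sudo apt-get install ./libcudnn8_8.4.1.50-1+cuda11.6_amd64.deb \
                     ./libcudnn8-dev_8.4.1.50-1+cuda11.6_amd64.deb

# Optionally hold them so a later "apt upgrade" does not pull a newer cuDNN back in.
sudo apt-mark hold libcudnn8 libcudnn8-dev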

Downgrading to cuDNN 8.4.1 should do it.

chrislytras avatar Sep 18 '23 10:09 chrislytras

I'll give it a try, thank you.

kdill00 avatar Sep 22 '23 02:09 kdill00

When this error occurs, the config has [net] burn_in=1000.

If you set that value to 800, the same error occurs at iteration #800; set it to 100 and the result is the same.

But if you set subdivisions to a value that is not a power of two, such as 6 or 10, the error does not occur. I think it is a problem with the result of an internal multiplication or division. Whether the burn-in value triggers the error may also depend on the number of training files or other factors.
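
For reference, the [net] settings being discussed look like this in the .cfg file (the values here are just an example, not a recommendation):

[net]
batch=64
subdivisions=6      # a non-power-of-two value such as 6 or 10 avoided the crash here
width=416
height=416
burn_in=1000        # the iteration at which the crash appears follows this value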

suminoshi avatar Oct 02 '23 03:10 suminoshi

Just as a side note: some people recommended setting the minibatch to 64 in order to avoid this problem, and that does work; however, take into consideration that with the minibatch set to 64 the model can overfit the training data more easily. This is particularly a concern if your dataset isn't very large or diverse.

Rares926 avatar Sep 12 '24 11:09 Rares926

The problem was solved long ago with the new Darknet repo. Please use https://github.com/hank-ai/darknet as that repo is maintained and correctly solves this issue.

stephanecharette avatar Sep 12 '24 12:09 stephanecharette

This memory issue should be fixed in V2.

In V3, the fix was modified for performance reasons. If this problem comes back in V3, please see the comment block in cudnn_convolutional_setup() within convolutional_layer.cpp. Specifically, the fix is where the variable compu_capability_ver gets used.

Instead of using the major/minor version numbers, perhaps we should be better at calculating memory usage and deciding which algorithm to include?

stephanecharette avatar Sep 21 '24 21:09 stephanecharette