
CUDNN error at training iteration 1000 when calculating mAP%

Open stephanecharette opened this issue 2 years ago • 24 comments

Upgraded my Ubuntu 20.04 training rig to install the latest patches. This included a new version of CUDNN. I am now using CUDA 11.7.1-1 and CUDNN 8.5.0.96-1+cuda11.7. Darknet is at the latest version from 2022-08-16:

> git log -1
commit 96f08de6839eb1c125c7b86bffe1d3dde9570e5b (HEAD -> master, origin/master, origin/HEAD)
Author: Stefano Sinigardi <[email protected]>
Date:   Tue Aug 16 20:20:48 2022 +0200

All of my existing neural networks fail to train. Some are YOLOv4-tiny, others are YOLOv4-tiny-3L. The training rig is an NVIDIA 3090 with 24 GB of VRAM, and the networks fit well within VRAM. When darknet gets to iteration 1000 in training, where it does the first mAP calculation, it produces this error:

 (next mAP calculation at 1000 iterations) 
 1000: 1.540665, 2.618338 avg loss, 0.002600 rate, 1.743389 seconds, 64000 images, 2.605252 hours left
Darknet error location: ./src/dark_cuda.c, cudnn_check_error, line #204
cuDNN Error: CUDNN_STATUS_BAD_PARAM: Success

 calculation mAP (mean average precision)...
 Detection layer: 30 - type = 28 
 Detection layer: 37 - type = 28 
 Detection layer: 44 - type = 28 

 cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 543 : build time: Sep 13 2022 - 17:44:16 

 cuDNN Error: CUDNN_STATUS_BAD_PARAM
Command exited with non-zero status 1

The only important thing I can think of which changed today is that I installed the latest version of CUDNN8. This is the relevant portion of the upgrade log:

Preparing to unpack .../04-libcudnn8-dev_8.5.0.96-1+cuda11.7_amd64.deb ...
update-alternatives: removing manually selected alternative - switching libcudnn to auto mode
Unpacking libcudnn8-dev (8.5.0.96-1+cuda11.7) over (8.4.1.50-1+cuda11.6) ...
Preparing to unpack .../05-libcudnn8_8.5.0.96-1+cuda11.7_amd64.deb ...
Unpacking libcudnn8 (8.5.0.96-1+cuda11.7) over (8.4.1.50-1+cuda11.6) ...

Curious to know if anyone else has a problem with CUDNN 8.5.0.96, or has an idea as to how to fix this problem.

stephanecharette avatar Sep 14 '22 02:09 stephanecharette

Downgraded CUDNN from 8.5.0 back to 8.4.1.50. Training works again. This is the command I used to downgrade:

sudo apt-get install libcudnn8-dev=8.4.1.50-1+cuda11.6 libcudnn8=8.4.1.50-1+cuda11.6

stephanecharette avatar Sep 14 '22 03:09 stephanecharette

The latest version of cudnn always has various bugs

1027663760 avatar Sep 14 '22 07:09 1027663760

Modify copy_weights_net(...) in network.c:

void copy_weights_net(network net_train, network* net_map)
{
    int k;

    for (k = 0; k < net_train.n; ++k)
    {
        layer* l = &(net_train.layers[k]);
        layer tmp_layer;

        copy_cudnn_descriptors(net_train.layers[k], &tmp_layer);
        net_map->layers[k] = net_train.layers[k];
        copy_cudnn_descriptors(tmp_layer, &net_train.layers[k]);

        if (l->type == CRNN)
        {
            layer tmp_input_layer, tmp_self_layer, tmp_output_layer;

            copy_cudnn_descriptors(*net_train.layers[k].input_layer, &tmp_input_layer);
            copy_cudnn_descriptors(*net_train.layers[k].self_layer, &tmp_self_layer);
            copy_cudnn_descriptors(*net_train.layers[k].output_layer, &tmp_output_layer);
            net_map->layers[k].input_layer = net_train.layers[k].input_layer;
            net_map->layers[k].self_layer = net_train.layers[k].self_layer;
            net_map->layers[k].output_layer = net_train.layers[k].output_layer;
            //net_map->layers[k].output_gpu = net_map->layers[k].output_layer->output_gpu;  // already copied out of if()

            copy_cudnn_descriptors(tmp_input_layer, net_train.layers[k].input_layer);
            copy_cudnn_descriptors(tmp_self_layer, net_train.layers[k].self_layer);
            copy_cudnn_descriptors(tmp_output_layer, net_train.layers[k].output_layer);
        }
        else if (l->input_layer) // for AntiAliasing
        {
            layer tmp_input_layer;

            copy_cudnn_descriptors(*net_train.layers[k].input_layer, &tmp_input_layer);
            net_map->layers[k].input_layer = net_train.layers[k].input_layer;
            copy_cudnn_descriptors(tmp_input_layer, net_train.layers[k].input_layer);
        }

        net_map->layers[k].batch = 1;
        net_map->layers[k].steps = 1;
        net_map->layers[k].train = 0;
    }
}
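
For comparison, the key difference from the upstream copy_weights_net() is the destination of the descriptor-restore calls: this version restores the saved cuDNN descriptors into net_train rather than net_map. A minimal before/after sketch of the first restore call, assuming the upstream code reads as shown in the commented line (the CRNN and AntiAliasing branches get the same treatment):

    copy_cudnn_descriptors(net_train.layers[k], &tmp_layer);   // save the training layer's cuDNN descriptors
    net_map->layers[k] = net_train.layers[k];                   // struct-copy the trained layer into the mAP network
    //copy_cudnn_descriptors(tmp_layer, &net_map->layers[k]);   // upstream: restore into the mAP copy
    copy_cudnn_descriptors(tmp_layer, &net_train.layers[k]);    // this patch: restore into the training network instead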

chgoatherd avatar Sep 16 '22 04:09 chgoatherd

Please refer to issue #8667.

chgoatherd avatar Sep 16 '22 04:09 chgoatherd

Tried to use libcudnn8 8.6.0.163 today with CUDA 11.8. The same problem still exists; it aborts when it hits iteration #1000. Used the command in the comment above and downgraded to libcudnn8 8.4.1.50, and the problem went away. This needs to be fixed...

https://github.com/AlexeyAB/darknet/issues/8669#issuecomment-1246194925

stephanecharette avatar Oct 23 '22 05:10 stephanecharette

@AlexeyAB do you have thoughts on the fix for this? Do you need a pull request for @chgoatherd's proposed changes, or is this going down the wrong path?

stephanecharette avatar Oct 23 '22 05:10 stephanecharette

  • Environment: container - nvidia/cuda:11.3.1-cudnn8-devel-centos7
  • Branch:
Author: Stéphane Charette <[email protected]>
Date:   Wed Sep 21 04:03:47 2022 -0700

    Make sure best.weights is the most recent weights for a given mAP% (#8670)
    
    * issue #8308: memory leaks in map
    
    * update the window title with some training stats
    
    * make sure _best.weights is the most recent weights with that mAP%
  • GPU: NVIDIA Tesla V100
  • Command: ./darknet detector train data/obj.data cfg/yolov4-obj.cfg yolov4.conv.137 -dont_show -map

Same problem as this issue, but it was solved after applying @chgoatherd's suggestion + changing subdivision=16 → 32.
(Changing the subdivision value is not related to this issue, but to a CUDA OOM error.)

ryj0902 avatar Oct 25 '22 04:10 ryj0902

I got the same error here at 1000 iterations. At first I just left the -map option out, and training went well after 1000 iterations.

Later, I downgraded cuDNN and it worked. In my case I'm using CUDA 11.2, with a container:

  • GPU: 1050 Ti 4GB
  • Docker image: nvidia/cuda:11.2.0-devel-ubuntu20.04
  • Cudnn packages: libcudnn8=8.1.1.33-1+cuda11.2 and libcudnn8-dev=8.1.1.33-1+cuda11.2

nailsonlinux avatar Oct 30 '22 01:10 nailsonlinux

I got the same error, Docker

hnothing2016 avatar Oct 30 '22 14:10 hnothing2016

Same problem.

  • OS: Ubuntu 20.04
  • GPU: RTX A6000 46GB
  • Nvidia driver: 515.65.01
  • Makefile: GPU=1, CUDNN=1, CUDNN_HALF=1, OPENCV=1, OPENMP=1, LIBSO=1
  • ARCH= -gencode arch=compute_86,code=[sm_86,compute_86]

CUDA Toolkit 11.[7-8] + cuDNN 8.[6-7-8] works only if you use subdivision=batch=64, or, when subdivision is smaller than batch, if you remove the "-map" parameter from the darknet training command. Then I followed stephanecharette's instructions and downgraded to CUDA Toolkit 11.6 + cuDNN 8.4.1, and now everything works great with "-map", even when decreasing the subdivision value down to 8.

mari9myr avatar Nov 17 '22 08:11 mari9myr

@chgoatherd your solution and then a rebuild fixed my issues when using -map with CUDA 11.x on a 3090 as well. Solid. You should create a pull request and get that merged in.

jackneil avatar Jan 23 '23 12:01 jackneil

@chgoatherd Thanks a lot. I hit the same problem and your solution helped me solve it. I changed network.c and recompiled darknet with vcpkg, CUDA v11.8, and CUDNN v8.6 on Windows 11. Now everything works fine.

avkwok avatar Feb 21 '23 08:02 avkwok

I've made the changes that @chgoatherd listed above, switching out net_map->... for net_train... in that function. But I'm still seeing the same error when it attempts to calculate the mAP at iteration 1000.

stephanecharette avatar May 06 '23 01:05 stephanecharette

I'm using Ubuntu 20.04.6, CUDA 12.1.105-1, and CUDNN 8.9.1.23-1+cuda12.1. With the changes to network.c from @chgoatherd listed above from 2022-09-15, the error looks like this:

 calculation mAP (mean average precision)...
 Detection layer: 30 - type = 28 
 Detection layer: 37 - type = 28 
CUDA status Error: file: ./src/network_kernels.cu: func: network_predict_gpu() line: 735

 CUDA Error: an illegal memory access was encountered
Darknet error location: ./src/network_kernels.cu, network_predict_gpu(), line #735
CUDA Error: an illegal memory access was encountered: Success
backtrace (11 entries)
1/11: darknet(log_backtrace+0x38) [0x562adf9d1dd8]
2/11: darknet(error+0x3d) [0x562adf9d1ebd]
3/11: darknet(check_error+0xd0) [0x562adf9d4eb0]
4/11: darknet(check_error_extended+0x7c) [0x562adf9d4f9c]
5/11: darknet(network_predict_gpu+0x15f) [0x562adfad509f]
6/11: darknet(validate_detector_map+0x9ad) [0x562adfa64f6d]
7/11: darknet(train_detector+0x16a4) [0x562adfa67ca4]
8/11: darknet(run_detector+0x897) [0x562adfa6bc57]
9/11: darknet(main+0x34d) [0x562adf98663d]
10/11: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f433e662083]
11/11: darknet(_start+0x2e) [0x562adf9888be]

Without the changes to network.c, the error looks like this:

 calculation mAP (mean average precision)...
 Detection layer: 30 - type = 28 
 Detection layer: 37 - type = 28 
 cuDNN status Error in: file: ./src/convolutional_kernels.cu function: forward_convolutional_layer_gpu() line: 543

 cuDNN Error: CUDNN_STATUS_BAD_PARAM
Darknet error location: ./src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #543
cuDNN Error: CUDNN_STATUS_BAD_PARAM: Success
backtrace (13 entries)
1/13: darknet(log_backtrace+0x38) [0x5588eb21bdd8]
2/13: darknet(error+0x3d) [0x5588eb21bebd]
3/13: darknet(+0x8bd40) [0x5588eb21ed40]
4/13: darknet(cudnn_check_error_extended+0x7c) [0x5588eb21f2fc]
5/13: darknet(forward_convolutional_layer_gpu+0x2c2) [0x5588eb307802]
6/13: darknet(forward_network_gpu+0x101) [0x5588eb31c281]
7/13: darknet(network_predict_gpu+0x131) [0x5588eb31f0a1]
8/13: darknet(validate_detector_map+0x9ad) [0x5588eb2aef9d]
9/13: darknet(train_detector+0x16a4) [0x5588eb2b1cd4]
10/13: darknet(run_detector+0x897) [0x5588eb2b5c87]
11/13: darknet(main+0x34d) [0x5588eb1d063d]
12/13: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fb67bcba083]
13/13: darknet(_start+0x2e) [0x5588eb1d28be]

So the call stack and the error message from CUDA/CUDNN are not exactly the same. I think there are multiple issues, and the changes from above expose the next problem.

IMPORTANT

People looking for a quick workaround for this issue, especially if training on hardware you don't own, like Google Colab, where it is complicated to downgrade CUDA/CUDNN:

  • edit Darknet's Makefile
  • set CUDNN=0
  • set CUDNN_HALF=0
  • rebuild Darknet

This is not ideal, but will get you past the problem until a solution is found.

stephanecharette avatar May 07 '23 00:05 stephanecharette

The release notes for CUDNN v8.5.0 -- where the problem started -- contain this text:

A buffer was shared between threads and caused segmentation faults. There was previously no way to have a per-thread buffer to avoid these segmentation faults. The buffer has been moved to the cuDNN handle. Ensure you have a cuDNN handle for each thread because the buffer in the cuDNN handle is only for the use of one thread and cannot be shared between two threads.

This sounds like a possible issue. I believe the cudnn handle is initialized in dark_cuda.c, and it looks like it is a global variable shared between all threads. See the two calls to cudnnCreate(), as well as the variables cudnnInit, cudnnHandle, switchCudnnInit and switchCudnnhandle.
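
If that is indeed the cause, a per-thread handle is the kind of change the release note asks for. A minimal sketch of the idea, assuming a GCC-style __thread variable; this is hypothetical and does not match the actual variable names, init logic, or error handling in dark_cuda.c:

    #include <cudnn.h>

    // Hypothetical per-thread cuDNN handle: each thread lazily creates and keeps
    // its own handle instead of sharing one global cudnnHandle_t across threads,
    // as the cuDNN 8.5.0 release note requires.
    static __thread cudnnHandle_t per_thread_cudnn_handle;
    static __thread int per_thread_cudnn_init = 0;

    cudnnHandle_t cudnn_handle_for_this_thread(void)
    {
        if (!per_thread_cudnn_init)
        {
            cudnnCreate(&per_thread_cudnn_handle);   // error checking omitted in this sketch
            per_thread_cudnn_init = 1;
        }
        return per_thread_cudnn_handle;
    }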

stephanecharette avatar May 07 '23 03:05 stephanecharette

Until a proper solution is found, this is still the workaround I employ on my training rigs:

sudo apt-get install libcudnn8-dev=8.4.1.50-1+cuda11.6 libcudnn8=8.4.1.50-1+cuda11.6
sudo apt-mark hold libcudnn8-dev
sudo apt-mark hold libcudnn8

As stated 2 comments above, another possible workaround is to disable CUDNN in the Darknet Makefile.

stephanecharette avatar May 08 '23 18:05 stephanecharette

I made the change @chgoatherd suggested above, and it seems to work on Ubuntu 22.04.2 LTS with CUDA 11.7 + CUDNN 8.9.0

avmusat avatar May 09 '23 09:05 avmusat

Unfortunately, several (though not all) of my neural networks still cause the error even with those changes.

stephanecharette avatar May 09 '23 09:05 stephanecharette

Wondering if the fixes made here might finally solve this issue: https://github.com/hank-ai/darknet/commit/1ea2baf0795c22804e1ef69ddc1d7b1e73d80b0d

stephanecharette avatar Jun 25 '23 22:06 stephanecharette

Preliminary tests show that this appears to have been fixed by that commit. See the Hank.ai Darknet repo. https://github.com/hank-ai/darknet/commit/1ea2baf0795c22804e1ef69ddc1d7b1e73d80b0d

stephanecharette avatar Jun 26 '23 04:06 stephanecharette

Thanks for sharing it @stephanecharette!

nailsonlinux avatar Jun 26 '23 05:06 nailsonlinux

How do I solve this problem?

asyilanrftr avatar Oct 17 '23 04:10 asyilanrftr

I have the same error as you on an RTX 3060; an RTX 2070 Super works normally. Have you fixed it yet?

xxtkidxx avatar Mar 11 '24 07:03 xxtkidxx

Yes, this is fixed in the new Darknet/YOLO repo: https://github.com/hank-ai/darknet#table-of-contents

stephanecharette avatar Mar 11 '24 08:03 stephanecharette