CUDNN_STATUS_BAD_PARAM
When I try to use the command detector train ... -map, training always crashes.
However, if I use detector map ..., it works fine.
Configuration: CUDA 10.2, cuDNN 8.0, RTX 2080 Ti, VS 2019
I met the same error. I downgraded my darknet version.
(next mAP calculation at 1103 iterations)
1103: 0.438374, 0.477862 avg loss, 0.001000 rate, 0.383500 seconds, 70592 images, 54.548955 hours left
Resizing to initial size: 416 x 416
 try to allocate additional workspace_size = 287.90 MB
CUDA allocate done!
calculation mAP (mean average precision)...
Detection layer: 16 - type = 28
Detection layer: 23 - type = 28
4
cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 545 : build time: Oct 19 2020 - 15:51:06
cuDNN Error: CUDNN_STATUS_BAD_PARAM
cuDNN Error: CUDNN_STATUS_BAD_PARAM: Resource temporarily unavailable
darknet: ./src/utils.c:325: error: Assertion `0' failed.
./exp1_training.sh: line 1: 3947 Aborted (core dumped) ./darknet detector train data/exp1.data cfg/exp1.cfg yolov3-tiny.conv.15 -map
Same problem here. It randomly halted at around 1000 iterations, with only one training process running. GTX 1080 Ti + CUDA 11.1 + cuDNN 8.0.4.30 + NVIDIA driver 455.23.05 + Ubuntu 18.04. I also tried CUDA 10.2 with the same cuDNN, but the problem remains. I used nvidia-smi to monitor the training, and I'm sure it's not an OOM or GPU-overheating problem.
Currently my workaround is to disable the CUDNN flag in the Makefile and recompile, but training really gets slower this way, especially when running two training processes.
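For anyone wanting to try the same workaround, a minimal sketch of the Makefile change, based on the flag block quoted later in this thread (edit the flags at the top of the darknet Makefile, then rebuild with make clean && make):
# With CUDNN=0 darknet falls back to its own CUDA convolution kernels
# (slower, but avoids the cuDNN call that fails here)
GPU=1
CUDNN=0
OPENCV=1
OPENMP=1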
I'm using CUDA 10.2, cuDNN 8.0.4, Ubuntu 18.04, and an RTX 2080 Ti. In my case the error appeared with CUDA 11 rather than 10.2.
I am facing the same problem. I am following the tutorial on training YOLOv4 on Pascal VOC. Everything seemed to be working during training, but it crashed when trying to calculate the mAP after 1034 iterations:
calculation mAP (mean average precision)...
Detection layer: 139 - type = 28
Detection layer: 150 - type = 28
Detection layer: 161 - type = 28
4
cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 545 : build time: Nov 2 2020 - 20:17:33
cuDNN Error: CUDNN_STATUS_BAD_PARAM
cuDNN Error: CUDNN_STATUS_BAD_PARAM: Success
darknet: ./src/utils.c:325: error: Assertion `0' failed.
Aborted (core dumped)
@cayman1021 Did you try downgrading CUDA as suggested by @Hwijune?
Some information regarding my system:
- Machine: MS Azure VM (Standard NC12, 12 vCPUs, 112 GiB memory, 2x Tesla K80 with 12 GiB)
- OS: Ubuntu 16.04
- CUDA: 11.1
- cuDNN: 8.0.4
- Makefile configuration:
GPU=1
CUDNN=1
OPENCV=1
OPENMP=1
...
ARCH= -gencode arch=compute_37,code=sm_37
...
- Configuration file (adapted from yolov4-custom.cfg):
(edit: fixed the number of subdivisions, it was 32)
[net]
batch=64
subdivisions=32
width=416
height=416
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.001
burn_in=1000
max_batches = 40000
policy=steps
steps=32000,36000
scales=.1,.1
...
I somehow solved my issue by revising the following code in detector.c:
//mean_average_precision = validate_detector_map(datacfg, cfgfile, weightfile, 0.25, 0.5, 0, net.letter_box, &net_map);
mean_average_precision = validate_detector_map(datacfg, cfgfile, weightfile, 0.25, 0.5, 0, net.letter_box, NULL);
Hope this can work well in your situation.
Have a good day!
Thanks, @lostagex, that change did work for me. =)
However, I noticed that the validation step is taking place on the CPU, so it is absurdly slow. And even though I compiled it with OPENMP=1, darknet is using only one core. Do you see the same behavior on your system?
The same issue for me. It is really slow; it takes me nearly three days to finish the training process (15,000 iterations).
@lostagex I think I figured out the problem. After reverting the changes I made to src/detector.c, I tried downgrading CUDA from version 11.1 to 10.2 as suggested by @Hwijune, but that didn't solve the problem; darknet behaved the same regardless of the CUDA version.
I noticed that although it was crashing, it was still able to process 4 images, so I decided to increase the number of subdivisions from 32 to 64. The beginning of my cfg file is:
[net]
batch=64
subdivisions=64
width=608
height=608
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.001
burn_in=1000
max_batches = 40000
policy=steps
steps=32000,36000
scales=.1,.1
mosaic=1
...
That change allowed the mAP to be calculated on the GPU, which obviously made everything faster. I am training on two GPUs, but the mAP calculation uses only one of them. I also observed that darknet now uses less GPU memory during training (due to the higher number of subdivisions), so I don't know how much that will impact performance.
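For anyone tuning this, a quick sketch of the mini-batch arithmetic involved, assuming darknet's usual batch/subdivisions relationship (the same mini_batch value darknet prints at startup):
mini_batch = batch / subdivisions
batch=64, subdivisions=32  ->  2 images per forward/backward pass on the GPU
batch=64, subdivisions=64  ->  1 image per forward/backward pass on the GPU
So doubling the subdivisions roughly halves the per-pass GPU memory footprint, which matches the lower memory usage observed above.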
OK, thanks a lot, @silasalves. I will try increasing the subdivisions. I hope the performance will not disappoint us.
The workaround mean_average_precision = validate_detector_map(datacfg, cfgfile, weightfile, 0.25, 0.5, 0, net.letter_box, NULL); failed for me.
./darknet detector test ./cfg/coco.data cfg/yolov4-tiny.cfg yolov4-tiny.weights data/dog.jpg
CUDA-version: 10010 (10010), cuDNN: 7.6.5, GPU count: 2
OpenCV version: 3.2.0
0 : compute_capability = 750, cudnn_half = 0, GPU: GeForce RTX 2080 Ti
net.optimized_memory = 0
mini_batch = 1, batch = 1, time_steps = 1, train = 0
layer   filters  size/strd(dil)      input                output
0 cuDNN status Error in: file: ./src/dark_cuda.c : () : line: 171 : build time: Dec 11 2020 - 20:25:27
cuDNN Error: CUDNN_STATUS_BAD_PARAM
cuDNN Error: CUDNN_STATUS_BAD_PARAM: File exists
darknet: ./src/utils.c:325: error: Assertion `0' failed.
[1] 105369 abort (core dumped) ./darknet detector test ./cfg/coco.data cfg/yolov4-tiny.cfg data/dog.jpg
I have the same problem (training crashes with a cuDNN error when calculating the mAP). It started happening as soon as I had more than one class. I spent hours trying to downgrade CUDA and cuDNN and finally came back to my initial setup (CUDA 11.1, cuDNN 8.0.5.39). I solved it by removing the -map flag from the command. It is now training perfectly, but it is not validating the model against the 20% of images listed in my test.txt file. That is not a big problem for me, since the mAP value depends a lot on the selection of test images. I don't know whether the issue comes from darknet or cuDNN; there might be a fix in downgrading one or the other.
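For illustration, using the training command from the log earlier in this thread as an example (that poster's data/cfg/weights paths, so substitute your own), dropping the flag just means running:
./darknet detector train data/exp1.data cfg/exp1.cfg yolov3-tiny.conv.15
The mAP can still be computed separately afterwards on saved weights with the ./darknet detector map command, which the original poster reported works without crashing.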
Is there any docker image available? Many thanks!
When you downgraded your darknet version, did it solve your issue?