CUDNN_STATUS_BAD_PARAM
When I try to use the command detector train ... -map, training always crashes.
However, if I use detector map ..., it works fine.
Configuration: CUDA 10.2, cuDNN 8.0, RTX 2080 Ti, VS 2019
I met the same error. I downgraded my darknet version.
(next mAP calculation at 1103 iterations)
1103: 0.438374, 0.477862 avg loss, 0.001000 rate, 0.383500 seconds, 70592 images, 54.548955 hours left
Resizing to initial size: 416 x 416
 try to allocate additional workspace_size = 287.90 MB
CUDA allocate done!
calculation mAP (mean average precision)...
Detection layer: 16 - type = 28
Detection layer: 23 - type = 28
4
cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 545 : build time: Oct 19 2020 - 15:51:06
cuDNN Error: CUDNN_STATUS_BAD_PARAM
cuDNN Error: CUDNN_STATUS_BAD_PARAM: Resource temporarily unavailable
darknet: ./src/utils.c:325: error: Assertion `0' failed.
./exp1_training.sh: line 1: 3947 Aborted (core dumped) ./darknet detector train data/exp1.data cfg/exp1.cfg yolov3-tiny.conv.15 -map
Same problem here. It randomly halted at around 1000 iterations, with only one training process running. GTX 1080 Ti + CUDA 11.1 + cuDNN 8.0.4.30 + NVIDIA driver 455.23.05 + Ubuntu 18.04. I also tried CUDA 10.2 with the same cuDNN, but the problem remains. I used nvidia-smi to monitor the training, and I'm sure it's not an OOM or GPU-overheating problem.
Currently my workaround is to disable the CUDNN flag in the Makefile and recompile, but training really gets slower this way, especially when running two training processes.
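For anyone wanting to try the same workaround, a minimal sketch of the Makefile change, based on the flag block quoted later in this thread (edit the flags at the top of the darknet Makefile, then rebuild with make clean && make):
# With CUDNN=0 darknet falls back to its own CUDA convolution kernels
# (slower, but avoids the cuDNN call that fails here)
GPU=1
CUDNN=0
OPENCV=1
OPENMP=1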
I'm using CUDA 10.2, cuDNN 8.0.4, Ubuntu 18.04, and an RTX 2080 Ti. In my case the error appeared with CUDA 11 rather than 10.2.
I am facing the same problem. I am following the tutorial on training YOLOv4 on Pascal VOC. Everything seemed to be working during training, but it crashed when trying to calculate the mAP after 1034 iterations:
calculation mAP (mean average precision)...
Detection layer: 139 - type = 28
Detection layer: 150 - type = 28
Detection layer: 161 - type = 28
4
cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 545 : build time: Nov 2 2020 - 20:17:33
cuDNN Error: CUDNN_STATUS_BAD_PARAM
cuDNN Error: CUDNN_STATUS_BAD_PARAM: Success
darknet: ./src/utils.c:325: error: Assertion `0' failed.
Aborted (core dumped)
@cayman1021 Did you try downgrading CUDA as suggested by @Hwijune?
Some information regarding my system:
- Machine: MS Azure VM (Standard NC12, 12 vCPUs, 112 GiB memory, 2x Tesla K80 with 12 GiB)
- OS: Ubuntu 16.04
- CUDA: 11.1
- cuDNN: 8.0.4
- Makefile configuration:
GPU=1
CUDNN=1
OPENCV=1
OPENMP=1
...
ARCH= -gencode arch=compute_37,code=sm_37
...
- Configuration file (adapted from yolov4-custom.cfg):
(edit: fixed the number of subdivisions, it was 32)
[net]
batch=64
subdivisions=32
width=416
height=416
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.001
burn_in=1000
max_batches = 40000
policy=steps
steps=32000,36000
scales=.1,.1
...
I somehow solved my issue by revising the following code in detector.c:
//mean_average_precision = validate_detector_map(datacfg, cfgfile, weightfile, 0.25, 0.5, 0, net.letter_box, &net_map);
mean_average_precision = validate_detector_map(datacfg, cfgfile, weightfile, 0.25, 0.5, 0, net.letter_box, NULL);
Hope this can work well in your situation.
Have a good day!
Thanks, @lostagex, that change did work for me. =)
However, I noticed that the validation step is taking place on the CPU, so it is absurdly slow. And even though I compiled it with OPENMP=1, darknet is using only one core. Do you see the same behavior on your system?
The same issue for me. It is really slow; it takes me nearly three days to finish the training process (15,000 iterations).
@lostagex I think I figured out the problem. After reverting the changes I made to src/detector.c, I tried downgrading CUDA from version 11.1 to 10.2 as suggested by @Hwijune, but that didn't solve the problem; darknet behaved the same regardless of the CUDA version.
I noticed that although it was crashing, it was still able to process 4 images, so I decided to increase the number of subdivisions from 32 to 64. The beginning of my cfg file is:
[net]
batch=64
subdivisions=64
width=608
height=608
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.001
burn_in=1000
max_batches = 40000
policy=steps
steps=32000,36000
scales=.1,.1
mosaic=1
...
That change allowed the mAP to be calculated on the GPU, which obviously made everything faster. I am training on two GPUs, but the mAP calculation uses only one of them. I also observed that darknet now uses less GPU memory during training (due to the higher number of subdivisions), so I don't know how much that will impact performance.
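For anyone tuning this, a quick sketch of the mini-batch arithmetic involved, assuming darknet's usual batch/subdivisions relationship (the same mini_batch value darknet prints at startup):
mini_batch = batch / subdivisions
batch=64, subdivisions=32  ->  2 images per forward/backward pass on the GPU
batch=64, subdivisions=64  ->  1 image per forward/backward pass on the GPU
So doubling the subdivisions roughly halves the per-pass GPU memory footprint, which matches the lower memory usage observed above.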
OK, thanks a lot, @silasalves. I will try increasing the subdivisions. I hope the performance will not disappoint us.
The workaround mean_average_precision = validate_detector_map(datacfg, cfgfile, weightfile, 0.25, 0.5, 0, net.letter_box, NULL); failed for me.
./darknet detector test ./cfg/coco.data cfg/yolov4-tiny.cfg yolov4-tiny.weights data/dog.jpg
CUDA-version: 10010 (10010), cuDNN: 7.6.5, GPU count: 2
OpenCV version: 3.2.0
0 : compute_capability = 750, cudnn_half = 0, GPU: GeForce RTX 2080 Ti
net.optimized_memory = 0
mini_batch = 1, batch = 1, time_steps = 1, train = 0
layer   filters  size/strd(dil)      input                output
0 cuDNN status Error in: file: ./src/dark_cuda.c : () : line: 171 : build time: Dec 11 2020 - 20:25:27
cuDNN Error: CUDNN_STATUS_BAD_PARAM
cuDNN Error: CUDNN_STATUS_BAD_PARAM: File exists
darknet: ./src/utils.c:325: error: Assertion `0' failed.
[1] 105369 abort (core dumped) ./darknet detector test ./cfg/coco.data cfg/yolov4-tiny.cfg data/dog.jpg
I have the same problem (training crashes with a cuDNN error when calculating the mAP). It started happening as soon as I had more than one class. I spent hours trying to downgrade CUDA and cuDNN and finally came back to my initial setup (CUDA 11.1, cuDNN 8.0.5.39). I solved it by removing the -map flag from the command. It is now training perfectly, but it is not validating the model against the 20% of images listed in my test.txt file. That is not a big problem for me, since the mAP value depends a lot on the selection of test images. I don't know whether the issue comes from darknet or cuDNN; there might be a fix in downgrading one or the other.
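For illustration, using the training command from the log earlier in this thread as an example (that poster's data/cfg/weights paths, so substitute your own), dropping the flag just means running:
./darknet detector train data/exp1.data cfg/exp1.cfg yolov3-tiny.conv.15
The mAP can still be computed separately afterwards on saved weights with the ./darknet detector map command, which the original poster reported works without crashing.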
Is there any docker image available? Many thanks!
When you downgraded your darknet version, did it solve your issue?