
cuDNN Error: CUDNN_STATUS_BAD_PARAM while training

Open • divid3d opened this issue 4 years ago • 6 comments

While training, at the moment the mAP calculation should start, I get an assertion error:

    (next mAP calculation at 1900 iterations)
    1900: 185.299484, 158.218445 avg loss, 0.001000 rate, 6.550167 seconds, 121600 images, 16.111683 hours left
    4cuDNN Error: CUDNN_STATUS_BAD_PARAM: File exists
    darknet: ./src/utils.c:331: error: Assertion 0 failed.

divid3d, Dec 13 '20

I have the same issue. I noticed it happens when I have 2 classes. Right now I am training without the -map flag and it trains fine; it just isn't validating against the 20% of images in test.txt. I don't know whether it is linked to darknet or to cuDNN.

Lerseb, Dec 20 '20
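
A minimal sketch of the workaround described above: train without -map, then compute mAP on a saved checkpoint as a separate step. It assumes the AlexeyAB fork of darknet, which provides the detector map subcommand and the -dont_show flag; the run, train_no_map and eval_map helper names and the DRY_RUN switch are hypothetical. Whether the standalone detector map run avoids the same cuDNN crash is not confirmed in this thread.

```shell
# Hypothetical wrapper: with DRY_RUN set, print each command instead of running it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: train WITHOUT -map, so the periodic in-training mAP pass
# (where the crash is reported) never runs.
train_no_map() {  # args: data-file cfg-file pretrained-weights
  run ./darknet detector train "$1" "$2" "$3" -dont_show
}

# Step 2: evaluate a saved checkpoint afterwards with the fork's
# "detector map" subcommand.
eval_map() {  # args: data-file cfg-file weights-file
  run ./darknet detector map "$1" "$2" "$3"
}
```

For example, DRY_RUN=1 eval_map dataset/obj.data dataset/op14.cfg backup/op14_last.weights only prints the evaluation command. The backup/ directory and the op14_last.weights name are assumptions, based on the fork saving <cfg-name>_last.weights into the backup directory named in the .data file.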

EDIT: I didn't notice this is the pjreddie repo. Opening an issue on the AlexeyAB fork.

I got the same issue with a GTX 1080 Ti, with darknet built with ZED 3.2 and OpenCV 4.5 with CUDA, training only 1 class. Driver version: 455.32.00, CUDA version: 11.1, cuDNN 8.0.4. The issue didn't happen with CUDA 10.1 on Ubuntu 18 on the same setup without ZED.

    /darknet$ git log
    commit 14b196d4f2f73fb2f6f8c4019de9a0de00c5a27e

Crash during the mAP calculation:

    (next mAP calculation at 1000 iterations)
    1000: 0.831169, 0.674900 avg loss, 0.001000 rate, 4.804868 seconds, 64000 images, 3.878937 hours left
    Resizing to initial size: 416 x 416 try to allocate additional workspace_size = 89.65 MB
    CUDA allocate done!
    calculation mAP (mean average precision)...
    Detection layer: 139 - type = 28
    Detection layer: 150 - type = 28
    Detection layer: 161 - type = 28
    4 cuDNN status Error in: file: /home/patryk/darknet/src/convolutional_kernels.cu : () : line: 533 : build time: Nov 7 2020 - 09:24:25
    cuDNN Error: CUDNN_STATUS_BAD_PARAM
    cuDNN Error: CUDNN_STATUS_BAD_PARAM: Resource temporarily unavailable
    ./szkolenie.sh: line 2: 36330 Segmentation fault (core dumped) ./darknet detector train dataset/obj.data dataset/op14.cfg yolov4.conv.137 -map

niemiaszek, Dec 20 '20
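
Because this report ties the crash to the CUDA/cuDNN combination (CUDA 11.1 with cuDNN 8.0.4 fails, while CUDA 10.1 on Ubuntu 18 did not), here is a sketch for reading the installed cuDNN version when comparing setups. The cudnn_version helper is hypothetical; it assumes the CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL macros, which cuDNN 8.x keeps in cudnn_version.h (cuDNN 7 keeps them in cudnn.h). The header location varies by install, commonly /usr/include or /usr/local/cuda/include.

```shell
# Hypothetical helper: print MAJOR.MINOR.PATCH from a cuDNN header that
# defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL; exits 1 if the
# header contains no CUDNN_MAJOR definition.
cudnn_version() {  # arg: path to cudnn_version.h (or cudnn.h for cuDNN 7)
  awk '
    /^#define CUDNN_MAJOR /      { major = $3 }
    /^#define CUDNN_MINOR /      { minor = $3 }
    /^#define CUDNN_PATCHLEVEL / { patch = $3 }
    END {
      if (major == "") exit 1
      print major "." minor "." patch
    }' "$1"
}
```

For example, cudnn_version /usr/include/cudnn_version.h; nvidia-smi shows the driver version and nvcc --version shows the CUDA toolkit version, which together cover the versions compared in this thread.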

Was this ever understood and/or addressed?

Command: darknet detector train yolov3-ambulance-setup.data yolov3-ambulance-train.cfg ./darknet53.conv.74 -dont_show -map 2> train_log.txt

GPU: Nvidia 2028 SUPER; OpenCV 4.5.5 with CUDA; training for 1 class; driver version 512.15; CUDA version 11.6; cuDNN version 8.3

    (next mAP calculation at 100 iterations)
    100: 0.637962, 1.110619 avg loss, 0.001000 rate, 4.220000 seconds, 6400 images, 1.343723 hours left
    Resizing to initial size: 416 x 416 try to allocate additional workspace_size = 154.24 MB
    CUDA allocate done!
    calculation mAP (mean average precision)...
    Detection layer: 82 - type = 28
    Detection layer: 94 - type = 28
    Detection layer: 106 - type = 28
    cuDNN status Error in: file: C:\darknet\src\convolutional_kernels.cu : forward_convolutional_layer_gpu() : line: 555 : build time: Apr 8 2022 - 13:14:27
    cuDNN Error: CUDNN_STATUS_BAD_PARAM

Remco-Terwal-Bose, Apr 08 '22

@Remco-Terwal-Bose see the issue mentioned above, or here.

niemiaszek, Apr 10 '22

Here is my own error message; just when it is about to calculate the mAP, everything crashes:

    (next mAP calculation at 1000 iterations)
    H1000/10000: loss=14.6 hours left=4.9
    1000: 14.551660, 16.706905 avg loss, 0.002610 rate, 1.315968 seconds, 64000 images, 4.874600 hours left
    4Darknet error location: ./src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #541
    cuDNN Error: CUDNN_STATUS_BAD_PARAM: Succes

Rizama03, Oct 28 '23

I also encountered the error when using -map in the command:

    cuDNN status Error in: file: ./src/convolutional_kernels.cu function: forward_convolutional_layer_gpu() line: 541
    cuDNN Error: CUDNN_STATUS_BAD_PARAM
    Darknet error location: ./src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #541

These are the solutions I found:

  1. Downgrade CUDA (reply7153)
  2. Use subdivisions=64
  3. Another comment
  4. export CUDA_VISIBLE_DEVICES=0
  5. GPU architecture
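
Items 2 and 4 can be sketched as two shell helpers. Both function names are hypothetical: set_subdivisions rewrites only uncommented subdivisions= lines in a darknet .cfg file (keeping a .bak copy), and train_on_gpu0 restricts one training run to the first GPU through the standard CUDA_VISIBLE_DEVICES variable. The thread does not explain why either change avoids the crash.

```shell
# Item 2 (hypothetical helper): set every uncommented "subdivisions=" line
# in a darknet .cfg file to a new value; the old file is kept as <cfg>.bak.
set_subdivisions() {  # args: cfg-file value
  sed -i.bak "s/^subdivisions=.*/subdivisions=$2/" "$1"
}

# Item 4 (hypothetical helper): make only GPU 0 visible to CUDA for this
# training run, with the in-training mAP calculation enabled.
train_on_gpu0() {  # args: data-file cfg-file pretrained-weights
  CUDA_VISIBLE_DEVICES=0 ./darknet detector train "$1" "$2" "$3" -dont_show -map
}
```

For example, with the file names from niemiaszek's report: set_subdivisions dataset/op14.cfg 64, then train_on_gpu0 dataset/obj.data dataset/op14.cfg yolov4.conv.137. Commented lines such as #subdivisions=1 in a cfg's testing block are left untouched.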

Still, I was able to run it with the following Makefile settings: GPU=1 CUDNN=0 CUDNN_HALF=0 OPENCV=1

Is this a correct way to do it?

@AlexeyAB

priyanka-iiti, Mar 16 '24
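
The GPU=1 CUDNN=0 CUDNN_HALF=0 OPENCV=1 workaround means rebuilding darknet with cuDNN disabled. A sketch, assuming darknet's Makefile keeps each build switch as a plain FLAG=value line at the start of a line; set_make_flag and rebuild_without_cudnn are hypothetical helper names. With CUDNN=0, darknet's GPU convolutions go through its own im2col + GEMM path instead of cuDNN, so training will likely be slower; the thread does not settle whether this is the proper fix.

```shell
# Hypothetical helper: set a build switch such as CUDNN to a new value in a
# darknet Makefile; the previous contents are kept as <makefile>.bak.
# "^CUDNN=" does not match "CUDNN_HALF=", so each flag is edited separately.
set_make_flag() {  # args: makefile flag value
  sed -i.bak "s/^$2=.*/$2=$3/" "$1"
}

# Rebuild with the reported settings: GPU on, cuDNN and cuDNN half precision
# off, OpenCV on. "make clean" removes objects from the previous build so
# nothing compiled with cuDNN enabled is reused.
rebuild_without_cudnn() {  # arg: darknet source directory
  (
    cd "$1" || exit 1
    set_make_flag Makefile GPU 1
    set_make_flag Makefile CUDNN 0
    set_make_flag Makefile CUDNN_HALF 0
    set_make_flag Makefile OPENCV 1
    make clean && make
  )
}
```

Editing the Makefile keeps the setting for later builds. For a one-off build, GNU make also lets command-line assignments override these plain Makefile assignments: make clean && make GPU=1 CUDNN=0 CUDNN_HALF=0 OPENCV=1.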