
Support of CuDNN8

Open artyom-beilis opened this issue 4 years ago • 8 comments

Some of the API that Caffe used was removed in cuDNN 8. Without it, it is impossible to run Caffe on the Ampere architecture.

The change required:

  • switching to the cudnnFind* API in place of the cudnnGet* calls that were removed in version 8;
  • caching the search results so that the algorithm search runs only when the input shape actually changes — otherwise a reshape costs too much;
  • fixing the cuDNN version detection to recognize cuDNN 8;
  • adding an error code that was introduced in version 8.

The change was tested on:

  • 3070/cuda11.2/cudnn8.1
  • 1080/cuda8/cudnn7
  • 1080/cuda8/cudnn6

artyom-beilis avatar Apr 20 '21 07:04 artyom-beilis

Anybody here?

artyom-beilis avatar May 03 '21 09:05 artyom-beilis

@artyom-beilis Thanks for your patch! I tried it, the same as in https://github.com/BVLC/caffe/issues/6970, but ran into much larger memory utilization with cuDNN 8. After some tests I tried a model with a single conv layer and a (20 × 3 × 1280 × 720) input — the "head" of a ResNet used for a detection task. With CUDA 10 and cuDNN 7.6 I observed about 1.7 GB of usage for a forward pass; with CUDA 11 and cuDNN 8, about 2.6 GB. Maybe this comparison is not fully fair, because different GPUs were used: a Titan XP in the first case and a 3060 in the second. Have you seen something like this with the 3070 and 1080? Thank you!

borisgribkov avatar Oct 14 '21 18:10 borisgribkov

Hi, I noticed larger memory use as well. It seems related to cuDNN 8 in general: I see a clear difference when I build the same code with cuDNN 7 vs cuDNN 8. Also make sure you use the latest alignment fix, i.e. the latest branch: https://github.com/artyom-beilis/caffe/commits/fixes_for_cudnn8_bvlc_master

Also, Caffe in general is a memory hog. AFAIR I noticed the cuDNN 7 vs cuDNN 8 difference in memory use with other frameworks as well. Artyom

artyom-beilis avatar Oct 14 '21 19:10 artyom-beilis

AFAIR I noticed the difference in memory use of cudnn7 vs cudnn8 with other frameworks as well.

Could you tell me more about the other frameworks? I have tried to find similar reports of GPU memory problems, but without success.

borisgribkov avatar Oct 14 '21 20:10 borisgribkov

I don't really remember; it was either PyTorch or MXNet. It was a long time ago.

artyom-beilis avatar Oct 14 '21 20:10 artyom-beilis

Anyway, thank you! )

borisgribkov avatar Oct 14 '21 20:10 borisgribkov

Following this, as the current Caffe I built with nvcr.io/nvidia/cuda:11.4.1-cudnn8-devel-ubuntu20.04 and OpenPose results in a much larger GPU RAM footprint on an AWS G5 instance (Ampere).

kmmanto avatar Jan 11 '22 11:01 kmmanto

I tried the proposed changes to make cuDNN 8 work, but training immediately fails with the following error:

I0120 10:25:10.763470 1539595 solver.cpp:60] Solver scaffolding done.
I0120 10:25:10.765404 1539595 caffe.cpp:239] Starting Optimization
I0120 10:25:10.765410 1539595 solver.cpp:292] Solving squeezenet-ssd
I0120 10:25:10.765413 1539595 solver.cpp:293] Learning Rate Policy: poly
F0120 10:25:10.835502 1539595 cudnn_conv_layer.cu:118] Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0)  CUDNN_STATUS_INTERNAL_ERROR
*** Check failure stack trace: ***
    @     0x7f09cdf8f1c3  google::LogMessage::Fail()
    @     0x7f09cdf9425b  google::LogMessage::SendToLog()
    @     0x7f09cdf8eebf  google::LogMessage::Flush()
    @     0x7f09cdf8f6ef  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f09ce7753f0  caffe::CuDNNConvolutionLayer<>::Backward_gpu()
    @     0x7f09ce711c6a  caffe::Net<>::BackwardFromTo()
    @     0x7f09ce711da5  caffe::Net<>::Backward()
    @     0x7f09ce6ecaab  caffe::Solver<>::Step()
    @     0x7f09ce6ed492  caffe::Solver<>::Solve()
    @     0x55739e9b4a7a  train()
    @     0x55739e9b1eac  main
    @     0x7f09cd2fb083  __libc_start_main
    @     0x55739e9b290e  _start

Ubuntu 20.04, NVIDIA GeForce RTX 3060 12 GB, Driver Version: 510.108.03, CUDA Version: 11.6, cuDNN Version: 8.6

A build without cuDNN runs without problems.

BigMuscle85 avatar Jan 20 '23 09:01 BigMuscle85