DALI CUDA error when running NVIDIA DeepLearningExamples with MXNET_ENABLE_CUDA_GRAPHS=1

Open LSC527 opened this issue 3 years ago • 2 comments

I am running the code from https://github.com/NVIDIA/DeepLearningExamples/tree/master/MxNet/Classification/RN50v1.5 with the environment variable MXNET_ENABLE_CUDA_GRAPHS=1 set, and it fails with the following error:

[1,5]<stderr>:  _DaliBaseIterator.__init__(self,
[1,5]<stderr>:2022-02-24 04:23:12,251:WARNING: DALI iterator does not support resetting while epoch is not finished. Ignoring...
[1,5]<stderr>:2022-02-24 04:23:12,251:INFO: Starting epoch 0
[1,3]<stderr>:terminate called after throwing an instance of 'dali::CUDAError'
[1,3]<stderr>:  what():  CUDA runtime API error cudaErrorStreamCaptureUnsupported (900):
[1,3]<stderr>:operation not permitted when stream is capturing
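
The log above interleaves output from several processes, so it helps to see which one raised the CUDA error. Below is a minimal sketch for extracting that, assuming the `[1,3]` prefix is a `[job,rank]` tag in the style of Open MPI's `--tag-output` option (the thread does not say how the job was launched). The function name `find_cuda_errors` is hypothetical.

```python
import re

# Assumed prefix format: "[job,rank]<stream>:" followed by the message.
TAG = re.compile(r"^\[(\d+),(\d+)\]<(stdout|stderr)>:(.*)$")
# Matches e.g. "CUDA runtime API error cudaErrorStreamCaptureUnsupported (900)".
CUDA_ERR = re.compile(r"CUDA runtime API error (\w+) \((\d+)\)")


def find_cuda_errors(lines):
    """Return (rank, error_name, error_code) for each tagged CUDA error line."""
    found = []
    for line in lines:
        tagged = TAG.match(line)
        if not tagged:
            continue  # skip lines without the assumed tag prefix
        rank = int(tagged.group(2))
        err = CUDA_ERR.search(tagged.group(4))
        if err:
            found.append((rank, err.group(1), int(err.group(2))))
    return found


# Lines copied from the log in this issue.
log = [
    "[1,5]<stderr>:2022-02-24 04:23:12,251:INFO: Starting epoch 0",
    "[1,3]<stderr>:terminate called after throwing an instance of 'dali::CUDAError'",
    "[1,3]<stderr>:  what():  CUDA runtime API error cudaErrorStreamCaptureUnsupported (900):",
    "[1,3]<stderr>:operation not permitted when stream is capturing",
]

errors = find_cuda_errors(log)
```

Under that assumption, the error here comes from the process tagged 3, while the warning and the "Starting epoch 0" message come from the process tagged 5.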

LSC527 avatar Feb 24 '22 06:02 LSC527

@LSC527 In general, DALI is not capturable in a CUDA graph. We can investigate, but a fix on DALI's side is unlikely to be possible if MXNet runs on stream 0.

mzient avatar Feb 25 '22 17:02 mzient

Hi @LSC527,

I think it would be best to raise this issue in the DeepLearningExamples project and ask whether the given model supports MXNET_ENABLE_CUDA_GRAPHS=1. It is possible to use CUDA graphs for model training together with DALI (as NVIDIA does in MLPerf), but you need to check with the model maintainers for more details.
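
In the meantime, one possible stopgap is to launch training with the variable removed from the environment. The sketch below assumes that leaving MXNET_ENABLE_CUDA_GRAPHS unset keeps CUDA graphs disabled in your MXNet build, and that the error only appears when the variable is set, as the report suggests. The helper names and the `train.py` command are hypothetical placeholders, not part of DALI or DeepLearningExamples.

```python
import os
import subprocess

CUDA_GRAPHS_VAR = "MXNET_ENABLE_CUDA_GRAPHS"


def env_without_cuda_graphs(env):
    """Return a copy of env with the CUDA graphs variable removed."""
    cleaned = dict(env)  # copy, so the caller's mapping is not modified
    cleaned.pop(CUDA_GRAPHS_VAR, None)
    return cleaned


def launch(command, env=None):
    """Run command in a child process with the CUDA graphs variable dropped."""
    base = os.environ if env is None else env
    return subprocess.run(command, env=env_without_cuda_graphs(base), check=False)


# Example call (placeholder command, substitute your actual training script):
# launch(["python", "train.py"])
```

This only avoids the capture error by not using CUDA graphs at all; it does not make DALI and CUDA graphs work together.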

JanuszL avatar Feb 28 '22 10:02 JanuszL