Running the caddn_paddle model fails with a CUDNN_STATUS_NOT_SUPPORTED error
System information
- OS Platform and Distribution: apollo8.0 docker image
- Apollo installed from: built from source in the docker container
- Apollo version: 8.0
- Output of apollo.sh config (if on master branch):
- NVIDIA-SMI 460.32.03, Driver Version: 460.32.03, CUDA Version: 11.2
- GPU: Tesla V100
Steps to reproduce the issue:
- Install caddn_paddle with amodel install caddn_paddle.zip.
- Edit modules/perception/production/conf/perception/perception_common.flag, appending caddn_model_file and caddn_params_file so that the caddn_paddle model is loaded. Also edit modules/perception/pipeline/config/camera_detection_pipeline.pb.txt to load caddn_paddle.
- Run the caddn_paddle model with mainboard -d modules/perception/production/dag/dag_streaming_perception_camera.dag.
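The flag-file edit in the second step can be made repeatable with a small helper. This is only a sketch: it assumes the file uses the usual gflags `--name=value` line format, and the flag values in the usage comment are hypothetical placeholders, since the issue does not give the actual model paths.

```python
from pathlib import Path


def append_flags(flagfile: Path, flags: dict) -> list:
    """Append --name=value lines for flags that are not already set.

    Returns the list of lines that were appended (empty if nothing changed).
    """
    text = flagfile.read_text() if flagfile.exists() else ""

    # Collect the names of flags that already appear in the file.
    present = set()
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("--"):
            present.add(stripped[2:].split("=", 1)[0])

    added = [f"--{name}={value}" for name, value in flags.items()
             if name not in present]
    if added:
        # Start on a fresh line if the file does not end with a newline.
        prefix = "" if (not text or text.endswith("\n")) else "\n"
        with flagfile.open("a") as fh:
            fh.write(prefix + "\n".join(added) + "\n")
    return added


# Usage (paths are hypothetical placeholders, not taken from the issue):
# append_flags(
#     Path("modules/perception/production/conf/perception/perception_common.flag"),
#     {"caddn_model_file": "/path/to/caddn_model_file",
#      "caddn_params_file": "/path/to/caddn_params_file"},
# )
```

Running the helper a second time leaves the file unchanged, which avoids duplicate flag lines when the setup is repeated.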
The detector component fails with the following error:
terminate called after throwing an instance of 'phi::enforce::EnforceNotMet'
  what(): (External) CUDNN error(9), CUDNN_STATUS_NOT_SUPPORTED.
  [Hint: Please search for the error code(9) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at /apollo/data/Paddle/paddle/fluid/operators/grid_sampler_cudnn_op.cu.cc:81)
  [operator < grid_sampler > error]
Aborted
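When comparing runs across environments, it can help to reduce the message above to its key fields: the cuDNN status, the failing operator, and the source location. The following is a minimal sketch that assumes the log text follows the same layout as the message in this issue; the function name is made up for illustration and is not part of Paddle or Apollo.

```python
import re

# Patterns written against the message format shown in this issue.
ERROR_RE = re.compile(r"CUDNN error\((?P<code>\d+)\),\s*(?P<status>CUDNN_STATUS_\w+)")
LOCATION_RE = re.compile(r"\(at (?P<file>[^:()]+):(?P<line>\d+)\)")
OPERATOR_RE = re.compile(r"\[operator < (?P<op>\w+) > error\]")


def summarize_cudnn_error(log_text: str) -> dict:
    """Extract status code, status name, source location and operator."""
    summary = {}
    m = ERROR_RE.search(log_text)
    if m:
        summary["code"] = int(m["code"])
        summary["status"] = m["status"]
    m = LOCATION_RE.search(log_text)
    if m:
        summary["file"] = m["file"]
        summary["line"] = int(m["line"])
    m = OPERATOR_RE.search(log_text)
    if m:
        summary["operator"] = m["op"]
    return summary
```

For the message in this issue, the summary points at the grid_sampler operator and line 81 of grid_sampler_cudnn_op.cu.cc, which is where the cuDNN call was rejected.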
Supporting materials (screenshots, command lines, code/script snippets):
What is the output of apollo.sh config in the docker container you are using? The CUDA version may not match the driver.
I'm running the Apollo docker image on Baidu Cloud, and apollo.sh config returns:
[CGPU-CUDA:ERR] cgpu auth check failed, proc will exit soon, please check your running environment, we only support baidu cloud.
I don't think this is the cause, though: that message is related to GPU memory sharing. I've tried CUDA Version 11.2 and 12.2 and get the same error in both cases.
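Since apollo.sh config is not usable here, a rough driver check can be done from the nvidia-smi banner instead. The sketch below parses a banner line like the one in the system information above. Note that the "CUDA Version" field in that banner is the highest CUDA version the driver supports, not necessarily the toolkit version inside the container. The function names are made up for illustration, and the major.minor comparison is a coarse check that ignores CUDA minor-version compatibility.

```python
import re

BANNER_RE = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>[\d.]+)\s+"
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s*(?P<cuda>[\d.]+)"
)


def parse_banner(text: str) -> dict:
    """Return the nvidia-smi, driver and CUDA versions from a banner line."""
    m = BANNER_RE.search(text)
    if m is None:
        raise ValueError("no nvidia-smi banner found")
    return m.groupdict()


def version_tuple(version: str) -> tuple:
    """Turn '11.2' into (11, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_covers_toolkit(banner_cuda: str, toolkit_cuda: str) -> bool:
    """Coarse check: the toolkit version does not exceed the driver's maximum."""
    return version_tuple(toolkit_cuda) <= version_tuple(banner_cuda)
```

With the banner from this issue, a CUDA 11.2 toolkit passes this check, while a CUDA 12.2 toolkit would not unless the driver is also updated.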
I think this may be a problem with the Paddle caddn operator. I will check and report back.