
Running a model on an A10 GPU: after a while it reports CUDA error(700), an illegal memory access was encountered, but the machine shows no sign of any problem

Open bdbaigc opened this issue 1 year ago • 1 comments

ERROR 2023-08-18 22:36:56,242 [operator.py:1079] [text_quality] failed to predict. (data_id=517045 log_id=517045) [text_quality|6] Failed to process(batch: [517045]): (External) CUDA error(700), an illegal memory access was encountered. [Hint: Please search for the error code(700) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:252) . Please check the input dict and checkout PipelineServingLogs/pipeline.log for more details.

INFO 2023-08-18 22:36:56,242 [operator.py:1454] prometheus inf count +1

ERROR 2023-08-18 22:36:56,247 [dag.py:420] (data_id=517045 log_id=0) Failed to predict: [text_quality] failed to predict. (data_id=517045 log_id=517045) [text_quality|6] Failed to process(batch: [517045]): (External) CUDA error(700), an illegal memory access was encountered. [Hint: Please search for the error code(700) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:252) . Please check the input dict and checkout PipelineServingLogs/pipeline.log for more details
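Since the failure only shows up after the service has been running for a while, it may help to pull every CUDA failure out of the pipeline log and see which op, request and timestamp each one belongs to. The following is a minimal sketch, not part of Paddle Serving: the function name `parse_cuda_failures` and the regular expression are assumptions derived solely from the ERROR line quoted above, so the pattern may need adjusting for other log layouts.

```python
import re

# Assumed pattern, based only on the operator.py ERROR line quoted in this issue.
ERROR_RE = re.compile(
    r"ERROR (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[operator\.py:\d+\] "
    r"\[(?P<op>[^\]]+)\] failed to predict\. "
    r"\(data_id=(?P<data_id>\d+) log_id=(?P<log_id>\d+)\)"
    r".*?CUDA error\((?P<code>\d+)\)"
)


def parse_cuda_failures(log_text):
    """Return one dict per operator-level CUDA failure found in log_text."""
    return [m.groupdict() for m in ERROR_RE.finditer(log_text)]


# Excerpt of the log line from this issue (hint text and trailing part omitted).
sample = (
    "ERROR 2023-08-18 22:36:56,242 [operator.py:1079] [text_quality] "
    "failed to predict. (data_id=517045 log_id=517045) [text_quality|6] "
    "Failed to process(batch: [517045]): (External) CUDA error(700), "
    "an illegal memory access was encountered."
)

for row in parse_cuda_failures(sample):
    # Each row carries the timestamp, op name, request ids and CUDA error code.
    print(row["ts"], row["op"], row["data_id"], row["code"])
```

Running this over the whole of PipelineServingLogs/pipeline.log (read into a string) would give a list of failing requests that can be compared against input sizes or uptime; whether any such correlation exists here is not known from the report.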

nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A10                 On   | 00000000:5E:00.0 Off |                    0 |
|  0%   44C    P0    57W / 150W |   6606MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|

I can't see anything wrong here.

bdbaigc, Aug 18 '23 15:08

Same question here.

xinj7, Oct 07 '23 09:10