
Error Code 1: Myelin ([cask.cpp:exec:972] Platform (Cuda) error

Open jaiswackhv opened this issue 1 year ago • 3 comments

Whenever I run MLPerf Inference for Llama2-70b in a Docker container, I get the error below. I deleted the container image and ran again, but I still get the same error. The host server runs RHEL 9.2 with 8x H100 80GB GPUs and high-performance WekaFS file storage mounted with NVIDIA GDS.

[TensorRT-LLM][ERROR] 1: [runner.cpp::executeMyelinGraph::682] Error Code 1: Myelin ([cask.cpp:exec:972] Platform (Cuda) error)
[TensorRT-LLM][ERROR] Encountered an error in forward function: Executing TRT engine failed!
[TensorRT-LLM][WARNING] Step function failed, continuing.

These RPMs are installed on the host server:

cm-nvidia-container-toolkit-1.14.2-100070_cm10.0_6ea8822f81.x86_64
nvidia-driver-cuda-libs-550.90.07-1.el9.x86_64
nvidia-libXNVCtrl-550.90.07-2.el9.x86_64
nvidia-driver-NVML-550.90.07-1.el9.x86_64
nvidia-driver-NvFBCOpenGL-550.90.07-1.el9.x86_64
nvidia-driver-libs-550.90.07-1.el9.x86_64
nvidia-persistenced-550.90.07-1.el9.x86_64
nvidia-driver-cuda-550.90.07-1.el9.x86_64
dnf-plugin-nvidia-2.2-1.el9.noarch
kmod-nvidia-open-dkms-550.90.07-1.el9.x86_64
nvidia-kmod-common-550.90.07-1.el9.noarch
nvidia-driver-550.90.07-1.el9.x86_64
nvidia-modprobe-550.90.07-2.el9.x86_64
nvidia-settings-550.90.07-2.el9.x86_64
nvidia-xconfig-550.90.07-2.el9.x86_64
nvidia-driver-devel-550.90.07-1.el9.x86_64
nvidia-libXNVCtrl-devel-550.90.07-2.el9.x86_64
nvidia-fabric-manager-550.90.07-1.x86_64
nvidia-gds-12-5-12.5.1-1.x86_64
nvidia-gds-12.5.1-1.x86_64
nvidia-fs-dkms-2.22.3-1.x86_64
nvidia-fs-2.22.3-1.x86_64

[root@hxxxx ~]# rpm -qa | grep -i cuda
cuda-dcgm-libs-3.3.6.1-100101_cm10.0_463140abaf.x86_64
nvidia-driver-cuda-libs-550.90.07-1.el9.x86_64
nvidia-driver-cuda-550.90.07-1.el9.x86_64
cuda-toolkit-config-common-12.5.82-1.noarch
cuda-toolkit-12-config-common-12.5.82-1.noarch
cuda-toolkit-12-5-config-common-12.5.82-1.noarch

RHEL9.2 kernel: 5.14.0-284.30.1.el9_2.x86_64

jaiswackhv avatar Aug 14 '24 19:08 jaiswackhv

Docker container where the MLPerf test is running: nvcr.io/nvidia/mlperf/mlperf-inference:mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public

jaiswackhv avatar Aug 14 '24 19:08 jaiswackhv

Hi, have you solved this problem? I have come across a similar problem.

lishicheng1996 avatar Sep 12 '24 06:09 lishicheng1996

Could you repost your issue on https://github.com/NVIDIA/TensorRT-LLM/issues, please?

moraxu avatar Sep 16 '24 19:09 moraxu

This is probably caused by a mismatch between the environment (hardware and software) in which the engine file was generated and the one in which it is executed. Users who encounter this should try deleting the engine files and regenerating them in the current environment to verify whether that is the case.
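To make this failure mode easier to catch, one can record a small environment fingerprint next to each engine at build time and compare it before loading. Below is a minimal sketch of that idea in Python; the fingerprint fields, the `<engine>.env.json` file layout, and the helper names are my own convention for illustration, not part of TensorRT.

```python
import json
import pathlib
import tempfile

def env_fingerprint(trt, driver, gpu, sm):
    # In real code these values would come from tensorrt.__version__,
    # NVML (driver version), and cudaGetDeviceProperties (GPU name, SM arch).
    return {"trt": trt, "driver": driver, "gpu": gpu, "sm": sm}

def save_fingerprint(engine_path, fp):
    # Store the build-time fingerprint next to the engine file.
    pathlib.Path(str(engine_path) + ".env.json").write_text(json.dumps(fp))

def check_fingerprint(engine_path, current):
    # Return a list of human-readable mismatches (empty list = compatible).
    meta = pathlib.Path(str(engine_path) + ".env.json")
    if not meta.exists():
        return ["no build-time fingerprint recorded"]
    built = json.loads(meta.read_text())
    return [f"{k}: built={built[k]} current={current.get(k)}"
            for k in built if built[k] != current.get(k)]

# Example: an engine built on one software stack, loaded on another.
engine = pathlib.Path(tempfile.mkdtemp()) / "llama2.engine"
built = env_fingerprint("9.2.0", "535.104", "H100 80GB HBM3", "sm_90")
now = env_fingerprint("10.1.0", "550.90.07", "H100 80GB HBM3", "sm_90")
save_fingerprint(engine, built)
for m in check_fingerprint(engine, now):
    print("MISMATCH", m)
```

In a real deployment the fingerprint would be written by the build script and checked at load time; any non-empty mismatch list is a strong hint to delete and regenerate the engine rather than run it.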

cloudhan avatar Dec 03 '24 07:12 cloudhan

@moraxu @cloudhan Hi, can you help me take a look at this issue? TensorRT version: 10.1.0.27.

I encountered a similar error. I have multiple GPUs, with models running on different GPUs. The program starts and stops multiple times, and occasionally different models report similar errors; sometimes all models run normally.

There are also occasional errors when copying from CPU to GPU at runtime.

I have searched online and suspect that it is a GPU device ID issue, but before parsing the TRT model I call the cudaSetDevice interface once, and before executing the inference code I call cudaSetDevice again, passing the same GPU device both times.
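One common cause of this symptom (cudaSetDevice is called with the right device, yet intermittent wrong-device errors still occur) is that cudaSetDevice binds a device to the calling host thread only: a worker thread created later starts on device 0 regardless of what the main thread set. The sketch below models that thread-local behavior in plain Python; `set_device` and `current_device` are illustrative stand-ins for the CUDA runtime calls, not real bindings.

```python
import threading

# Per-thread "current device", mimicking the CUDA runtime's
# thread-local device state (the default device is 0).
_state = threading.local()

def set_device(dev):
    """Stand-in for cudaSetDevice: affects only the calling thread."""
    _state.dev = dev

def current_device():
    """Stand-in for cudaGetDevice."""
    return getattr(_state, "dev", 0)

set_device(1)   # main thread now bound to device 1
seen = {}

def worker():
    # A fresh thread does NOT inherit the main thread's device.
    seen["before"] = current_device()
    set_device(1)                       # must re-bind inside this thread
    seen["after"] = current_device()

t = threading.Thread(target=worker)
t.start()
t.join()

print(seen["before"], seen["after"], current_device())  # prints: 0 1 1
```

If inference runs on a thread pool or callback thread, each such thread needs its own cudaSetDevice call before touching the engine; checking this is a cheap first step when errors appear only intermittently across multiple GPUs.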

wangyaxin1998 avatar Jun 23 '25 13:06 wangyaxin1998