Error Code 1: Myelin ([cask.cpp:exec:972] Platform (Cuda) error)
Whenever I run MLPerf Inference for Llama2-70b in a Docker container, I get the error below. I deleted the container image and ran it again, but I still get the same error. The host server is running RHEL 9.2 with 8 x H100 80GB GPUs and high-performance WekaFS file storage mounted with NVIDIA GDS.
[TensorRT-LLM][ERROR] 1: [runner.cpp::executeMyelinGraph::682] Error Code 1: Myelin ([cask.cpp:exec:972] Platform (Cuda) error)
[TensorRT-LLM][ERROR] Encountered an error in forward function: Executing TRT engine failed!
[TensorRT-LLM][WARNING] Step function failed, continuing.
These RPMs are installed on the host server:
cm-nvidia-container-toolkit-1.14.2-100070_cm10.0_6ea8822f81.x86_64
nvidia-driver-cuda-libs-550.90.07-1.el9.x86_64
nvidia-libXNVCtrl-550.90.07-2.el9.x86_64
nvidia-driver-NVML-550.90.07-1.el9.x86_64
nvidia-driver-NvFBCOpenGL-550.90.07-1.el9.x86_64
nvidia-driver-libs-550.90.07-1.el9.x86_64
nvidia-persistenced-550.90.07-1.el9.x86_64
nvidia-driver-cuda-550.90.07-1.el9.x86_64
dnf-plugin-nvidia-2.2-1.el9.noarch
kmod-nvidia-open-dkms-550.90.07-1.el9.x86_64
nvidia-kmod-common-550.90.07-1.el9.noarch
nvidia-driver-550.90.07-1.el9.x86_64
nvidia-modprobe-550.90.07-2.el9.x86_64
nvidia-settings-550.90.07-2.el9.x86_64
nvidia-xconfig-550.90.07-2.el9.x86_64
nvidia-driver-devel-550.90.07-1.el9.x86_64
nvidia-libXNVCtrl-devel-550.90.07-2.el9.x86_64
nvidia-fabric-manager-550.90.07-1.x86_64
nvidia-gds-12-5-12.5.1-1.x86_64
nvidia-gds-12.5.1-1.x86_64
nvidia-fs-dkms-2.22.3-1.x86_64
nvidia-fs-2.22.3-1.x86_64

[root@hxxxx ~]# rpm -qa |grep -i cuda
cuda-dcgm-libs-3.3.6.1-100101_cm10.0_463140abaf.x86_64
nvidia-driver-cuda-libs-550.90.07-1.el9.x86_64
nvidia-driver-cuda-550.90.07-1.el9.x86_64
cuda-toolkit-config-common-12.5.82-1.noarch
cuda-toolkit-12-config-common-12.5.82-1.noarch
cuda-toolkit-12-5-config-common-12.5.82-1.noarch
RHEL9.2 kernel: 5.14.0-284.30.1.el9_2.x86_64
Docker container where the MLPerf test is running: nvcr.io/nvidia/mlperf/mlperf-inference mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public
Hi, have you solved this problem? I've come across a similar problem.
Could you repost your issue on https://github.com/NVIDIA/TensorRT-LLM/issues, please?
This is probably caused by the engine file being generated in a different environment (hardware and software) from the one it is executed in. Anyone who encounters this should delete the engine files and regenerate them in the current environment to verify whether that is the case.
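For what it's worth, here is a minimal C++ sketch (the engine path and logger are placeholders, not part of the MLPerf harness) that prints the linked TensorRT runtime version and checks whether an existing plan file can even be deserialized on the current machine; if this fails, the engine should be rebuilt in the current environment rather than reused from another one:

```cpp
#include <NvInferRuntime.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <memory>
#include <vector>

// Placeholder logger: forward TensorRT warnings and errors to stderr.
class StderrLogger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cerr << msg << '\n';
    }
};

int main() {
    StderrLogger logger;
    // Version of the TensorRT runtime this binary is linked against.
    std::cout << "Linked TensorRT runtime version: " << getInferLibVersion() << '\n';

    // Hypothetical engine path; substitute the plan produced by your build step.
    std::ifstream f("/work/engines/llama2-70b.plan", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(f)),
                           std::istreambuf_iterator<char>());

    std::unique_ptr<nvinfer1::IRuntime> runtime{nvinfer1::createInferRuntime(logger)};
    std::unique_ptr<nvinfer1::ICudaEngine> engine{
        runtime->deserializeCudaEngine(blob.data(), blob.size())};
    if (!engine) {
        // A plan built with a different TensorRT/CUDA stack or GPU architecture
        // can fail here, or later at execution time with Myelin/CUDA errors.
        std::cerr << "Deserialization failed: delete the plan and regenerate it "
                     "on this machine with the same driver/CUDA/TensorRT stack.\n";
        return 1;
    }
    std::cout << "Engine deserialized OK on this machine.\n";
    return 0;
}
```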
@moraxu @cloudhan Hi, can you help me take a look at this issue? TensorRT version: 10.1.0.27
I encountered a similar error. I have multiple GPUs, with models running on different GPUs. The program starts and stops multiple times, and occasionally different models report similar errors. Sometimes all models run normally.
There are also occasional errors when copying from CPU to GPU during operation.
src code:
I searched online and suspected a GPU ID issue, but before parsing the TRT model I call the cudaSetDevice interface once, and before executing that code I call cudaSetDevice again; both calls are passed the same GPU device.
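This is not the original source code (that snippet did not come through), but a minimal C++ sketch of the pattern being described, assuming one engine per GPU and TensorRT 10's enqueueV3 API; the tensor names, buffers and plan blob are placeholders. The idea is to set the device before deserialization and again before every copy/enqueue, and to check every CUDA call so a copy issued on the wrong device fails loudly instead of surfacing later as a Myelin/CUDA error:

```cpp
#include <NvInferRuntime.h>
#include <cuda_runtime_api.h>
#include <cstdio>
#include <cstdlib>

// Abort on any CUDA runtime error so silent failures surface immediately.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t st_ = (call);                                      \
        if (st_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,             \
                         cudaGetErrorString(st_));                     \
            std::abort();                                              \
        }                                                              \
    } while (0)

// One engine + context + stream per GPU; deserialize on the device you execute on.
struct PerDeviceModel {
    int device;
    nvinfer1::ICudaEngine* engine{nullptr};
    nvinfer1::IExecutionContext* ctx{nullptr};
    cudaStream_t stream{};
};

PerDeviceModel loadOnDevice(nvinfer1::IRuntime& runtime, int device,
                            const void* plan, size_t planSize) {
    PerDeviceModel m{device};
    CUDA_CHECK(cudaSetDevice(device));        // bind this thread before deserialization
    m.engine = runtime.deserializeCudaEngine(plan, planSize);
    m.ctx = m.engine->createExecutionContext();
    CUDA_CHECK(cudaStreamCreate(&m.stream));  // the stream belongs to this device
    return m;
}

void infer(PerDeviceModel& m, const char* inputName, const char* outputName,
           const void* hostIn, void* devIn, void* devOut, size_t inBytes) {
    CUDA_CHECK(cudaSetDevice(m.device));      // and again before every copy/enqueue
    CUDA_CHECK(cudaMemcpyAsync(devIn, hostIn, inBytes,
                               cudaMemcpyHostToDevice, m.stream));
    m.ctx->setTensorAddress(inputName, devIn);   // all I/O tensors must be bound
    m.ctx->setTensorAddress(outputName, devOut);
    if (!m.ctx->enqueueV3(m.stream))
        std::fprintf(stderr, "enqueueV3 failed on device %d\n", m.device);
    CUDA_CHECK(cudaStreamSynchronize(m.stream));
}
```

If the intermittent copy errors persist even with this device discipline and per-call checking, the remaining usual suspects are another thread changing the current device between calls, or buffers/streams created on a different GPU than the one the context runs on.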