TensorRT
trtexec with CUDA Graph rarely hits "Cuda failure: an illegal memory access was encountered"
Description
With CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 set, the illegal memory access can still be reproduced, but no CUDA core dump is generated. Why? Under compute-sanitizer or cuda-gdb I have not been able to reproduce the issue so far. Why not? When only one trtexec is running, the failure never happens; with two or more, the illegal memory access happens rarely. So I suspect there may be a memory conflict between the two processes.
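One possible reason no dump appears even though the exception fires is that the default dump location is not writable (common inside containers). A hedged sketch, pointing the dump at a known-writable path via CUDA_COREDUMP_FILE (an env var from the CUDA-GDB docs; the path template and the engine filename are just illustrative):

```shell
# Sketch: enable GPU core dumps and direct them somewhere writable.
# %h expands to hostname, %p to pid (per the CUDA-GDB coredump docs).
export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
export CUDA_COREDUMP_FILE="/tmp/cuda_core_%h_%p.nvcudmp"

# Rerun the failing command; guard in case trtexec is not on PATH here.
if command -v trtexec >/dev/null 2>&1; then
  trtexec --loadEngine=fp16_fusion.trt --device=1 --useCudaGraph --iterations=1000000
else
  echo "trtexec not found on PATH"
fi
```

If a dump is produced, it can be opened with `cuda-gdb` to inspect the faulting kernel.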
Two processes:
trtexec --loadEngine=fp16_fusion.trt --device=1 --useCudaGraph --iterations=1000000
trtexec --loadEngine=fp16_fusion.trt --device=1 --useCudaGraph --iterations=1000000
Result:
Starting inference
Cuda failure: an illegal memory access was encountered
[01/25/2024-18:06:34] [I] === Inference Options ===
[01/25/2024-18:06:34] [I] Iterations: 1000000
[01/25/2024-18:06:34] [I] Duration: 3s (+ 200ms warm up)
[01/25/2024-18:06:34] [I] Sleep time: 0ms
[01/25/2024-18:06:34] [I] Idle time: 0ms
[01/25/2024-18:06:34] [I] Streams: 1
[01/25/2024-18:06:34] [I] ExposeDMA: Disabled
[01/25/2024-18:06:34] [I] Data transfers: Enabled
[01/25/2024-18:06:34] [I] Spin-wait: Disabled
[01/25/2024-18:06:34] [I] Multithreading: Disabled
[01/25/2024-18:06:34] [I] CUDA Graph: Enabled
[01/25/2024-18:06:34] [I] Separate profiling: Disabled
[01/25/2024-18:06:34] [I] Time Deserialize: Disabled
[01/25/2024-18:06:34] [I] Time Refit: Disabled
[01/25/2024-18:06:34] [I] NVTX verbosity: 0
[01/25/2024-18:06:34] [I] Persistent Cache Ratio: 0
[01/25/2024-18:06:34] [I] Inputs:
[01/25/2024-18:06:34] [I] === Device Information ===
[01/25/2024-18:06:34] [I] Selected Device: NVIDIA GeForce RTX 4090
[01/25/2024-18:06:34] [I] Compute Capability: 8.9
[01/25/2024-18:06:34] [I] SMs: 128
[01/25/2024-18:06:34] [I] Compute Clock Rate: 2.52 GHz
[01/25/2024-18:06:34] [I] Device Global Memory: 24217 MiB
[01/25/2024-18:06:34] [I] Shared Memory per SM: 100 KiB
[01/25/2024-18:06:34] [I] Memory Bus Width: 384 bits (ECC disabled)
[01/25/2024-18:06:34] [I] Memory Clock Rate: 10.501 GHz
Environment:
docker run -itd --name xxx --gpus all nvcr.io/nvidia/pytorch:23.09-py3
[01/26/2024-13:29:27] [I] Starting inference
[New Thread 0x7fffa1fff000 (LWP 3603)]
[New Thread 0x7fffa17fe000 (LWP 3604)]
Cuda failure: an illegal memory access was encountered
Thread 6 "trtexec" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffa17fe000 (LWP 3604)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140735902900224) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) backtrace
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140735902900224) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140735902900224) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140735902900224, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007fffb75ee476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007fffb75d47f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x000000000041279e in sample::cudaCheck(cudaError, std::ostream&) [clone .part.174] [clone .constprop.741] ()
#6 0x000000000041a23e in void sample::(anonymous namespace)::inferenceExecution<nvinfer1::IExecutionContext>(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace>>&) ()
#7 0x0000000000411644 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace>>&), std::reference_wrapper<sample::InferenceOptions const>, std::reference_wrapper<sample::InferenceEnvironment>, std::reference_wrapper<sample::(anonymous namespace)::SyncStruct>, int, int, int, std::reference_wrapper<std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace>>>>>>::_M_run() ()
#8 0x000000000045287f in execute_native_thread_routine ()
#9 0x00007fffb7640ac3 in start_thread (arg=
This looks like a bug, could you please provide a repro? Thanks!
Repro steps: run 2, 3, or 4 instances at the same time on an RTX 4090:
trtexec --loadEngine=fp16_fusion.trt --device=1 --useCudaGraph --iterations=1000000
trtexec --loadEngine=fp16_fusion.trt --device=1 --useCudaGraph --iterations=1000000
......
It takes 20 minutes or more to repro the issue. How can I send the test ONNX to you? Thanks!
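The steps above can be scripted so long soak runs don't need babysitting. A minimal sketch (instance count, engine path, and log directory are placeholders, and the failure string is the one from the logs in this thread):

```shell
#!/usr/bin/env bash
# Launch N concurrent trtexec instances and scan their logs for the
# illegal-memory-access failure reported in this issue.
set -u
N=${N:-2}
CMD=${CMD:-"trtexec --loadEngine=fp16_fusion.trt --device=1 --useCudaGraph --iterations=1000000"}
LOGDIR=$(mktemp -d)

for i in $(seq 1 "$N"); do
  $CMD >"$LOGDIR/run_$i.log" 2>&1 &
done
wait

# List which instances (if any) hit the fault.
grep -l "an illegal memory access was encountered" "$LOGDIR"/run_*.log \
  || echo "no fault in $N runs"
```

Wrapping this in an outer loop gives an unattended soak test for the "20 minutes or more" reproduction window.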
You can send a private Google Drive link here and I'll request access. P.S. the email address is similar to my GitHub name, but I don't want to leak it here.
I can also repro the issue with resnet50_1x1024x1024.trt and 4 streams. Steps:
- download resnet50_1x1024x1024.onnx from the Google Drive link below and convert it to a TRT engine.
- mpirun -np 4 trtexec --loadEngine=resnet50_1x1024x1024.trt --useCudaGraph --streams=4 --iterations=1000000 --device=1
- it takes 10 minutes or more to repro the issue. Google Drive link: https://drive.google.com/file/d/1wCuP0qpb96wYPE9ds9q-Bj1fg0eL687h/view?usp=drive_link
Note: it becomes harder to reproduce with CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1, so please don't set that env variable at first.
Thanks!
Requested access, please reply after you grant the access(so that I can get the notification), thanks!
Hi Zero, can you reproduce the issue with the steps above? If there are any updates, please let me know. Thanks a lot!
zerollzeng This issue is urgent, could you help analyze it soon? Thanks very much!
I ran mpirun -np 4 trtexec --loadEngine=resnet50_1x1024x1024.trt --useCudaGraph --streams=4 --iterations=1000000 --device=1 for over 30 mins and didn't reproduce the issue.
Could you please try:
- Use our official TensorRT docker image.
- Run 4 processes simultaneously instead of using mpirun?
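One way to run 4 processes simultaneously without mpirun is plain shell background jobs (a sketch, assuming background jobs are an acceptable substitute for `mpirun -np 4` here; the engine path is the one from the thread):

```shell
# Launch 4 independent trtexec processes and wait for all of them.
for i in 1 2 3 4; do
  trtexec --loadEngine=resnet50_1x1024x1024.trt --useCudaGraph \
          --streams=4 --iterations=1000000 --device=1 &
done
wait
```

Unlike mpirun, this gives each process its own stdout, so per-process failures are easier to attribute.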
Sure. I can't repro it in the official TensorRT docker image either, but I checked the ONNX and found I sent you the wrong file, sorry. Please use the new test file below to repro in pytorch:23.09-py3, where I can reproduce the issue. https://drive.google.com/file/d/1aijSI-1uTHjcOE0MVqpftSAeYFzvMiXF/view?usp=drive_link
docker run -itd --name xxx --gpus all nvcr.io/nvidia/pytorch:23.09-py3
zerollzeng This is a serious issue on the RTX 4090. With the test ONNX and the latest official TensorRT docker image, the issue still reproduces:
trtexec --loadEngine=./test.trt --device=1 --useCudaGraph --iterations=10000000 --streams=4 &
trtexec --loadEngine=./test.trt --device=1 --useCudaGraph --iterations=10000000 --streams=4 &
image: docker run -it --gpus all nvcr.io/nvidia/pytorch:24.01-py3
dmesg:
[251892.020770] NVRM: Xid (PCI:0000:02:00): 31, pid=3241665, name=trtexec, Ch 00000063, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x7fbd_8e12a000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
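When soaking multiple processes, it helps to summarize which pid triggered which Xid event rather than reading raw dmesg. A hedged helper (field positions are assumed from the log line above; Xid 31 is the driver-side MMU fault that pairs with "an illegal memory access" in the application):

```shell
# Pull the Xid code and offending pid out of NVRM lines in dmesg.
dmesg 2>/dev/null | awk '/NVRM: Xid/ {
  for (i = 1; i <= NF; i++) {
    if ($i == "Xid")   xid = $(i + 2)  # value follows "Xid (PCI:...):"
    if ($i ~ /^pid=/)  pid = $i
  }
  gsub(/,/, "", xid)
  gsub(/pid=|,/, "", pid)
  print "Xid=" xid, "pid=" pid
}'
```

Run against the log line above, this would print `Xid=31 pid=3241665`, which can then be matched to one of the concurrent trtexec processes.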
I can reproduce the error. Filed internal bug 4479712 to track this.
@lianggangMei, @zerollzeng can you please provide the fp16_fusion.trt file? I am interested in digging into the issue.
@lianggangMei We've debugged this issue internally and the problem is in CUDA. I can not share the details here but we are planning to provide a fix in the upcoming CUDA versions.
Hi Oxana, I'm happy to hear a fix is coming. Thanks! But I'm confused and need your confirmation of this: "When I only run one trtexec, it can't happen, but with 2 or more the illegal memory access can happen." Could you provide some advice on how to avoid the issue before the fixed CUDA version is released?
@lianggangMei As I said earlier, I can not disclose the details of the workaround we currently have internally for this issue. If your company has an NDA with Nvidia, you can reach out to your developer technology specialist and they will provide you a solution. Otherwise you need to wait for the official release. I don't have confirmation yet of which CUDA version is going to have the fix.