TensorRT
trtexec with CUDA Graph rarely hits "Cuda failure: an illegal memory access was encountered"
Description
With CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 set, the illegal memory access can still be reproduced, but no CUDA core dump is generated. Why? Under compute-sanitizer or cuda-gdb I have not been able to reproduce the issue so far. Why not? When only one trtexec is running, the failure never happens; with two or more, the illegal memory access happens rarely. So I suspect there may be a memory conflict between the two processes.
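One possible reason no dump appears even though the exception fires is that the default dump location is not writable (common inside containers). A hedged sketch, pointing the dump at a known-writable path via CUDA_COREDUMP_FILE (an env var from the CUDA-GDB docs; the path template and the engine filename are just illustrative):

```shell
# Sketch: enable GPU core dumps and direct them somewhere writable.
# %h expands to hostname, %p to pid (per the CUDA-GDB coredump docs).
export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
export CUDA_COREDUMP_FILE="/tmp/cuda_core_%h_%p.nvcudmp"

# Rerun the failing command; guard in case trtexec is not on PATH here.
if command -v trtexec >/dev/null 2>&1; then
  trtexec --loadEngine=fp16_fusion.trt --device=1 --useCudaGraph --iterations=1000000
else
  echo "trtexec not found on PATH"
fi
```

If a dump is produced, it can be opened with `cuda-gdb` to inspect the faulting kernel.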
Two processes:
trtexec --loadEngine=fp16_fusion.trt --device=1 --useCudaGraph --iterations=1000000
trtexec --loadEngine=fp16_fusion.trt --device=1 --useCudaGraph --iterations=1000000
Result:
Starting inference
Cuda failure: an illegal memory access was encountered
[01/25/2024-18:06:34] [I] === Inference Options ===
[01/25/2024-18:06:34] [I] Iterations: 1000000
[01/25/2024-18:06:34] [I] Duration: 3s (+ 200ms warm up)
[01/25/2024-18:06:34] [I] Sleep time: 0ms
[01/25/2024-18:06:34] [I] Idle time: 0ms
[01/25/2024-18:06:34] [I] Streams: 1
[01/25/2024-18:06:34] [I] ExposeDMA: Disabled
[01/25/2024-18:06:34] [I] Data transfers: Enabled
[01/25/2024-18:06:34] [I] Spin-wait: Disabled
[01/25/2024-18:06:34] [I] Multithreading: Disabled
[01/25/2024-18:06:34] [I] CUDA Graph: Enabled
[01/25/2024-18:06:34] [I] Separate profiling: Disabled
[01/25/2024-18:06:34] [I] Time Deserialize: Disabled
[01/25/2024-18:06:34] [I] Time Refit: Disabled
[01/25/2024-18:06:34] [I] NVTX verbosity: 0
[01/25/2024-18:06:34] [I] Persistent Cache Ratio: 0
[01/25/2024-18:06:34] [I] Inputs:
[01/25/2024-18:06:34] [I] === Device Information ===
[01/25/2024-18:06:34] [I] Selected Device: NVIDIA GeForce RTX 4090
[01/25/2024-18:06:34] [I] Compute Capability: 8.9
[01/25/2024-18:06:34] [I] SMs: 128
[01/25/2024-18:06:34] [I] Compute Clock Rate: 2.52 GHz
[01/25/2024-18:06:34] [I] Device Global Memory: 24217 MiB
[01/25/2024-18:06:34] [I] Shared Memory per SM: 100 KiB
[01/25/2024-18:06:34] [I] Memory Bus Width: 384 bits (ECC disabled)
[01/25/2024-18:06:34] [I] Memory Clock Rate: 10.501 GHz
Environment:
docker run -itd --name xxx --gpus all nvcr.io/nvidia/pytorch:23.09-py3
[01/26/2024-13:29:27] [I] Starting inference
[New Thread 0x7fffa1fff000 (LWP 3603)]
[New Thread 0x7fffa17fe000 (LWP 3604)]
Cuda failure: an illegal memory access was encountered
Thread 6 "trtexec" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffa17fe000 (LWP 3604)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140735902900224) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) backtrace
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140735902900224) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140735902900224) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140735902900224, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007fffb75ee476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007fffb75d47f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x000000000041279e in sample::cudaCheck(cudaError, std::ostream&) [clone .part.174] [clone .constprop.741] ()
#6 0x000000000041a23e in void sample::(anonymous namespace)::inferenceExecution<nvinfer1::IExecutionContext>(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace>>&) ()
#7 0x0000000000411644 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace>>&), std::reference_wrapper<sample::InferenceOptions const>, std::reference_wrapper<sample::InferenceEnvironment>, std::reference_wrapper<sample::(anonymous namespace)::SyncStruct>, int, int, int, std::reference_wrapper<std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace>>>>>>::_M_run() ()
#8 0x000000000045287f in execute_native_thread_routine ()
#9 0x00007fffb7640ac3 in start_thread (arg=
This looks like a bug, could you please provide a repro? Thanks!
Repro steps: run 2, 3, or 4 instances at the same time on an RTX 4090:
trtexec --loadEngine=fp16_fusion.trt --device=1 --useCudaGraph --iterations=1000000
trtexec --loadEngine=fp16_fusion.trt --device=1 --useCudaGraph --iterations=1000000
......
It takes 20 minutes or more to repro the issue. How can I send the test ONNX to you? Thanks!
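The steps above can be scripted so long soak runs don't need babysitting. A minimal sketch (instance count, engine path, and log directory are placeholders, and the failure string is the one from the logs in this thread):

```shell
#!/usr/bin/env bash
# Launch N concurrent trtexec instances and scan their logs for the
# illegal-memory-access failure reported in this issue.
set -u
N=${N:-2}
CMD=${CMD:-"trtexec --loadEngine=fp16_fusion.trt --device=1 --useCudaGraph --iterations=1000000"}
LOGDIR=$(mktemp -d)

for i in $(seq 1 "$N"); do
  $CMD >"$LOGDIR/run_$i.log" 2>&1 &
done
wait

# List which instances (if any) hit the fault.
grep -l "an illegal memory access was encountered" "$LOGDIR"/run_*.log \
  || echo "no fault in $N runs"
```

Wrapping this in an outer loop gives an unattended soak test for the "20 minutes or more" reproduction window.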
You can send a private Google Drive link here and I'll request access. P.S. the email address is similar to my GitHub name, but I don't want to leak it here.
I can also repro the issue with resnet50_1x1024x1024.trt and 4 streams. Steps:
- download resnet50_1x1024x1024.onnx from the Google Drive link below and convert it to a TRT engine.
- mpirun -np 4 trtexec --loadEngine=resnet50_1x1024x1024.trt --useCudaGraph --streams=4 --iterations=1000000 --device=1
- it takes 10 minutes or more to repro the issue. Google Drive link: https://drive.google.com/file/d/1wCuP0qpb96wYPE9ds9q-Bj1fg0eL687h/view?usp=drive_link
Note: it becomes harder to reproduce with CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1, so please don't set that env variable at first.
Thanks!
Requested access, please reply after you grant the access(so that I can get the notification), thanks!
Hi Zero, can you reproduce the issue with the steps above? If there are any updates, please let me know. Thanks a lot!
zerollzeng This issue is urgent, could you help analyze it soon? Thanks very much!
I ran mpirun -np 4 trtexec --loadEngine=resnet50_1x1024x1024.trt --useCudaGraph --streams=4 --iterations=1000000 --device=1 for over 30 mins and didn't reproduce the issue.
Could you please try:
- Use our official TensorRT docker image.
- Run 4 processes simultaneously instead of using mpirun?
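One way to run 4 processes simultaneously without mpirun is plain shell background jobs (a sketch, assuming background jobs are an acceptable substitute for `mpirun -np 4` here; the engine path is the one from the thread):

```shell
# Launch 4 independent trtexec processes and wait for all of them.
for i in 1 2 3 4; do
  trtexec --loadEngine=resnet50_1x1024x1024.trt --useCudaGraph \
          --streams=4 --iterations=1000000 --device=1 &
done
wait
```

Unlike mpirun, this gives each process its own stdout, so per-process failures are easier to attribute.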
Sure. I can't repro it in the official TensorRT docker image either, but I checked the ONNX and found I sent you the wrong file, sorry. Please use the new test file below to repro in pytorch:23.09-py3, where I can reproduce the issue. https://drive.google.com/file/d/1aijSI-1uTHjcOE0MVqpftSAeYFzvMiXF/view?usp=drive_link
docker run -itd --name xxx --gpus all nvcr.io/nvidia/pytorch:23.09-py3
zerollzeng This is a serious issue on the RTX 4090. With the test ONNX and the latest official TensorRT docker image, the issue still reproduces:
trtexec --loadEngine=./test.trt --device=1 --useCudaGraph --iterations=10000000 --streams=4 &
trtexec --loadEngine=./test.trt --device=1 --useCudaGraph --iterations=10000000 --streams=4 &
image: docker run -it --gpus all nvcr.io/nvidia/pytorch:24.01-py3
dmesg:
[251892.020770] NVRM: Xid (PCI:0000:02:00): 31, pid=3241665, name=trtexec, Ch 00000063, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x7fbd_8e12a000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
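When soaking multiple processes, it helps to summarize which pid triggered which Xid event rather than reading raw dmesg. A hedged helper (field positions are assumed from the log line above; Xid 31 is the driver-side MMU fault that pairs with "an illegal memory access" in the application):

```shell
# Pull the Xid code and offending pid out of NVRM lines in dmesg.
dmesg 2>/dev/null | awk '/NVRM: Xid/ {
  for (i = 1; i <= NF; i++) {
    if ($i == "Xid")   xid = $(i + 2)  # value follows "Xid (PCI:...):"
    if ($i ~ /^pid=/)  pid = $i
  }
  gsub(/,/, "", xid)
  gsub(/pid=|,/, "", pid)
  print "Xid=" xid, "pid=" pid
}'
```

Run against the log line above, this would print `Xid=31 pid=3241665`, which can then be matched to one of the concurrent trtexec processes.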
I can reproduce the error. Filed internal bug 4479712 to track this.
@lianggangMei, @zerollzeng can you please provide the fp16_fusion.trt file? I am interested in digging into the issue.
@lianggangMei We've debugged this issue internally and the problem is in CUDA. I can not share the details here but we are planning to provide a fix in the upcoming CUDA versions.
Hi Oxana, I'm happy to hear a fix is coming. Thanks! But I'm confused and need your confirmation of this: "When I only run one trtexec, it can't happen, but with 2 or more the illegal memory access can happen." Could you provide some advice on how to avoid the issue before the fixed CUDA version is released?
@lianggangMei As I said earlier, I can not disclose the details of the workaround we currently have internally for this issue. If your company has an NDA with Nvidia, you can reach out to your developer technology specialist and they will provide you a solution. Otherwise you need to wait for the official release. I don't have confirmation yet of which CUDA version is going to have the fix.