Segmentation fault while loading CUDA Provider
Describe the issue
I have built ONNX Runtime with the CUDA provider. However, when I link a program against libonnxruntime_providers_cuda.so, it segfaults at startup. Can anyone help?
Note: All of onnxruntime's dependencies are statically linked except libonnxruntime_providers_shared.so and libonnxruntime_providers_cuda.so.
Valgrind output:
==9835== Invalid read of size 8
==9835== at 0x72E90A0: onnxruntime::DataTypeImpl const* onnxruntime::DataTypeImpl::GetTensorType<onnxruntime::MLFloat16>() (in /home/ozan/.cache/bazel/_bazel_ozan/a4e6aac22f06572491397a87dc0175dd/execroot/__main__/bazel-out/k8-fastbuild/bin/onnxruntime/lib/libonnxruntime_providers_cuda.so)
==9835== by 0x709F5F7: _GLOBAL__sub_I_cast_op.cc (in /home/ozan/.cache/bazel/_bazel_ozan/a4e6aac22f06572491397a87dc0175dd/execroot/__main__/bazel-out/k8-fastbuild/bin/onnxruntime/lib/libonnxruntime_providers_cuda.so)
==9835== by 0x607447D: call_init.part.0 (dl-init.c:70)
==9835== by 0x6074567: call_init (dl-init.c:33)
==9835== by 0x6074567: _dl_init (dl-init.c:117)
==9835== by 0x608E2E9: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
==9835== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==9835==
==9835==
==9835== Process terminating with default action of signal 11 (SIGSEGV)
==9835== Access not within mapped region at address 0x0
==9835== at 0x72E90A0: onnxruntime::DataTypeImpl const* onnxruntime::DataTypeImpl::GetTensorType<onnxruntime::MLFloat16>() (in /home/ozan/.cache/bazel/_bazel_ozan/a4e6aac22f06572491397a87dc0175dd/execroot/__main__/bazel-out/k8-fastbuild/bin/onnxruntime/lib/libonnxruntime_providers_cuda.so)
==9835== by 0x709F5F7: _GLOBAL__sub_I_cast_op.cc (in /home/ozan/.cache/bazel/_bazel_ozan/a4e6aac22f06572491397a87dc0175dd/execroot/__main__/bazel-out/k8-fastbuild/bin/onnxruntime/lib/libonnxruntime_providers_cuda.so)
==9835== by 0x607447D: call_init.part.0 (dl-init.c:70)
==9835== by 0x6074567: call_init (dl-init.c:33)
==9835== by 0x6074567: _dl_init (dl-init.c:117)
==9835== by 0x608E2E9: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
Urgency
Urgent because we are close to a new release
Target platform
Linux
Build script
Complicated
Error / output
SEGFAULT
Visual Studio Version
No response
GCC / Compiler Version
gcc-11.3.0
Can you solve it? I have the same problem.
I hit the same problem when using both libtorch and libonnxruntime-gpu. I tried both libtorch-cpu and libtorch-gpu, but the problem didn't go away.
Solution: do not link libonnxruntime_providers_cuda.so at build time. Put that shared library in the same folder as the other onnxruntime shared libraries (or in the same folder as your executable if you are linking onnxruntime statically), then call session_options.AppendExecutionProvider_CUDA to enable the CUDA EP. onnxruntime will then load that library itself.
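For anyone who wants to see what that looks like in code, here is a minimal sketch using the ONNX Runtime C++ API. The helper name `make_cuda_session`, the log id string, and the default `device_id` of 0 are illustrative assumptions, not something from the original post. It assumes the program is linked only against libonnxruntime and that the two provider libraries sit where onnxruntime can find them at runtime.

```cpp
#include <onnxruntime_cxx_api.h>

// Hypothetical helper: creates a session that uses the CUDA EP.
// The executable links only against libonnxruntime; the CUDA provider
// library is NOT linked and is left for onnxruntime to load at runtime.
Ort::Session make_cuda_session(Ort::Env& env, const char* model_path,
                               int device_id = 0) {
    Ort::SessionOptions session_options;

    // Default-constructed options; only the device id is overridden here.
    OrtCUDAProviderOptions cuda_options{};
    cuda_options.device_id = device_id;

    // Throws Ort::Exception if the CUDA provider cannot be loaded.
    session_options.AppendExecutionProvider_CUDA(cuda_options);

    return Ort::Session(env, model_path, session_options);
}
```

A caller would create an `Ort::Env` first, for example `Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "cuda-ep-demo");`, then call `make_cuda_session(env, "model.onnx")` inside a `try` block that catches `Ort::Exception`. The model path is a placeholder.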
Seeing the same issue here, although I copied the .so file directly from the linux x86 release here https://github.com/microsoft/onnxruntime/releases/tag/v1.17.1 so I don't have access to debug symbols.
Here is the gdb output on my system:
(gdb) where
#0 0x00007fffd9e97736 in ?? ()
from /home/axby/bazel_cache/1d86e8c08993f69eb943b44e0ec77f39/execroot/_main/bazel-out/k8-dbg/bin/examples/../_solib_k8/_U_S_Sthird_Uparty_Sonnxruntime_Connxruntime_Uproviders_Ucuda_Uso___Uthird_Uparty_Sonnxruntime_Slinux_Ux64/libonnxruntime_providers_cuda.so
#1 0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffded8, env=env@entry=0x7fffffffdee8) at ./elf/dl-init.c:70
#2 0x00007ffff7fc9568 in call_init (env=0x7fffffffdee8, argv=0x7fffffffded8, argc=1, l=<optimized out>) at ./elf/dl-init.c:33
#3 _dl_init (main_map=0x7ffff7ffe2e0, argc=1, argv=0x7fffffffded8, env=0x7fffffffdee8) at ./elf/dl-init.c:117
#4 0x00007ffff7fe32ca in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#5 0x0000000000000001 in ?? ()
#6 0x00007fffffffe248 in ?? ()
#7 0x0000000000000000 in ?? ()
(gdb)
@ozanarmagan I see you are using Bazel. I'm having trouble getting Bazel to automatically copy libonnxruntime_providers_cuda.so into the right directory (bazel-bin/..../_solib_k8/..../). How did you manage this?
The same issue is still present in the 1.20.0 release. Ubuntu 20.04 / x86 with cuDNN 9.5.1 and CUDA 12.2.
Have you tried the method proposed by @ozanarmagan? I solved the problem using his method.
Place all the libonnxruntime shared libraries in the same directory:
# tree onnxruntime_gpu/
onnxruntime_gpu/
├── libonnxruntime.so -> libonnxruntime.so.1.13.1
├── libonnxruntime.so.1.13.1
├── libonnxruntime_providers_cuda.so
└── libonnxruntime_providers_shared.so
When building, link only libonnxruntime.so and do not link libonnxruntime_providers_cuda.so directly.
You can link the library with the following linker flags:
-Wl,-rpath /path/to/libonnxruntime -L /path/to/libonnxruntime -lonnxruntime
You can also put it in any folder listed in LD_LIBRARY_PATH.
I'm sorry, what does this mean? My current CMakeLists.txt looks like this:
set(CMAKE_PREFIX_PATH "${PROJECT_SOURCE_DIR}/third_party/onnxruntime-linux-x64-gpu-1.20.1/lib64/cmake/onnxruntime")
find_package(onnxruntime REQUIRED)
...
target_link_libraries(test ${OpenCV_LIBS} torch onnxruntime::onnxruntime)
If you ever find yourself using the CUDA provider inside a scratch-based Docker container, add LD_LIBRARY_PATH=/usr/lib64 to the container/image environment variables.
Applying stale label due to no activity in 30 days