Segmentation fault while loading CUDA Provider

Open ozanarmagan opened this issue 2 years ago • 5 comments

Describe the issue

I have built ONNX Runtime with the CUDA provider. However, when I link a program against libonnxruntime_providers_cuda.so, I get a segfault at program startup. Can anyone help?

Note: All of onnxruntime's dependencies are statically linked except libonnxruntime_providers_shared.so and libonnxruntime_providers_cuda.so.

Valgrind output:

==9835== Invalid read of size 8
==9835==    at 0x72E90A0: onnxruntime::DataTypeImpl const* onnxruntime::DataTypeImpl::GetTensorType<onnxruntime::MLFloat16>() (in /home/ozan/.cache/bazel/_bazel_ozan/a4e6aac22f06572491397a87dc0175dd/execroot/__main__/bazel-out/k8-fastbuild/bin/onnxruntime/lib/libonnxruntime_providers_cuda.so)
==9835==    by 0x709F5F7: _GLOBAL__sub_I_cast_op.cc (in /home/ozan/.cache/bazel/_bazel_ozan/a4e6aac22f06572491397a87dc0175dd/execroot/__main__/bazel-out/k8-fastbuild/bin/onnxruntime/lib/libonnxruntime_providers_cuda.so)
==9835==    by 0x607447D: call_init.part.0 (dl-init.c:70)
==9835==    by 0x6074567: call_init (dl-init.c:33)
==9835==    by 0x6074567: _dl_init (dl-init.c:117)
==9835==    by 0x608E2E9: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
==9835==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==9835== 
==9835== 
==9835== Process terminating with default action of signal 11 (SIGSEGV)
==9835==  Access not within mapped region at address 0x0
==9835==    at 0x72E90A0: onnxruntime::DataTypeImpl const* onnxruntime::DataTypeImpl::GetTensorType<onnxruntime::MLFloat16>() (in /home/ozan/.cache/bazel/_bazel_ozan/a4e6aac22f06572491397a87dc0175dd/execroot/__main__/bazel-out/k8-fastbuild/bin/onnxruntime/lib/libonnxruntime_providers_cuda.so)
==9835==    by 0x709F5F7: _GLOBAL__sub_I_cast_op.cc (in /home/ozan/.cache/bazel/_bazel_ozan/a4e6aac22f06572491397a87dc0175dd/execroot/__main__/bazel-out/k8-fastbuild/bin/onnxruntime/lib/libonnxruntime_providers_cuda.so)
==9835==    by 0x607447D: call_init.part.0 (dl-init.c:70)
==9835==    by 0x6074567: call_init (dl-init.c:33)
==9835==    by 0x6074567: _dl_init (dl-init.c:117)
==9835==    by 0x608E2E9: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)

Urgency

Urgent because we are close to a new release.

Target platform

Linux

Build script

Complicated

Error / output

SEGFAULT

Visual Studio Version

No response

GCC / Compiler Version

gcc-11.3.0

ozanarmagan avatar May 29 '23 11:05 ozanarmagan

Were you able to solve it? I have the same problem.

JXQI avatar Aug 03 '23 08:08 JXQI

I ran into the same problem when using both libtorch and libonnxruntime-gpu. I tried both libtorch-cpu and libtorch-gpu, but the problem didn't go away.

HibiscusRiseSun avatar Aug 15 '23 09:08 HibiscusRiseSun

Solution: do not link libonnxruntime_providers_cuda.so at build time. Put that dynamic library in the same folder as the other onnxruntime dynamic libraries (or in the same folder as your executable if you are linking onnxruntime statically), then call session_options.AppendExecutionProvider_CUDA to use the CUDA EP. onnxruntime will then load the library itself.

ozanarmagan avatar Aug 17 '23 08:08 ozanarmagan

I'm seeing the same issue here, although I copied the .so file directly from the Linux x86 release (https://github.com/microsoft/onnxruntime/releases/tag/v1.17.1), so I don't have access to debug symbols.

Here is the gdb output on my system:


(gdb) where
#0  0x00007fffd9e97736 in ?? ()
   from /home/axby/bazel_cache/1d86e8c08993f69eb943b44e0ec77f39/execroot/_main/bazel-out/k8-dbg/bin/examples/../_solib_k8/_U_S_Sthird_Uparty_Sonnxruntime_Connxruntime_Uproviders_Ucuda_Uso___Uthird_Uparty_Sonnxruntime_Slinux_Ux64/libonnxruntime_providers_cuda.so
#1  0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffded8, env=env@entry=0x7fffffffdee8) at ./elf/dl-init.c:70
#2  0x00007ffff7fc9568 in call_init (env=0x7fffffffdee8, argv=0x7fffffffded8, argc=1, l=<optimized out>) at ./elf/dl-init.c:33
#3  _dl_init (main_map=0x7ffff7ffe2e0, argc=1, argv=0x7fffffffded8, env=0x7fffffffdee8) at ./elf/dl-init.c:117
#4  0x00007ffff7fe32ca in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#5  0x0000000000000001 in ?? ()
#6  0x00007fffffffe248 in ?? ()
#7  0x0000000000000000 in ?? ()
(gdb) 

axbycc-mark avatar Jun 27 '24 18:06 axbycc-mark

@ozanarmagan I see you are using Bazel. I'm having trouble getting Bazel to automatically copy libonnxruntime_providers_cuda.so into the right directory (bazel-bin/..../_solib_k8/..../). How did you manage this?

axbycc-mark avatar Jun 27 '24 21:06 axbycc-mark

The same issue is still present in the 1.20.0 release. Ubuntu 20.04 / x86 with cuDNN 9.5.1 and CUDA 12.2.

ipoletaev avatar Nov 09 '24 06:11 ipoletaev

> The same issue is still present in the 1.20.0 release. Ubuntu 20.04 / x86 with cuDNN 9.5.1 and CUDA 12.2.

Have you tried the method proposed by @ozanarmagan? It solved the problem for me.

Place all the libonnxruntime dynamic libraries in the same directory:

# tree onnxruntime_gpu/
onnxruntime_gpu/
├── libonnxruntime.so -> libonnxruntime.so.1.13.1
├── libonnxruntime.so.1.13.1
├── libonnxruntime_providers_cuda.so
└── libonnxruntime_providers_shared.so

When building, link only libonnxruntime.so and avoid linking libonnxruntime_providers_cuda.so directly. You can do this with the following linker flags:

-Wl,-rpath,/path/to/libonnxruntime -L /path/to/libonnxruntime -lonnxruntime

HibiscusRiseSun avatar Nov 15 '24 07:11 HibiscusRiseSun

You can also put it in any folder listed in LD_LIBRARY_PATH.

ozanarmagan avatar Nov 15 '24 12:11 ozanarmagan

> > The same issue is still present in the 1.20.0 release. Ubuntu 20.04 / x86 with cuDNN 9.5.1 and CUDA 12.2.
>
> Have you tried the method proposed by @ozanarmagan? It solved the problem for me.
>
> Place all the libonnxruntime dynamic libraries in the same directory:
>
> # tree onnxruntime_gpu/
> onnxruntime_gpu/
> ├── libonnxruntime.so -> libonnxruntime.so.1.13.1
> ├── libonnxruntime.so.1.13.1
> ├── libonnxruntime_providers_cuda.so
> └── libonnxruntime_providers_shared.so
>
> When building, link only libonnxruntime.so and avoid linking libonnxruntime_providers_cuda.so directly. You can do this with the following linker flags:
>
> -Wl,-rpath,/path/to/libonnxruntime -L /path/to/libonnxruntime -lonnxruntime

I'm sorry, what does this mean? My current CMakeLists.txt looks like this:

set(CMAKE_PREFIX_PATH "${PROJECT_SOURCE_DIR}/third_party/onnxruntime-linux-x64-gpu-1.20.1/lib64/cmake/onnxruntime")
find_package(onnxruntime REQUIRED)
...
target_link_libraries(test ${OpenCV_LIBS} torch onnxruntime::onnxruntime)

UnlimitedR avatar Jan 25 '25 18:01 UnlimitedR

> You can also put it in any folder listed in LD_LIBRARY_PATH.

I'm sorry, what does this mean? My current CMakeLists.txt looks like this:

set(CMAKE_PREFIX_PATH "${PROJECT_SOURCE_DIR}/third_party/onnxruntime-linux-x64-gpu-1.20.1/lib64/cmake/onnxruntime")
find_package(onnxruntime REQUIRED)
...
target_link_libraries(test ${OpenCV_LIBS} torch onnxruntime::onnxruntime)

UnlimitedR avatar Jan 25 '25 18:01 UnlimitedR

If you ever find yourself using the CUDA provider inside a scratch-based Docker container, add LD_LIBRARY_PATH=/usr/lib64 to the container/image environment variables.

infastin avatar Mar 21 '25 20:03 infastin

Applying stale label due to no activity in 30 days