onnxruntime_gpu 1.8.0 crashes the JVM (SIGSEGV) on a CPU-only device
Environment Info
Container: Docker with NO GPU
OS: AlmaLinux
CUDA installed: 12.2
cuDNN installed: 8.9.0
DJL version: 0.29.0
onnxruntime_gpu version: 1.8.0
Error Message
[root@r100048367-91051506-l5wvj powerop]# cat /tmp/hs_err_pid1062.log | more
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f6be8b25d12, pid=1062, tid=0x00007f6ddfdff640
#
# JRE version: OpenJDK Runtime Environment (8.0_302-b08) (build 1.8.0_302-b08)
# Java VM: OpenJDK 64-Bit Server VM (25.302-b08 mixed mode linux-amd64 )
# Problematic frame:
# C [libonnxruntime_providers_cuda.so+0x1a4d12]
#
# Core dump written. Default location: //core or core.1062
#
# If you would like to submit a bug report, please visit:
# https://github.com/adoptium/adoptium-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
--------------- T H R E A D ---------------
Current thread (0x00007f6ef6394000): JavaThread "igniteThread" daemon [_thread_in_native, id=1579, stack(0x00007f6ddfdc0000,0x00007f6ddfe00000)]
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000000
Registers:
RAX=0x00007f6c07e18828, RBX=0x00007f6ddfdfc570, RCX=0x0000000000000006, RDX=0x0000000000000000
RSP=0x00007f6ddfdfc550, RBP=0x00007f6ddfdfc650, RSI=0x0000000000000000, RDI=0x00007f6ddfdfc570
R8 =0x00007f6ddd6256a0, R9 =0x00007f6ddd618db8, R10=0x0000000000000000, R11=0x00007f6ddd625700
R12=0x00007f6d5c686a80, R13=0x00007f6ddfdfc570, R14=0x00007f6c05eccc78, R15=0x0000000000000000
RIP=0x00007f6be8b25d12, EFLAGS=0x0000000000010246, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
TRAPNO=0x000000000000000e
Top of Stack: (sp=0x00007f6ddfdfc550)
0x00007f6ddfdfc550: 00007f6ddfdfc570 a662eca985aa6800
0x00007f6ddfdfc560: 00007f6ddfdfc590 00007f6be8ae3708
0x00007f6ddfdfc570: 000000770000007c 0000005d0000006e
0x00007f6ddfdfc580: 0000000000000000 0000000001180470
0x00007f6ddfdfc590: 00007f6ddfdfc5a0 0000000000000000
0x00007f6ddfdfc5a0: 00007f6d5ee78b00 00007f79a8ffa838
0x00007f6ddfdfc5b0: 0000000000000000 00007f79a8eb00fe
0x00007f6ddfdfc5c0: 0000000000000000 0000000000000000
0x00007f6ddfdfc5d0: 0000000000000020 00007f79a8ffa838
0x00007f6ddfdfc5e0: 00007f6d5ca51370 00007f6c07e19cd9
0x00007f6ddfdfc5f0: 00007f79a8ffbee8 00007f79a8e577a2
0x00007f6ddfdfc600: 0000000000000040 00007f6ddd4beda0
0x00007f6ddfdfc610: 00007f6c05eccc80 a662eca985aa6800
0x00007f6ddfdfc620: 00007f6ddd4beda0 00007f6ddfdfc650
0x00007f6ddfdfc630: 00007ffc745823a8 00007ffc74582560
0x00007f6ddfdfc640: 00007f6c05eccc78 00007f6be8a1d762
0x00007f6ddfdfc650: 0000000000011c30 0000000000000470
0x00007f6ddfdfc660: 000004a0000011c1 0000000000000002
0x00007f6ddfdfc670: 0000000000000011 000000000000008e
0x00007f6ddfdfc680: 000000790000007c 000000e90000007f
0x00007f6ddfdfc690: 00007f6d5ca3edb0 ffffffffffffffb8
0x00007f6ddfdfc6a0: 0000000000011c00 00007f6dc8000020
0x00007f6ddfdfc6b0: 00007ffc74582560 00007f6bbfe70470
0x00007f6ddfdfc6c0: 00007f6bc59ae680 a662eca985aa6800
0x00007f6ddfdfc6d0: 00007f6bc59ae680 00007f6c05ecc318
0x00007f6ddfdfc6e0: 0000000000000036 00007ffc745823a8
0x00007f6ddfdfc6f0: 00007ffc74582560 00007f6c05eccc78
0x00007f6ddfdfc700: 0000000000000000 00007f79a95cb1ee
0x00007f6ddfdfc710: fffffffffffffff8 0000000000000036
0x00007f6ddfdfc720: 00007ffc745823a8 00007ffc74582560
0x00007f6ddfdfc730: 00007f6d5c6de6c0 00007f79a95cb2dc
0x00007f6ddfdfc740: 00007ffc745823a8 00007f6ddfdfca40
Instructions: (pc=0x00007f6be8b25d12)
0x00007f6be8b25cf2: 89 fb 48 83 ec 10 64 48 8b 04 25 28 00 00 00 48
0x00007f6be8b25d02: 89 44 24 08 31 c0 48 8d 05 19 2b 2f 1f 48 8b 30
0x00007f6be8b25d12: 48 8b 06 ff 50 30 48 8b 54 24 08 64 48 33 14 25
0x00007f6be8b25d22: 28 00 00 00 75 09 48 83 c4 10 48 89 d8 5b c3 e8
Register to memory mapping:
RAX=0x00007f6c07e18828: <offset 0x1f497828> in /opt/tomcat/temp/onnxruntime-java757573562719520016/libonnxruntime_providers_cuda.so at 0x00007f6be8981000
RBX=0x00007f6ddfdfc570 is pointing into the stack for thread: 0x00007f6ef6394000
RCX=0x0000000000000006 is an unknown value
RDX=0x0000000000000000 is an unknown value
RSP=0x00007f6ddfdfc550 is pointing into the stack for thread: 0x00007f6ef6394000
RBP=0x00007f6ddfdfc650 is pointing into the stack for thread: 0x00007f6ef6394000
RSI=0x0000000000000000 is an unknown value
RDI=0x00007f6ddfdfc570 is pointing into the stack for thread: 0x00007f6ef6394000
R8 =0x00007f6ddd6256a0: <offset 0x2256a0> in /lib64/libstdc++.so.6 at 0x00007f6ddd400000
R9 =0x00007f6ddd618db8: _ZTINSt6locale5facetE+0 in /lib64/libstdc++.so.6 at 0x00007f6ddd400000
R10=0x0000000000000000 is an unknown value
R11=0x00007f6ddd625700: <offset 0x225700> in /lib64/libstdc++.so.6 at 0x00007f6ddd400000
R12=0x00007f6d5c686a80 is an unknown value
R13=0x00007f6ddfdfc570 is pointing into the stack for thread: 0x00007f6ef6394000
R14=0x00007f6c05eccc78: <offset 0x1d54bc78> in /opt/tomcat/temp/onnxruntime-java757573562719520016/libonnxruntime_providers_cuda.so at 0x00007f6be8981000
R15=0x0000000000000000 is an unknown value
Stack: [0x00007f6ddfdc0000,0x00007f6ddfe00000], sp=0x00007f6ddfdfc550, free space=241k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libonnxruntime_providers_cuda.so+0x1a4d12]
C 0x0000000000000470
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j ai.onnxruntime.OrtSession$SessionOptions.addCUDA(JJI)V+0
j ai.onnxruntime.OrtSession$SessionOptions.addCUDA(I)V+19
j ai.onnxruntime.OrtSession$SessionOptions.addCUDA()V+2
j ai.djl.onnxruntime.engine.OrtEngine.hasCapability(Ljava/lang/String;)Z+29
j ai.djl.engine.Engine.defaultDevice()Lai/djl/Device;+10
j ai.djl.ndarray.BaseNDManager.defaultDevice()Lai/djl/Device;+4
j ai.djl.ndarray.BaseNDManager.<init>(Lai/djl/ndarray/NDManager;Lai/djl/Device;)V+39
j ai.djl.onnxruntime.engine.OrtNDManager.<init>(Lai/djl/ndarray/NDManager;Lai/djl/Device;Lai/onnxruntime/OrtEnvironment;)V+3
j ai.djl.onnxruntime.engine.OrtNDManager.<init>(Lai/djl/ndarray/NDManager;Lai/djl/Device;Lai/onnxruntime/OrtEnvironment;Lai/djl/onnxruntime/engine/OrtNDManager$1;)V+4
j ai.djl.onnxruntime.engine.OrtNDManager$SystemManager.<init>()V+15
j ai.djl.onnxruntime.engine.OrtNDManager.<clinit>()V+4
v ~StubRoutines::call_stub
j ai.djl.onnxruntime.engine.OrtEngine.newBaseManager(Lai/djl/Device;)Lai/djl/ndarray/NDManager;+0
j ai.djl.onnxruntime.engine.OrtEngine.newModel(Ljava/lang/String;Lai/djl/Device;)Lai/djl/Model;+7
j ai.djl.Model.newInstance(Ljava/lang/String;Lai/djl/Device;Ljava/lang/String;)Lai/djl/Model;+23
j ai.djl.repository.zoo.BaseModelLoader.createModel(Ljava/nio/file/Path;Ljava/lang/String;Lai/djl/Device;Lai/djl/nn/Block;Ljava/util/Map;Ljava/lang/String;)Lai/djl/Model;+4
j ai.djl.repository.zoo.BaseModelLoader.loadModel(Lai/djl/repository/zoo/Criteria;)Lai/djl/repository/zoo/ZooModel;+506
j ai.djl.repository.zoo.Criteria.loadModel()Lai/djl/repository/zoo/ZooModel;+524
What have you tried to solve it?
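I changed defaultDevice() in ai.djl.engine.Engine.java so that it checks the GPU count before calling hasCapability(StandardCapabilities.CUDA). With this change, the problem no longer reproduces: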
public Device defaultDevice() {
    if (defaultDevice == null) {
        if (CudaUtils.getGpuCount() > 0 && hasCapability(StandardCapabilities.CUDA)) { // check gpu-count first
            defaultDevice = Device.gpu();
        } else {
            defaultDevice = Device.cpu();
        }
    }
    return defaultDevice;
}
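
To make the reasoning behind the change explicit: the Java frames in the crash log show hasCapability reaching SessionOptions.addCUDA(), which is where the native SIGSEGV occurs. Because `&&` short-circuits, putting the GPU-count check first means the CUDA probe is never invoked on a host without a GPU. The sketch below is a self-contained illustration of that ordering only. `DeviceSelectionSketch`, `selectDevice`, and the two supplier parameters are hypothetical stand-ins, not DJL or ONNX Runtime APIs.

```java
import java.util.function.BooleanSupplier;
import java.util.function.IntSupplier;

/** Illustrative stand-in (not DJL code): pick a device without probing CUDA on CPU-only hosts. */
public class DeviceSelectionSketch {

    /**
     * Returns "gpu" only if at least one GPU is reported AND the capability probe succeeds.
     * gpuCount stands in for a cheap GPU-count lookup; cudaProbe stands in for the
     * capability check that loads the CUDA provider.
     */
    static String selectDevice(IntSupplier gpuCount, BooleanSupplier cudaProbe) {
        // && short-circuits: cudaProbe is not evaluated when gpuCount reports 0.
        if (gpuCount.getAsInt() > 0 && cudaProbe.getAsBoolean()) {
            return "gpu";
        }
        return "cpu";
    }

    public static void main(String[] args) {
        // Simulate the CPU-only container: the probe throws if it is ever called.
        String device = selectDevice(() -> 0, () -> {
            throw new IllegalStateException("CUDA provider must not be probed");
        });
        System.out.println(device);
    }
}
```

In the simulated CPU-only case the throwing probe is never reached, so the method falls through to the CPU branch. Whether the GPU-count lookup itself is safe in every environment is a separate question that this sketch does not address.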