onnxruntime_backend
Regressions for ORT backend
Description
We have seen performance regressions for an ONNX model:
- using the ORT backend
- with Loop and Memcpy nodes (the latter is probably the most relevant)
The regression has been present since version 21.12 of the container and up to version 22.9 (this corresponds to the bump of ORT to version 1.10, and then progressively to 1.12.1). Specifically, we have found that the processing time is about 2-3 times higher, depending on the load sent.
Then, since the current version (22.10), we get some kind of memory error after a small number of requests have been served, with memory usage more or less increasing continuously prior to the crash. An extract of the error log:
2022-11-22 11:05:44.304827904 [E:onnxruntime:log, cuda_call.cc:119 CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=82bca2dd778f ; expr=cudaStreamSynchronize(static_cast<cudaStream_t>(GetComputeStream()));
2022-11-22 11:05:44.305936978 [E:onnxruntime:log, cuda_call.cc:119 CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=82bca2dd778f ; expr=cudaMemcpyAsync(dst_data, src_data, bytes, cudaMemcpyHostToDevice, GetStream(kCudaStreamDefault));
2022-11-22 11:05:44.307103367 [E:onnxruntime:log, cuda_call.cc:119 CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=82bca2dd778f ; expr=cudaMemcpyAsync(dst_data, src_data, bytes, cudaMemcpyHostToDevice, GetStream(kCudaStreamDefault));
terminate called after throwing an instance of 'I1122 11:05:44.306511 19 grpc_server.cc:3800] Process for ModelInferHandler, rpc_ok=1, 0 step COMPLETE
onnxruntime::OnnxRuntimeException'
what(): /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:124 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:117 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=82bca2dd778f ; expr=cudaEventDestroy(read_event_);
And after a short while:
0# 0x0000560C9A6591D9 in tritonserver
1# 0x00007FCC9A6B4090 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
4# 0x00007FCC9AA6D911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
5# 0x00007FCC9AA7938C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
6# 0x00007FCC9AA78369 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
7# __gxx_personality_v0 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
8# 0x00007FCC9A873BEF in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
9# _Unwind_Resume in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
10# 0x00007FCC3609D5F2 in /home/app/tritonserver/backends/onnxruntime/libonnxruntime_providers_cuda.so
11# 0x00007FCC360B4D3D in /home/app/tritonserver/backends/onnxruntime/libonnxruntime_providers_cuda.so
12# 0x00007FCC4EAF45D1 in /home/app/tritonserver/backends/onnxruntime/libonnxruntime.so
13# 0x00007FCC4EAF5A4C in /home/app/tritonserver/backends/onnxruntime/libonnxruntime.so
14# 0x00007FCC4EA451A8 in /home/app/tritonserver/backends/onnxruntime/libonnxruntime.so
15# 0x00007FCC4F1D03F9 in /home/app/tritonserver/backends/onnxruntime/libonnxruntime.so
16# 0x00007FCC4F1D2CB5 in /home/app/tritonserver/backends/onnxruntime/libonnxruntime.so
17# 0x00007FCC4F0E00DD in /home/app/tritonserver/backends/onnxruntime/libonnxruntime.so
18# 0x00007FCC4F0E3341 in /home/app/tritonserver/backends/onnxruntime/libonnxruntime.so
19# 0x00007FCC4EC0EC98 in /home/app/tritonserver/backends/onnxruntime/libonnxruntime.so
20# 0x00007FCC36277489 in /home/app/tritonserver/backends/onnxruntime/libonnxruntime_providers_cuda.so
21# 0x00007FCC360930C2 in /home/app/tritonserver/backends/onnxruntime/libonnxruntime_providers_cuda.so
22# 0x00007FCC4F1E6F7A in /home/app/tritonserver/backends/onnxruntime/libonnxruntime.so
23# 0x00007FCC4F1D03F9 in /home/app/tritonserver/backends/onnxruntime/libonnxruntime.so
24# 0x00007FCC4F1D296C in /home/app/tritonserver/backends/onnxruntime/libonnxruntime.so
25# 0x00007FCC4EB24FDA in /home/app/tritonserver/backends/onnxruntime/libonnxruntime.so
26# 0x00007FCC4EB252C8 in /home/app/tritonserver/backends/onnxruntime/libonnxruntime.so
27# 0x00007FCC4EABDF4D in /home/app/tritonserver/backends/onnxruntime/libonnxruntime.so
28# 0x00007FCC745C8ADD in backends/onnxruntime/libtriton_onnxruntime.so
29# 0x00007FCC745DED9B in backends/onnxruntime/libtriton_onnxruntime.so
30# TRITONBACKEND_ModelInstanceExecute in backends/onnxruntime/libtriton_onnxruntime.so
31# 0x00007FCC9AF6B03A in /home/app/tritonserver/bin/../lib/libtritonserver.so
32# 0x00007FCC9AF6B9A7 in /home/app/tritonserver/bin/../lib/libtritonserver.so
33# 0x00007FCC9B03D841 in /home/app/tritonserver/bin/../lib/libtritonserver.so
34# 0x00007FCC9AF65CE7 in /home/app/tritonserver/bin/../lib/libtritonserver.so
35# 0x00007FCC9AAA5DE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
36# 0x00007FCC9BE1E609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
37# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
To be noted:
- This is not the only model I serve using this container. The other models are also ONNX models running with the ORT backend; they do not have Loop or Memcpy nodes and work fine, without any performance degradation or bug (see the inspection sketch below).
- I do not experience any problems when running the model directly with ORT, outside Triton (no slowdown, no bug).
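Since it may help triage, here is a small sketch (the model path is an assumption) that counts the op types of each exported graph, to confirm which models actually contain Loop nodes. Memcpy nodes are not stored in the .onnx file itself: ORT inserts MemcpyToHost/MemcpyFromHost nodes at session initialization when part of the graph is assigned to the CPU provider.
from collections import Counter
import onnx

def count_ops(graph, counter):
    # Recursively count op types, descending into Loop/If/Scan subgraphs.
    for node in graph.node:
        counter[node.op_type] += 1
        for attr in node.attribute:
            if attr.type == onnx.AttributeProto.GRAPH:
                count_ops(attr.g, counter)

ops = Counter()
count_ops(onnx.load("model_repository/my_model/1/model.onnx").graph, ops)
print(ops.most_common())  # shows, among others, how many Loop nodes the graph contains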
Triton Information
- What version of Triton are you using?
As mentioned above, the bug is on the latest version of Triton (version 2.27.0), and I've experienced the slowdown, which I believe is related, on all versions between 2.17 and 2.26.
- Are you using the Triton container or did you build it yourself?
I am using the container.
To Reproduce
Steps to reproduce the behavior:
The model is an encoder-decoder model; the decoder is run several times in a loop. The model is serialized to ONNX and optimized with ORT kernels (a rough sketch of this step follows the config below). Some of the nodes must run on CPU, so there are Memcpy nodes. I use the following config:
name: "my_model"
backend: "onnxruntime"
max_batch_size: 16
input [
{
name: "x"
data_type: TYPE_INT64
dims: [ -1 ]
},
{
name: "y"
data_type: TYPE_INT64
dims: [ -1 ]
}
]
output [
{
name: "z"
data_type: TYPE_INT64
dims: [ -1 ]
}
]
instance_group [
{
count: 3
kind: KIND_GPU
}
]
parameters {
key: "execution_mode"
value: {
string_value: "1"
}
}
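As mentioned above, the offline optimization step looks roughly like this (paths, provider list, and optimization level are illustrative, not the exact values we use):
import onnxruntime as ort

so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
# Creating the session writes the ORT-optimized graph to this path,
# which is then placed in the Triton model repository.
so.optimized_model_filepath = "model_repository/my_model/1/model.onnx"

ort.InferenceSession(
    "exported/model.onnx",
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)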
I've had the same problem with only 1 or 2 instances per GPU, and in multiple hardware settings (single/multi GPU). The load is generated with a simple gRPC client along the lines of the sketch below.
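A minimal client sketch, assuming the server is reachable on localhost:8001 and using illustrative sequence lengths (the real load is heavier and concurrent):
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Batched inputs: max_batch_size is 16, each input has one variable dimension.
x = np.random.randint(0, 1000, size=(1, 32), dtype=np.int64)
y = np.random.randint(0, 1000, size=(1, 32), dtype=np.int64)

inputs = [
    grpcclient.InferInput("x", list(x.shape), "INT64"),
    grpcclient.InferInput("y", list(y.shape), "INT64"),
]
inputs[0].set_data_from_numpy(x)
inputs[1].set_data_from_numpy(y)

result = client.infer(
    model_name="my_model",
    inputs=inputs,
    outputs=[grpcclient.InferRequestedOutput("z")],
)
print(result.as_numpy("z").shape)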
Expected behavior
No bug or regression in performance; if anything I'd expect gains.
Thank you for filing this detailed issue. We have filed a ticket to investigate this regression.