onnxruntime_backend
perf_analyzer failing with inputs -1 dimension
Perf analyzer is failing to create a concurrency manager with the following error:
[Model Analyzer] Running perf_analyzer failed with exit status 1 : error: failed to create concurrency manager: input attention_mask contains dynamic shape, provide shapes to send along with the request
The model uses the following inputs, which seem to be causing the issue:
```
{
  name: "attention_mask"
  data_type: TYPE_BOOL
  dims: [ -1 ]
},
{
  name: "input_ids"
  data_type: TYPE_INT64
  dims: [ -1 ]
}
```
Using the r22.04 version of Model Analyzer.
Please suggest how we can analyze such a model.
Are you using the perf_analyzer_flags option of Model Analyzer to supply shapes? See:
https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config.md#perf-analyzer-flags
Thanks for the pointer @matthewkotila
I set the shape for the mentioned inputs as follows:
```yaml
# Allows custom configuration of perf analyzer instances used by model analyzer
perf_analyzer_flags:
  shape:
    - attention_mask:1024
    - input_ids:1024
```
It gives me the following error:
[Model Analyzer] Running perf_analyzer failed with exit status 1 : *** Measurement Settings ***
Batch size: 1
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 50
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [0] had error: onnx runtime error 6: Non-zero status code returned while running Tile node. Name:'Tile_1400' Status Message: /workspace/onnxruntime/onnxruntime/core/framework/op_kernel.cc:78 OrtValue* onnxruntime::OpKernelContext::OutputMLValue(int, const onnxruntime::TensorShape&) status.IsOK() was false. Tensor shape cannot contain any negative value
Looks like the error is being returned from the model itself; it could be sensitive to the input data. Can you try adding `input-data: zero` to the perf_analyzer_flags section?
Are you able to perform inference on the model outside Triton?
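For example, something along these lines with the ORT Python API (a rough sketch; the model path and the two-input signature are assumptions, and the zero-filled tensors approximate what `--input-data zero` sends):

```python
import numpy as np
import onnxruntime as ort

# Load the model standalone, preferring the CUDA EP with CPU fallback.
sess = ort.InferenceSession(
    "model.onnx",  # hypothetical path to your exported model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Zero-filled inputs at a fixed length, mirroring --input-data zero
# with --shape input_ids:1024 --shape attention_mask:1024.
feeds = {
    "input_ids": np.zeros((1, 1024), dtype=np.int64),
    "attention_mask": np.zeros((1, 1024), dtype=bool),
}
outputs = sess.run(None, feeds)
print([o.shape for o in outputs])
```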
profile.yaml:

```yaml
perf_analyzer_flags:
  input-data: zero
  shape:
    - input_ids:359
    - attention_mask:359
```
Gives the following failure:
[Model Analyzer] Running perf_analyzer failed with exit status 1 : *** Measurement Settings ***
Batch size: 1
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 50
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [0] had error: Socket closed
It looks like the server is being shut down. Can you start tritonserver outside Model Analyzer, run `perf_analyzer -m <your model name> --shape input_ids:359 --shape attention_mask:359 --input-data zero --measurement-mode count_windows`, and share the tritonserver logs so we can see why the server is closing?
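Equivalently, sending a single request from the Python client will tell you whether one inference on its own brings the server down (a sketch, assuming the default HTTP port and that the model takes only these two inputs):

```python
import numpy as np
import tritonclient.http as httpclient

# One hand-built request that mirrors perf_analyzer's zero-data request.
client = httpclient.InferenceServerClient(url="localhost:8000")

inputs = [
    httpclient.InferInput("input_ids", [1, 359], "INT64"),
    httpclient.InferInput("attention_mask", [1, 359], "BOOL"),
]
inputs[0].set_data_from_numpy(np.zeros((1, 359), dtype=np.int64))
inputs[1].set_data_from_numpy(np.zeros((1, 359), dtype=bool))

# "my_model" is a placeholder; use your model's name.
result = client.infer("my_model", inputs)
print(result.get_response())
```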
@Tabrizian below are the logs for perf_analyzer and tritonserver
perf_analyzer logs
root@hostnc24rsv3-2:/workspace# perf_analyzer -m my_model --shape input_ids:359 --shape attention_mask:359 --input-data zero --measurement-mode count_windows
*** Measurement Settings ***
Batch size: 1
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 50
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Failed to update context stat: Timer not set correctly.
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [0] had error: failed to parse the request JSON buffer: The document is empty. at 0
tritonserver logs
Signal (11) received.
0# 0x000055AC4241F7E9 in tritonserver
1# 0x00007F3DE20390C0 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# 0x00007F3D3371296A in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
3# 0x00007F3D3371465A in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
4# 0x00007F3D33321668 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
5# 0x00007F3C60058809 in /opt/tritonserver/backends/onnxruntime/libonnxruntime_providers_cuda.so
6# 0x00007F3C5FE7D542 in /opt/tritonserver/backends/onnxruntime/libonnxruntime_providers_cuda.so
7# 0x00007F3D338184A5 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
8# 0x00007F3D33802789 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
9# 0x00007F3D3380486C in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
10# 0x00007F3D3325AD92 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
11# 0x00007F3D3325AFB8 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
12# 0x00007F3D3320242D in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
13# 0x00007F3D33DB87BD in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
14# 0x00007F3D33DCE203 in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
15# TRITONBACKEND_ModelInstanceExecute in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
16# 0x00007F3DE28E5D9A in /opt/tritonserver/bin/../lib/libtritonserver.so
17# 0x00007F3DE28E6757 in /opt/tritonserver/bin/../lib/libtritonserver.so
18# 0x00007F3DE29A1AB1 in /opt/tritonserver/bin/../lib/libtritonserver.so
19# 0x00007F3DE28DFC27 in /opt/tritonserver/bin/../lib/libtritonserver.so
20# 0x00007F3DE242ADE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
21# 0x00007F3DE3631609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
22# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
Thanks for providing further details. It looks like the onnxruntime backend is segfaulting. This issue is outside the scope of Model Analyzer, so I'll be transferring it to the onnxruntime backend repo. It would be great if you could provide your model file and model configuration so that this can be investigated further.
cc @pranavsharma @askhade
@Tabrizian -- I ran it again in the same environment we've been running the model in, and this is the error I received from perf_analyzer.
I ran perf_analyzer with the following parameters:
perf_analyzer -m bart_model --shape input_ids:359 --shape attention_mask:359 --input-data zero --measurement-mode count_windows
Error:
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [0] had error: onnx runtime error 1: Non-zero status code returned while running Loop node. Name:'Loop_3206' Status Message: Non-zero status code returned while running ConcatFromSequence node. Name:'ConcatFromSequence_3234' Status Message: concat.cc:62 PrepareForCompute Must have 1 or more inputs
tritonserver logs:
I0525 19:13:43.092956 93 http_server.cc:3142] HTTP request: 0 /v2/models/bart_model/stats
I0525 19:22:25.288387 93 http_server.cc:3142] HTTP request: 0 /v2/models/bart_model
I0525 19:22:25.289165 93 http_server.cc:3142] HTTP request: 0 /v2/models/bart_model/config
I0525 19:22:25.290315 93 http_server.cc:3142] HTTP request: 0 /v2
I0525 19:22:25.291264 93 http_server.cc:3142] HTTP request: 2 /v2/models/bart_model/infer
I0525 19:22:25.291348 93 infer_request.cc:707] prepared: [0x0x7f702000acb0] request id: , model: bart_model, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f702000a868] input: min_length.t.b, type: INT64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f70200048c8] input: max_length.t.b, type: INT64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f70200036c8] input: length_penalty.t.b, type: FP64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f7020008fe8] input: decoder_start_token_id.t.b, type: INT64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f7020010128] input: beam.t.b, type: INT64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f7020011928] input: input_ids, type: INT64, original shape: [1,359], batch + shape: [1,359], shape: [359]
[0x0x7f7020013038] input: attention_mask, type: BOOL, original shape: [1,359], batch + shape: [1,359], shape: [359]
override inputs:
inputs:
[0x0x7f7020013038] input: attention_mask, type: BOOL, original shape: [1,359], batch + shape: [1,359], shape: [359]
[0x0x7f7020011928] input: input_ids, type: INT64, original shape: [1,359], batch + shape: [1,359], shape: [359]
[0x0x7f7020010128] input: beam.t.b, type: INT64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f7020008fe8] input: decoder_start_token_id.t.b, type: INT64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f70200036c8] input: length_penalty.t.b, type: FP64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f70200048c8] input: max_length.t.b, type: INT64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f702000a868] input: min_length.t.b, type: INT64, original shape: [1,1], batch + shape: [1,1], shape: [1]
original requested outputs:
output_ids
requested outputs:
output_ids
I0525 19:22:25.291808 93 http_server.cc:3142] HTTP request: 0 /v2/models/bart_model/stats
I0525 19:22:25.292563 93 onnxruntime.cc:2597] model bart_model, instance bart_model_0, executing 1 requests
I0525 19:22:25.292589 93 onnxruntime.cc:1417] TRITONBACKEND_ModelExecute: Running bart_model_0 with 1 requests
2022-05-25 19:22:25.294237169 [I:onnxruntime:log, bfc_arena.cc:26 BFCArena] Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 memory limit: 18446744073709551615 arena_extend_strategy: 0
2022-05-25 19:22:25.294264967 [V:onnxruntime:log, bfc_arena.cc:62 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2022-05-25 19:22:25.294277867 [I:onnxruntime:log, bfc_arena.cc:317 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:1 (requested) num_bytes: 359 (actual) rounded_bytes:512
2022-05-25 19:22:25.294477356 [I:onnxruntime:log, bfc_arena.cc:197 Extend] Extended allocation by 1048576 bytes.
2022-05-25 19:22:25.294495856 [I:onnxruntime:log, bfc_arena.cc:200 Extend] Total allocated bytes: 1048576
2022-05-25 19:22:25.294509455 [I:onnxruntime:log, bfc_arena.cc:203 Extend] Allocated memory at 0x7f6ff0800000 to 0x7f6ff0900000
2022-05-25 19:22:25.294838438 [I:onnxruntime:, sequential_executor.cc:172 Execute] Begin execution
2022-05-25 19:22:25.294922534 [I:onnxruntime:log, bfc_arena.cc:317 AllocateRawInternal] Extending BFCArena for CudaPinned. bin_num:0 (requested) num_bytes: 16 (actual) rounded_bytes:256
2022-05-25 19:22:25.294993330 [I:onnxruntime:log, bfc_arena.cc:197 Extend] Extended allocation by 1048576 bytes.
2022-05-25 19:22:25.295007129 [I:onnxruntime:log, bfc_arena.cc:200 Extend] Total allocated bytes: 1048576
2022-05-25 19:22:25.295019029 [I:onnxruntime:log, bfc_arena.cc:203 Extend] Allocated memory at 0x7f71b9a00400 to 0x7f71b9b00400
2022-05-25 19:22:25.295128623 [I:onnxruntime:log, bfc_arena.cc:317 AllocateRawInternal] Extending BFCArena for CUDA_CPU. bin_num:0 (requested) num_bytes: 16 (actual) rounded_bytes:256
2022-05-25 19:22:25.295153522 [I:onnxruntime:log, bfc_arena.cc:197 Extend] Extended allocation by 1048576 bytes.
2022-05-25 19:22:25.295164021 [I:onnxruntime:log, bfc_arena.cc:200 Extend] Total allocated bytes: 1048576
2022-05-25 19:22:25.295171921 [I:onnxruntime:log, bfc_arena.cc:203 Extend] Allocated memory at 0x7f71f8065380 to 0x7f71f8165380
2022-05-25 19:22:25.295381310 [I:onnxruntime:log, bfc_arena.cc:317 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:11 (requested) num_bytes: 735232 (actual) rounded_bytes:735232
2022-05-25 19:22:25.295614798 [I:onnxruntime:log, bfc_arena.cc:197 Extend] Extended allocation by 2097152 bytes.
2022-05-25 19:22:25.295633497 [I:onnxruntime:log, bfc_arena.cc:200 Extend] Total allocated bytes: 3145728
2022-05-25 19:22:25.295645896 [I:onnxruntime:log, bfc_arena.cc:203 Extend] Allocated memory at 0x7f6ff0a00000 to 0x7f6ff0c00000
2022-05-25 19:22:25.296074674 [I:onnxruntime:log, bfc_arena.cc:317 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:11 (requested) num_bytes: 735232 (actual) rounded_bytes:735232
2022-05-25 19:22:25.296291263 [I:onnxruntime:log, bfc_arena.cc:197 Extend] Extended allocation by 4194304 bytes.
2022-05-25 19:22:25.296310262 [I:onnxruntime:log, bfc_arena.cc:200 Extend] Total allocated bytes: 7340032
2022-05-25 19:22:25.296323361 [I:onnxruntime:log, bfc_arena.cc:203 Extend] Allocated memory at 0x7f6ff0c00000 to 0x7f6ff1000000
2022-05-25 19:22:25.296467054 [I:onnxruntime:log, bfc_arena.cc:317 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:13 (requested) num_bytes: 4124192 (actual) rounded_bytes:4124416
2022-05-25 19:22:25.296667444 [I:onnxruntime:log, bfc_arena.cc:197 Extend] Extended allocation by 8388608 bytes.
2022-05-25 19:22:25.296685143 [I:onnxruntime:log, bfc_arena.cc:200 Extend] Total allocated bytes: 15728640
2022-05-25 19:22:25.296697942 [I:onnxruntime:log, bfc_arena.cc:203 Extend] Allocated memory at 0x7f6ff1000000 to 0x7f6ff1800000
2022-05-25 19:22:25.302946720 [I:onnxruntime:, sequential_executor.cc:172 Execute] Begin execution
2022-05-25 19:22:25.303141110 [I:onnxruntime:, sequential_executor.cc:172 Execute] Begin execution
2022-05-25 19:22:25.303279703 [I:onnxruntime:, sequential_executor.cc:172 Execute] Begin execution
2022-05-25 19:22:25.303408496 [I:onnxruntime:, sequential_executor.cc:172 Execute] Begin execution
2022-05-25 19:22:25.303460494 [E:onnxruntime:, sequential_executor.cc:364 Execute] Non-zero status code returned while running ConcatFromSequence node. Name:'ConcatFromSequence_3234' Status Message: concat.cc:62 PrepareForCompute Must have 1 or more inputs
2022-05-25 19:22:25.303482993 [E:onnxruntime:, sequential_executor.cc:364 Execute] Non-zero status code returned while running Loop node. Name:'Loop_3206' Status Message: Non-zero status code returned while running ConcatFromSequence node. Name:'ConcatFromSequence_3234' Status Message: concat.cc:62 PrepareForCompute Must have 1 or more inputs
Unable to run perf analysis on the model; it fails with the onnxruntime error below. Could someone please help? cc @pranavsharma @askhade
[Model Analyzer] Profiling bart_model_config_default: client batch size=1, concurrency=1
[Model Analyzer] Running perf_analyzer failed with exit status 1 : *** Measurement Settings ***
Batch size: 1
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 50
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [0] had error: onnx runtime error 1: Non-zero status code returned while running Loop node. Name:'Loop_3206' Status Message: Non-zero status code returned while running ConcatFromSequence node. Name:'ConcatFromSequence_3234' Status Message: concat.cc:62 PrepareForCompute Must have 1 or more inputs
@aishanibhalla The error is coming from the model itself. Are you able to perform inference on the model outside Triton?
@Tabrizian
Yes, I'm able to run this model just fine with onnxruntime.
I wonder how the model analyzer is running the model in order to profile it.
I'm using the CUDAExecutionProvider as a provider and the following session options.
```python
import onnxruntime
from onnxruntime import SessionOptions, GraphOptimizationLevel

sess_options = SessionOptions()
sess_options.log_severity_level = 4  # log fatal messages only
sess_options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL
ort_sess = onnxruntime.InferenceSession(model_path, sess_options,
                                        providers=['CUDAExecutionProvider'])
```
The first thing to do is to check whether the model can be run with ORT alone (that is, without Triton). Even there, I would first check whether this is an execution provider issue: run with the CPU execution provider first and then try CUDA. Please post the results here.
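As a sketch of that check (the model path and zero-filled feeds are assumptions, same as the earlier snippet; note the comment about the model's other inputs):

```python
import numpy as np
import onnxruntime as ort

# Zero-filled feeds mirroring perf_analyzer's --input-data zero --shape ...:359.
# NOTE: the Triton logs above show the served model also takes min_length.t.b,
# max_length.t.b, length_penalty.t.b, decoder_start_token_id.t.b, and beam.t.b;
# a real run would need to add those feeds with sensible values.
feeds = {
    "input_ids": np.zeros((1, 359), dtype=np.int64),
    "attention_mask": np.zeros((1, 359), dtype=bool),
}

# Try the CPU EP first, then CUDA, and compare the outcomes.
for providers in (["CPUExecutionProvider"], ["CUDAExecutionProvider"]):
    try:
        sess = ort.InferenceSession("bart_model.onnx", providers=providers)
        sess.run(None, feeds)
        print(providers[0], "OK")
    except Exception as e:
        print(providers[0], "failed:", e)
```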
@pranavsharma The decoder attention op is NOT supported on CPU yet. For this model, we need to use the GPU build of ONNX Runtime.