onnxruntime_backend
perf_analyzer failing with inputs -1 dimension
Perf analyzer is failing to create a concurrency manager with the following error:
[Model Analyzer] Running perf_analyzer failed with exit status 1 : error: failed to create concurrency manager: input attention_mask contains dynamic shape, provide shapes to send along with the request
The model uses the following inputs, which seem to be causing the issue:
```
{
  name: "attention_mask"
  data_type: TYPE_BOOL
  dims: [ -1 ]
},
{
  name: "input_ids"
  data_type: TYPE_INT64
  dims: [ -1 ]
}
```
Using the r22.04 version of Model Analyzer.
Please suggest how we can analyze such a model.
Are you using the perf_analyzer_flags option of Model Analyzer to supply shapes? See:
https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config.md#perf-analyzer-flags
Thanks for the pointer @matthewkotila
I set the shape for the mentioned inputs as follows:
```yaml
# Allows custom configuration of perf analyzer instances used by model analyzer
perf_analyzer_flags:
  shape:
    - attention_mask:1024
    - input_ids:1024
```
It gives me the following error:
[Model Analyzer] Running perf_analyzer failed with exit status 1 : *** Measurement Settings ***
Batch size: 1
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 50
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [0] had error: onnx runtime error 6: Non-zero status code returned while running Tile node. Name:'Tile_1400' Status Message: /workspace/onnxruntime/onnxruntime/core/framework/op_kernel.cc:78 OrtValue* onnxruntime::OpKernelContext::OutputMLValue(int, const onnxruntime::TensorShape&) status.IsOK() was false. Tensor shape cannot contain any negative value
Looks like the error is being returned from the model itself; it could be sensitive to the input data. Can you try adding `input-data: zero` to the perf_analyzer_flags section?
Are you able to perform inference on the model outside Triton?
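For example, something along these lines with the ORT Python API (a rough sketch; the model path and the two-input signature are assumptions, and the zero-filled tensors approximate what `--input-data zero` sends):

```python
import numpy as np
import onnxruntime as ort

# Load the model standalone, preferring the CUDA EP with CPU fallback.
sess = ort.InferenceSession(
    "model.onnx",  # hypothetical path to your exported model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Zero-filled inputs at a fixed length, mirroring --input-data zero
# with --shape input_ids:1024 --shape attention_mask:1024.
feeds = {
    "input_ids": np.zeros((1, 1024), dtype=np.int64),
    "attention_mask": np.zeros((1, 1024), dtype=bool),
}
outputs = sess.run(None, feeds)
print([o.shape for o in outputs])
```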
profile.yaml:

```yaml
perf_analyzer_flags:
  input-data: zero
  shape:
    - input_ids:359
    - attention_mask:359
```
Gives the following failure:
[Model Analyzer] Running perf_analyzer failed with exit status 1 : *** Measurement Settings ***
Batch size: 1
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 50
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [0] had error: Socket closed
It looks like the server is being shut down. Can you start tritonserver outside Model Analyzer, run `perf_analyzer -m <your model name> --shape input_ids:359 --shape attention_mask:359 --input-data zero --measurement-mode count_windows`, and share the tritonserver logs so we can see why the server is closing?
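Equivalently, sending a single request from the Python client will tell you whether one inference on its own brings the server down (a sketch, assuming the default HTTP port and that the model takes only these two inputs):

```python
import numpy as np
import tritonclient.http as httpclient

# One hand-built request that mirrors perf_analyzer's zero-data request.
client = httpclient.InferenceServerClient(url="localhost:8000")

inputs = [
    httpclient.InferInput("input_ids", [1, 359], "INT64"),
    httpclient.InferInput("attention_mask", [1, 359], "BOOL"),
]
inputs[0].set_data_from_numpy(np.zeros((1, 359), dtype=np.int64))
inputs[1].set_data_from_numpy(np.zeros((1, 359), dtype=bool))

# "my_model" is a placeholder; use your model's name.
result = client.infer("my_model", inputs)
print(result.get_response())
```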
@Tabrizian below are the logs for perf_analyzer and tritonserver
perf_analyzer logs
root@hostnc24rsv3-2:/workspace# perf_analyzer -m my_model --shape input_ids:359 --shape attention_mask:359 --input-data zero --measurement-mode count_windows
*** Measurement Settings ***
Batch size: 1
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 50
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Failed to update context stat: Timer not set correctly.
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [0] had error: failed to parse the request JSON buffer: The document is empty. at 0
tritonserver logs
Signal (11) received.
0# 0x000055AC4241F7E9 in tritonserver
1# 0x00007F3DE20390C0 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# 0x00007F3D3371296A in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
3# 0x00007F3D3371465A in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
4# 0x00007F3D33321668 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
5# 0x00007F3C60058809 in /opt/tritonserver/backends/onnxruntime/libonnxruntime_providers_cuda.so
6# 0x00007F3C5FE7D542 in /opt/tritonserver/backends/onnxruntime/libonnxruntime_providers_cuda.so
7# 0x00007F3D338184A5 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
8# 0x00007F3D33802789 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
9# 0x00007F3D3380486C in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
10# 0x00007F3D3325AD92 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
11# 0x00007F3D3325AFB8 in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
12# 0x00007F3D3320242D in /opt/tritonserver/backends/onnxruntime/libonnxruntime.so
13# 0x00007F3D33DB87BD in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
14# 0x00007F3D33DCE203 in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
15# TRITONBACKEND_ModelInstanceExecute in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
16# 0x00007F3DE28E5D9A in /opt/tritonserver/bin/../lib/libtritonserver.so
17# 0x00007F3DE28E6757 in /opt/tritonserver/bin/../lib/libtritonserver.so
18# 0x00007F3DE29A1AB1 in /opt/tritonserver/bin/../lib/libtritonserver.so
19# 0x00007F3DE28DFC27 in /opt/tritonserver/bin/../lib/libtritonserver.so
20# 0x00007F3DE242ADE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
21# 0x00007F3DE3631609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
22# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
Thanks for providing further details. It looks like the onnxruntime backend is segfaulting. This issue is outside the scope of Model Analyzer, so I'll be transferring it to the onnxruntime backend repo. It would be great if you could provide your model file and model configuration so that this can be investigated further.
cc @pranavsharma @askhade
@Tabrizian -- I ran it again in the same environment we've been running the model in, and this is the error I received from perf_analyzer.
I ran perf_analyzer with the following parameters:
perf_analyzer -m bart_model --shape input_ids:359 --shape attention_mask:359 --input-data zero --measurement-mode count_windows
Error:
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [0] had error: onnx runtime error 1: Non-zero status code returned while running Loop node. Name:'Loop_3206' Status Message: Non-zero status code returned while running ConcatFromSequence node. Name:'ConcatFromSequence_3234' Status Message: concat.cc:62 PrepareForCompute Must have 1 or more inputs
tritonserver logs:
I0525 19:13:43.092956 93 http_server.cc:3142] HTTP request: 0 /v2/models/bart_model/stats
I0525 19:22:25.288387 93 http_server.cc:3142] HTTP request: 0 /v2/models/bart_model
I0525 19:22:25.289165 93 http_server.cc:3142] HTTP request: 0 /v2/models/bart_model/config
I0525 19:22:25.290315 93 http_server.cc:3142] HTTP request: 0 /v2
I0525 19:22:25.291264 93 http_server.cc:3142] HTTP request: 2 /v2/models/bart_model/infer
I0525 19:22:25.291348 93 infer_request.cc:707] prepared: [0x0x7f702000acb0] request id: , model: bart_model, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f702000a868] input: min_length.t.b, type: INT64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f70200048c8] input: max_length.t.b, type: INT64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f70200036c8] input: length_penalty.t.b, type: FP64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f7020008fe8] input: decoder_start_token_id.t.b, type: INT64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f7020010128] input: beam.t.b, type: INT64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f7020011928] input: input_ids, type: INT64, original shape: [1,359], batch + shape: [1,359], shape: [359]
[0x0x7f7020013038] input: attention_mask, type: BOOL, original shape: [1,359], batch + shape: [1,359], shape: [359]
override inputs:
inputs:
[0x0x7f7020013038] input: attention_mask, type: BOOL, original shape: [1,359], batch + shape: [1,359], shape: [359]
[0x0x7f7020011928] input: input_ids, type: INT64, original shape: [1,359], batch + shape: [1,359], shape: [359]
[0x0x7f7020010128] input: beam.t.b, type: INT64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f7020008fe8] input: decoder_start_token_id.t.b, type: INT64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f70200036c8] input: length_penalty.t.b, type: FP64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f70200048c8] input: max_length.t.b, type: INT64, original shape: [1,1], batch + shape: [1,1], shape: [1]
[0x0x7f702000a868] input: min_length.t.b, type: INT64, original shape: [1,1], batch + shape: [1,1], shape: [1]
original requested outputs:
output_ids
requested outputs:
output_ids
I0525 19:22:25.291808 93 http_server.cc:3142] HTTP request: 0 /v2/models/bart_model/stats
I0525 19:22:25.292563 93 onnxruntime.cc:2597] model bart_model, instance bart_model_0, executing 1 requests
I0525 19:22:25.292589 93 onnxruntime.cc:1417] TRITONBACKEND_ModelExecute: Running bart_model_0 with 1 requests
2022-05-25 19:22:25.294237169 [I:onnxruntime:log, bfc_arena.cc:26 BFCArena] Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 memory limit: 18446744073709551615 arena_extend_strategy: 0
2022-05-25 19:22:25.294264967 [V:onnxruntime:log, bfc_arena.cc:62 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2022-05-25 19:22:25.294277867 [I:onnxruntime:log, bfc_arena.cc:317 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:1 (requested) num_bytes: 359 (actual) rounded_bytes:512
2022-05-25 19:22:25.294477356 [I:onnxruntime:log, bfc_arena.cc:197 Extend] Extended allocation by 1048576 bytes.
2022-05-25 19:22:25.294495856 [I:onnxruntime:log, bfc_arena.cc:200 Extend] Total allocated bytes: 1048576
2022-05-25 19:22:25.294509455 [I:onnxruntime:log, bfc_arena.cc:203 Extend] Allocated memory at 0x7f6ff0800000 to 0x7f6ff0900000
2022-05-25 19:22:25.294838438 [I:onnxruntime:, sequential_executor.cc:172 Execute] Begin execution
2022-05-25 19:22:25.294922534 [I:onnxruntime:log, bfc_arena.cc:317 AllocateRawInternal] Extending BFCArena for CudaPinned. bin_num:0 (requested) num_bytes: 16 (actual) rounded_bytes:256
2022-05-25 19:22:25.294993330 [I:onnxruntime:log, bfc_arena.cc:197 Extend] Extended allocation by 1048576 bytes.
2022-05-25 19:22:25.295007129 [I:onnxruntime:log, bfc_arena.cc:200 Extend] Total allocated bytes: 1048576
2022-05-25 19:22:25.295019029 [I:onnxruntime:log, bfc_arena.cc:203 Extend] Allocated memory at 0x7f71b9a00400 to 0x7f71b9b00400
2022-05-25 19:22:25.295128623 [I:onnxruntime:log, bfc_arena.cc:317 AllocateRawInternal] Extending BFCArena for CUDA_CPU. bin_num:0 (requested) num_bytes: 16 (actual) rounded_bytes:256
2022-05-25 19:22:25.295153522 [I:onnxruntime:log, bfc_arena.cc:197 Extend] Extended allocation by 1048576 bytes.
2022-05-25 19:22:25.295164021 [I:onnxruntime:log, bfc_arena.cc:200 Extend] Total allocated bytes: 1048576
2022-05-25 19:22:25.295171921 [I:onnxruntime:log, bfc_arena.cc:203 Extend] Allocated memory at 0x7f71f8065380 to 0x7f71f8165380
2022-05-25 19:22:25.295381310 [I:onnxruntime:log, bfc_arena.cc:317 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:11 (requested) num_bytes: 735232 (actual) rounded_bytes:735232
2022-05-25 19:22:25.295614798 [I:onnxruntime:log, bfc_arena.cc:197 Extend] Extended allocation by 2097152 bytes.
2022-05-25 19:22:25.295633497 [I:onnxruntime:log, bfc_arena.cc:200 Extend] Total allocated bytes: 3145728
2022-05-25 19:22:25.295645896 [I:onnxruntime:log, bfc_arena.cc:203 Extend] Allocated memory at 0x7f6ff0a00000 to 0x7f6ff0c00000
2022-05-25 19:22:25.296074674 [I:onnxruntime:log, bfc_arena.cc:317 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:11 (requested) num_bytes: 735232 (actual) rounded_bytes:735232
2022-05-25 19:22:25.296291263 [I:onnxruntime:log, bfc_arena.cc:197 Extend] Extended allocation by 4194304 bytes.
2022-05-25 19:22:25.296310262 [I:onnxruntime:log, bfc_arena.cc:200 Extend] Total allocated bytes: 7340032
2022-05-25 19:22:25.296323361 [I:onnxruntime:log, bfc_arena.cc:203 Extend] Allocated memory at 0x7f6ff0c00000 to 0x7f6ff1000000
2022-05-25 19:22:25.296467054 [I:onnxruntime:log, bfc_arena.cc:317 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:13 (requested) num_bytes: 4124192 (actual) rounded_bytes:4124416
2022-05-25 19:22:25.296667444 [I:onnxruntime:log, bfc_arena.cc:197 Extend] Extended allocation by 8388608 bytes.
2022-05-25 19:22:25.296685143 [I:onnxruntime:log, bfc_arena.cc:200 Extend] Total allocated bytes: 15728640
2022-05-25 19:22:25.296697942 [I:onnxruntime:log, bfc_arena.cc:203 Extend] Allocated memory at 0x7f6ff1000000 to 0x7f6ff1800000
2022-05-25 19:22:25.302946720 [I:onnxruntime:, sequential_executor.cc:172 Execute] Begin execution
2022-05-25 19:22:25.303141110 [I:onnxruntime:, sequential_executor.cc:172 Execute] Begin execution
2022-05-25 19:22:25.303279703 [I:onnxruntime:, sequential_executor.cc:172 Execute] Begin execution
2022-05-25 19:22:25.303408496 [I:onnxruntime:, sequential_executor.cc:172 Execute] Begin execution
2022-05-25 19:22:25.303460494 [E:onnxruntime:, sequential_executor.cc:364 Execute] Non-zero status code returned while running ConcatFromSequence node. Name:'ConcatFromSequence_3234' Status Message: concat.cc:62 PrepareForCompute Must have 1 or more inputs
2022-05-25 19:22:25.303482993 [E:onnxruntime:, sequential_executor.cc:364 Execute] Non-zero status code returned while running Loop node. Name:'Loop_3206' Status Message: Non-zero status code returned while running ConcatFromSequence node. Name:'ConcatFromSequence_3234' Status Message: concat.cc:62 PrepareForCompute Must have 1 or more inputs
Unable to run perf analysis on the model; it fails with the onnxruntime error below. Could someone please help? cc @pranavsharma @askhade
[Model Analyzer] Profiling bart_model_config_default: client batch size=1, concurrency=1
[Model Analyzer] Running perf_analyzer failed with exit status 1 : *** Measurement Settings ***
Batch size: 1
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 50
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [0] had error: onnx runtime error 1: Non-zero status code returned while running Loop node. Name:'Loop_3206' Status Message: Non-zero status code returned while running ConcatFromSequence node. Name:'ConcatFromSequence_3234' Status Message: concat.cc:62 PrepareForCompute Must have 1 or more inputs
@aishanibhalla The error is coming from the model itself. Are you able to perform inference on the model outside Triton?
@Tabrizian
Yes, I'm able to run this model just fine with onnxruntime.
I wonder how the model analyzer is running the model in order to profile it.
I'm using the CUDAExecutionProvider as a provider and the following session options.
```python
import onnxruntime
from onnxruntime import SessionOptions, GraphOptimizationLevel

sess_options = SessionOptions()
sess_options.log_severity_level = 4  # log fatal messages only
sess_options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL
ort_sess = onnxruntime.InferenceSession(model_path, sess_options,
                                        providers=['CUDAExecutionProvider'])
```
The first thing to do is to check whether the model can be run with ORT alone (that is, without Triton). Even there, I would first check whether this is an execution provider issue: run with the CPU execution provider first and then try CUDA. Please post the results here.
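As a sketch of that check (the model path and zero-filled feeds are assumptions, same as the earlier snippet; note the comment about the model's other inputs):

```python
import numpy as np
import onnxruntime as ort

# Zero-filled feeds mirroring perf_analyzer's --input-data zero --shape ...:359.
# NOTE: the Triton logs above show the served model also takes min_length.t.b,
# max_length.t.b, length_penalty.t.b, decoder_start_token_id.t.b, and beam.t.b;
# a real run would need to add those feeds with sensible values.
feeds = {
    "input_ids": np.zeros((1, 359), dtype=np.int64),
    "attention_mask": np.zeros((1, 359), dtype=bool),
}

# Try the CPU EP first, then CUDA, and compare the outcomes.
for providers in (["CPUExecutionProvider"], ["CUDAExecutionProvider"]):
    try:
        sess = ort.InferenceSession("bart_model.onnx", providers=providers)
        sess.run(None, feeds)
        print(providers[0], "OK")
    except Exception as e:
        print(providers[0], "failed:", e)
```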
@pranavsharma The decoder attention op is NOT supported on CPU yet. For this model, we need to use the GPU build of ONNX Runtime.