disaggregated_serving_bls: CPU usage
I followed the guidance in https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/disaggregated_serving to create all the models based on qwen2.5-14b, with tp_size=2 for both the context and generation models. Everything looks fine while tritonserver starts up. However, sending a request with inflight_batcher_llm/client/end_to_end_grpc_client.py fails.
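For reference, the invocation was roughly as below; the flag names are my assumption based on the client's usual options, so check --help for the exact interface:

```bash
# Hypothetical invocation -- flag names are assumptions, not verified
# against this checkout; run the client with --help to confirm.
python3 inflight_batcher_llm/client/end_to_end_grpc_client.py \
    -u localhost:8001 \
    -p "What is the capital of France?" \
    -o 64
```

The client then prints: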
```
Received an error from server:
in ensemble 'ensemble', Failed to process the request(s) for model 'disaggregated_serving_bls_0_0', message: TritonModelException: Context model context failed with error: Context-only and generation-only requests are NOT currently supported in orchestrator mode. (/tmp/tritonbuild/tensorrtllm/inflight_batcher_llm/src/utils.cc:647)
1 0x7ff2e008e881 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x13881) [0x7ff2e008e881]
2 0x7ff2e0098001 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1d001) [0x7ff2e0098001]
3 0x7ff2e00a3744 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x28744) [0x7ff2e00a3744]
4 0x7ff2e00920d5 TRITONBACKEND_ModelInstanceExecute + 101
5 0x7ff2f03070b4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a70b4) [0x7ff2f03070b4]
6 0x7ff2f030742b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a742b) [0x7ff2f030742b]
7 0x7ff2f0425ccd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2c5ccd) [0x7ff2f0425ccd]
8 0x7ff2f030b864 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ab864) [0x7ff2f030b864]
9 0x7ff2efbcc253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7ff2efbcc253]
10 0x7ff2ef95bac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7ff2ef95bac3]
11 0x7ff2ef9eca04 clone + 68
```
The Triton container version I am currently using is 24.10, with a v0.14.0 TensorRT-LLM engine. @kaiyux
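For what it's worth, the message comes from utils.cc:647, where the backend rejects context-only and generation-only requests whenever the executor is running in orchestrator mode, so I assume the context and generation models need to run as explicit MPI participants (leader mode) instead. The sketch below shows the kind of per-model config.pbtxt parameters that seem to be involved; the keys participant_ids and gpu_device_ids are taken from the backend's config template, but the rank/GPU values are purely illustrative for my tp_size=2 setup, not a verified fix:

```
# Hypothetical excerpt from the context model's config.pbtxt.
# Assumption: with tp_size=2 per model, MPI ranks 1-2 serve the context
# model and ranks 3-4 the generation model, with rank 0 as the leader;
# the values must match however mpirun launches tritonserver.
parameters: {
  key: "participant_ids"
  value: { string_value: "1,2" }
}
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "0,1" }
}
```

My understanding is that when no participants are specified, the backend spawns worker processes itself (orchestrator mode), which might be why the error above is triggered here.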