disaggregated_serving_bls: CPU usage
I followed the guidance in https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/disaggregated_serving to create all the models based on qwen2.5-14b, with tp_size=2 for both the context and generation models. Everything looks fine while tritonserver starts up. However, sending a request with inflight_batcher_llm/client/end_to_end_grpc_client.py fails.
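For reference, the invocation was roughly as below; the flag names are my assumption based on the client's usual options, so check --help for the exact interface:

```bash
# Hypothetical invocation -- flag names are assumptions, not verified
# against this checkout; run the client with --help to confirm.
python3 inflight_batcher_llm/client/end_to_end_grpc_client.py \
    -u localhost:8001 \
    -p "What is the capital of France?" \
    -o 64
```

The client then prints: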
```
Received an error from server:
in ensemble 'ensemble', Failed to process the request(s) for model 'disaggregated_serving_bls_0_0', message: TritonModelException: Context model context failed with error: Context-only and generation-only requests are NOT currently supported in orchestrator mode. (/tmp/tritonbuild/tensorrtllm/inflight_batcher_llm/src/utils.cc:647)
1 0x7ff2e008e881 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x13881) [0x7ff2e008e881]
2 0x7ff2e0098001 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1d001) [0x7ff2e0098001]
3 0x7ff2e00a3744 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x28744) [0x7ff2e00a3744]
4 0x7ff2e00920d5 TRITONBACKEND_ModelInstanceExecute + 101
5 0x7ff2f03070b4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a70b4) [0x7ff2f03070b4]
6 0x7ff2f030742b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a742b) [0x7ff2f030742b]
7 0x7ff2f0425ccd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2c5ccd) [0x7ff2f0425ccd]
8 0x7ff2f030b864 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ab864) [0x7ff2f030b864]
9 0x7ff2efbcc253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7ff2efbcc253]
10 0x7ff2ef95bac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7ff2ef95bac3]
11 0x7ff2ef9eca04 clone + 68
```
The Triton container version I am currently using is 24.10, with a v0.14.0 TensorRT-LLM engine. @kaiyux
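For what it's worth, the message comes from utils.cc:647, where the backend rejects context-only and generation-only requests whenever the executor is running in orchestrator mode, so I assume the context and generation models need to run as explicit MPI participants (leader mode) instead. The sketch below shows the kind of per-model config.pbtxt parameters that seem to be involved; the keys participant_ids and gpu_device_ids are taken from the backend's config template, but the rank/GPU values are purely illustrative for my tp_size=2 setup, not a verified fix:

```
# Hypothetical excerpt from the context model's config.pbtxt.
# Assumption: with tp_size=2 per model, MPI ranks 1-2 serve the context
# model and ranks 3-4 the generation model, with rank 0 as the leader;
# the values must match however mpirun launches tritonserver.
parameters: {
  key: "participant_ids"
  value: { string_value: "1,2" }
}
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "0,1" }
}
```

My understanding is that when no participants are specified, the backend spawns worker processes itself (orchestrator mode), which might be why the error above is triggered here.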