tensorrtllm_backend
Triton server is running, but no response returned.
The server seems to be fine, judging by the following log.
I1212 03:29:51.067415 37860 server.cc:674]
+----------------+---------+--------+
| Model | Version | Status |
+----------------+---------+--------+
| ensemble | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
+----------------+---------+--------+
I1212 03:29:51.131831 37860 metrics.cc:810] Collecting metrics for GPU 0: Tesla V100-SXM2-32GB
I1212 03:29:51.131877 37860 metrics.cc:810] Collecting metrics for GPU 1: Tesla V100-SXM2-32GB
I1212 03:29:51.132200 37860 metrics.cc:703] Collecting CPU metrics
I1212 03:29:51.132375 37860 tritonserver.cc:2435]
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.37.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memo |
| | ry binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | triton_model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
I1212 03:29:51.451648 37860 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:10087
I1212 03:29:51.452030 37860 http_server.cc:3558] Started HTTPService at 0.0.0.0:10086
I1212 03:29:51.493032 37860 http_server.cc:187] Started Metrics Service at 0.0.0.0:10088
When I run curl -X POST localhost:10086/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}', there is no response. Why is that?
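One way to narrow this down (a stdlib-only sketch, not an official Triton client; the endpoint path, field names, and port are taken from the curl command and log above) is to send the same request with a hard timeout. If the call raises a timeout, the request is reaching the HTTP frontend but the model never answers, which points at the tensorrt_llm backend rather than the server itself; a connection error would point at the port/network instead.

```python
import json
import urllib.request

def build_generate_payload(text, max_tokens=20):
    """Build the JSON body the generate endpoint expects
    (field names taken from the curl command above)."""
    return json.dumps({
        "text_input": text,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
    }).encode("utf-8")

def generate(base_url, model, text, max_tokens=20, timeout=60):
    """POST to /v2/models/<model>/generate with a hard timeout so a hang
    raises an exception instead of blocking forever like bare curl."""
    req = urllib.request.Request(
        f"{base_url}/v2/models/{model}/generate",
        data=build_generate_payload(text, max_tokens),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    # Port taken from the log above (HTTPService at 0.0.0.0:10086).
    print(generate("http://localhost:10086", "ensemble",
                   "What is machine learning?"))
```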
@sleepwalker2017 Hi, did you solve this? I'm facing the same problem as you and have no idea what happened.
I0328 09:48:48.039005 1588 server.cc:677]
+------------------+---------+--------+
| Model | Version | Status |
+------------------+---------+--------+
| ensemble | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
| tensorrt_llm_bls | 1 | READY |
+------------------+---------+--------+
I0328 09:48:49.670322 1588 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA A100-SXM4-80GB
I0328 09:48:49.720217 1588 metrics.cc:770] Collecting CPU metrics
I0328 09:48:49.720351 1588 tritonserver.cc:2508]
+----------------------------------+----------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.43.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_ |
| | policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data |
| | parameters statistics trace logging |
| model_repository_path[0] | /tensorrtllm_backend/triton_model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+----------------------------------------------------------------------------------------+
I0328 09:48:49.722078 1588 grpc_server.cc:2519] Started GRPCInferenceService at 0.0.0.0:8001
I0328 09:48:49.722273 1588 http_server.cc:4637] Started HTTPService at 0.0.0.0:8000
I0328 09:48:49.763159 1588 http_server.cc:320] Started Metrics Service at 0.0.0.0:8002
It looks like there is nothing wrong in the server log, but curl returns no output:
curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{"text_input": "What is machine learning?", "max_tokens": 200, "bad_words": "", "stop_words": ""}'
@byshiue @schetlur-nv Is there any information that I should provide?
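Before blaming the generate endpoint, it may help to probe the standard KServe-v2 health and per-model readiness endpoints that Triton exposes. A minimal stdlib sketch (model names and port taken from the log above; any that is not READY, or any URL that times out, is the place to look):

```python
import urllib.error
import urllib.request

def ready_urls(base_url, models):
    """KServe-v2 readiness endpoints: server liveness/readiness plus
    per-model readiness for each model in the repository."""
    urls = [f"{base_url}/v2/health/live", f"{base_url}/v2/health/ready"]
    urls += [f"{base_url}/v2/models/{m}/ready" for m in models]
    return urls

def probe(base_url, models, timeout=5):
    """Return {url: True/False or error string}, so a hang or a refused
    connection is reported explicitly instead of showing up as empty output."""
    results = {}
    for url in ready_urls(base_url, models):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[url] = resp.status == 200
        except (urllib.error.URLError, OSError) as exc:
            results[url] = str(exc)
    return results

if __name__ == "__main__":
    # Port and model names taken from the log above.
    print(probe("http://localhost:8000",
                ["preprocessing", "tensorrt_llm", "postprocessing",
                 "ensemble", "tensorrt_llm_bls"]))
```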