tensorrtllm_backend
Triton server is running, but no response returned.
The server seems to be fine, judging by the following log.
I1212 03:29:51.067415 37860 server.cc:674]
+----------------+---------+--------+
| Model | Version | Status |
+----------------+---------+--------+
| ensemble | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
+----------------+---------+--------+
I1212 03:29:51.131831 37860 metrics.cc:810] Collecting metrics for GPU 0: Tesla V100-SXM2-32GB
I1212 03:29:51.131877 37860 metrics.cc:810] Collecting metrics for GPU 1: Tesla V100-SXM2-32GB
I1212 03:29:51.132200 37860 metrics.cc:703] Collecting CPU metrics
I1212 03:29:51.132375 37860 tritonserver.cc:2435]
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.37.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memo |
| | ry binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | triton_model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
I1212 03:29:51.451648 37860 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:10087
I1212 03:29:51.452030 37860 http_server.cc:3558] Started HTTPService at 0.0.0.0:10086
I1212 03:29:51.493032 37860 http_server.cc:187] Started Metrics Service at 0.0.0.0:10088
When I run curl -X POST localhost:10086/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}', there is no response. Why is that?
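One way to narrow this down (a stdlib-only sketch, not an official Triton client; the endpoint path, field names, and port are taken from the curl command and log above) is to send the same request with a hard timeout. If the call raises a timeout, the request is reaching the HTTP frontend but the model never answers, which points at the tensorrt_llm backend rather than the server itself; a connection error would point at the port/network instead.

```python
import json
import urllib.request

def build_generate_payload(text, max_tokens=20):
    """Build the JSON body the generate endpoint expects
    (field names taken from the curl command above)."""
    return json.dumps({
        "text_input": text,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
    }).encode("utf-8")

def generate(base_url, model, text, max_tokens=20, timeout=60):
    """POST to /v2/models/<model>/generate with a hard timeout so a hang
    raises an exception instead of blocking forever like bare curl."""
    req = urllib.request.Request(
        f"{base_url}/v2/models/{model}/generate",
        data=build_generate_payload(text, max_tokens),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    # Port taken from the log above (HTTPService at 0.0.0.0:10086).
    print(generate("http://localhost:10086", "ensemble",
                   "What is machine learning?"))
```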
@sleepwalker2017 Hi, did you solve this? I'm facing the same problem as you and have no idea what happened.
I0328 09:48:48.039005 1588 server.cc:677]
+------------------+---------+--------+
| Model | Version | Status |
+------------------+---------+--------+
| ensemble | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
| tensorrt_llm_bls | 1 | READY |
+------------------+---------+--------+
I0328 09:48:49.670322 1588 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA A100-SXM4-80GB
I0328 09:48:49.720217 1588 metrics.cc:770] Collecting CPU metrics
I0328 09:48:49.720351 1588 tritonserver.cc:2508]
+----------------------------------+----------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.43.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_ |
| | policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data |
| | parameters statistics trace logging |
| model_repository_path[0] | /tensorrtllm_backend/triton_model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+----------------------------------------------------------------------------------------+
I0328 09:48:49.722078 1588 grpc_server.cc:2519] Started GRPCInferenceService at 0.0.0.0:8001
I0328 09:48:49.722273 1588 http_server.cc:4637] Started HTTPService at 0.0.0.0:8000
I0328 09:48:49.763159 1588 http_server.cc:320] Started Metrics Service at 0.0.0.0:8002
It looks like there is nothing wrong in the server log, but curl returns no output:
curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{"text_input": "What is machine learning?", "max_tokens": 200, "bad_words": "", "stop_words": ""}'
@byshiue @schetlur-nv Is there any information that I should provide?
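Before blaming the generate endpoint, it may help to probe the standard KServe-v2 health and per-model readiness endpoints that Triton exposes. A minimal stdlib sketch (model names and port taken from the log above; any that is not READY, or any URL that times out, is the place to look):

```python
import urllib.error
import urllib.request

def ready_urls(base_url, models):
    """KServe-v2 readiness endpoints: server liveness/readiness plus
    per-model readiness for each model in the repository."""
    urls = [f"{base_url}/v2/health/live", f"{base_url}/v2/health/ready"]
    urls += [f"{base_url}/v2/models/{m}/ready" for m in models]
    return urls

def probe(base_url, models, timeout=5):
    """Return {url: True/False or error string}, so a hang or a refused
    connection is reported explicitly instead of showing up as empty output."""
    results = {}
    for url in ready_urls(base_url, models):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[url] = resp.status == 200
        except (urllib.error.URLError, OSError) as exc:
            results[url] = str(exc)
    return results

if __name__ == "__main__":
    # Port and model names taken from the log above.
    print(probe("http://localhost:8000",
                ["preprocessing", "tensorrt_llm", "postprocessing",
                 "ensemble", "tensorrt_llm_bls"]))
```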