When there are multiple GPUs, only one GPU is used
Description: When there are multiple GPUs, only one GPU is used.
Triton Information: Container nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
To Reproduce: Follow the instructions at https://github.com/triton-inference-server/tutorials/blob/main/Popular_Models_Guide/Llama2/trtllm_guide.md
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v /root/models/Meta-Llama-3.1-8B-Instruct:/root/.cache/huggingface \
-v /mnt/data/engines:/engines \
nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
pip install git+https://github.com/triton-inference-server/[email protected]
triton import -m llama-3.1-8b-instruct --backend tensorrtllm
triton start
The model configuration file (/root/models/llama-3.1-8b-instruct/config.pbtxt) is:
backend: "python"
max_batch_size: 256
model_transaction_policy {
decoupled: True
}
input [
{
name: "text_input"
data_type: TYPE_STRING
dims: [ 1 ]
},
{
name: "decoder_text_input"
data_type: TYPE_STRING
dims: [ 1 ]
optional: true
},
{
name: "image_input"
data_type: TYPE_FP16
dims: [ 3, -1, -1 ]
optional: true
},
{
name: "max_tokens"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "bad_words"
data_type: TYPE_STRING
dims: [ -1 ]
optional: true
},
{
name: "stop_words"
data_type: TYPE_STRING
dims: [ -1 ]
optional: true
},
{
name: "end_id"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "pad_id"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "top_k"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "top_p"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "length_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "min_length"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "presence_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "frequency_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
optional: true
},
{
name: "return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_context_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_generation_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_width"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "stream"
data_type: TYPE_BOOL
dims: [ 1 ]
optional: true
},
{
name: "prompt_embedding_table"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
},
{
name: "prompt_vocab_size"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "embedding_bias_words"
data_type: TYPE_STRING
dims: [ -1 ]
optional: true
},
{
name: "embedding_bias_weights"
data_type: TYPE_FP32
dims: [ -1 ]
optional: true
},
{
name: "num_draft_tokens",
data_type: TYPE_INT32,
dims: [ 1 ]
optional: true
},
{
name: "use_draft_logits",
data_type: TYPE_BOOL,
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
}
]
output [
{
name: "text_output"
data_type: TYPE_STRING
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "context_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "generation_logits"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
},
{
name: "batch_index"
data_type: TYPE_INT32
dims: [ 1 ]
}
]
parameters: {
key: "accumulate_tokens"
value: {
string_value: "${accumulate_tokens}"
}
}
parameters: {
key: "tensorrt_llm_model_name"
value: {
string_value: "tensorrt_llm"
}
}
parameters: {
key: "tensorrt_llm_draft_model_name"
value: {
string_value: ""
}
}
parameters: {
key: "multimodal_encoders_name"
value: {
string_value: "${multimodal_encoders_name}"
}
}
instance_group [
{
count: 1
kind : KIND_GPU
gpus: [ 0 ]
},
{
count: 1
kind: KIND_GPU
gpus: [ 1 ]
}
]
I have clearly specified that it should use GPU 0 and GPU 1.
The postprocessing, preprocessing, and tensorrt_llm model configurations are left unchanged.
Expected behavior
The model should be loaded on both GPU 0 and GPU 1, and requests should be distributed between them based on load.
Here is what I got:
The model is only loaded on GPU 0.
When I run a benchmark, only GPU 0 is used as well:
Here is the Triton server log:
root@ubuntu22:/opt/tritonserver# triton start
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Reading server output...
I0927 07:14:19.053887 3017 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x73e104000000' with size 268435456"
I0927 07:14:19.060743 3017 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0927 07:14:19.060759 3017 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0927 07:14:19.290249 3017 model_lifecycle.cc:472] "loading: llama-3.1-8b-instruct:1"
I0927 07:14:19.290321 3017 model_lifecycle.cc:472] "loading: postprocessing:1"
I0927 07:14:19.290360 3017 model_lifecycle.cc:472] "loading: preprocessing:1"
I0927 07:14:19.290413 3017 model_lifecycle.cc:472] "loading: tensorrt_llm:1"
I0927 07:14:19.511929 3017 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I0927 07:14:19.511965 3017 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I0927 07:14:19.511970 3017 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I0927 07:14:19.511973 3017 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
I0927 07:14:19.530974 3017 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1048576
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1048576
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
I0927 07:14:22.499985 3017 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: llama-3.1-8b-instruct_0_0 (GPU device 0)"
I0927 07:14:22.500067 3017 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: llama-3.1-8b-instruct_1_0 (GPU device 1)"
I0927 07:14:23.339154 3017 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I0927 07:14:23.406562 3017 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I0927 07:14:24.257499 3017 model_lifecycle.cc:839] "successfully loaded 'llama-3.1-8b-instruct'"
I0927 07:14:25.953954 3017 model_lifecycle.cc:839] "successfully loaded 'preprocessing'"
I0927 07:14:26.009815 3017 model_lifecycle.cc:839] "successfully loaded 'postprocessing'"
[TensorRT-LLM][INFO] Loaded engine size: 15387 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 800.00 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 15380 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.12 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.63 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 39.50 GiB, available: 12.40 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 1429
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][WARNING] maxAttentionWindow and maxSequenceLen are too large for at least one sequence to fit in kvCache. they are reduced to 91456
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 1429
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 11.16 GiB for max tokens in paged KV cache (91456).
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
I0927 07:14:41.151772 3017 libtensorrtllm.cc:184] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_0_0"
I0927 07:14:41.152077 3017 model_lifecycle.cc:839] "successfully loaded 'tensorrt_llm'"
I0927 07:14:41.152199 3017 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0927 07:14:41.152246 3017 server.cc:631]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0927 07:14:41.152284 3017 server.cc:674]
+-----------------------+---------+--------+
| Model | Version | Status |
+-----------------------+---------+--------+
| llama-3.1-8b-instruct | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
+-----------------------+---------+--------+
I0927 07:14:41.246701 3017 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA A100-PCIE-40GB"
I0927 07:14:41.246738 3017 metrics.cc:877] "Collecting metrics for GPU 1: NVIDIA A100-PCIE-40GB"
I0927 07:14:41.252099 3017 metrics.cc:770] "Collecting CPU metrics"
I0927 07:14:41.252238 3017 tritonserver.cc:2598]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.49.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /root/models |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| model_config_name | |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0927 07:14:41.254711 3017 grpc_server.cc:2463] "Started GRPCInferenceService at 0.0.0.0:8001"
I0927 07:14:41.254940 3017 http_server.cc:4694] "Started HTTPService at 0.0.0.0:8000"
I0927 07:14:41.296047 3017 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
Hi @gyr66, thanks for your question. I believe this is because the TRT-LLM engine is built for a single GPU by default in the Triton CLI; @rmccorm4 will correct me if I'm wrong.
Hi @gyr66, thanks for raising this issue and thanks for trying the Triton CLI!
As Olga mentioned, yes, the default configs produced are currently intended as a "quickstart" path and are pre-defined as a single Triton model instance of KIND_MODEL (it can be a multi-GPU model, but only a single Triton instance). KIND_MODEL tells Triton that the backend (TRT-LLM) will handle device placement/setup as needed, for example loading a TP=2 engine on 2 GPUs within a single Triton model instance.
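In config.pbtxt terms, that corresponds to an instance group roughly of the following form (a sketch only; the exact file the CLI generates may differ in details):

instance_group [
  {
    # a single Triton instance; TRT-LLM decides device placement itself
    # (e.g. a TP=2 engine spans 2 GPUs inside this one instance)
    count: 1
    kind: KIND_MODEL
  }
]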
Serving multiple model instances requires further knowledge of the TRT-LLM backend, and it may not work exactly the same as with other backends due to its use of MPI for communication in the current implementation.
There is a guide with more comprehensive details and documentation on the various components involved in serving multiple TRT-LLM model instances; please check it out: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama_multi_instance.md#running-multiple-instances-of-llama-model-on-multiple-gpus.
Hopefully the Triton CLI-generated configs give you a good functional starting point for a single instance, which can then be tweaked by following this guide to support multi-instance serving.
CC @Tabrizian for viz
@gyr66, let us know if there's anything else we can help you with. Feel free to close this issue.
Thank you so much for your patient and detailed responses! I am wondering: if I don't use TP, could I simply start an independent server process for each GPU and place an NGINX load balancer in front? Would this be consistent with leader mode?
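Concretely, I am imagining something like the following, where each Triton process is pinned to one GPU and NGINX round-robins HTTP traffic between them (the ports, paths, and upstream name below are placeholders just to illustrate the idea):

# one Triton process per GPU, pinned via CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0 tritonserver --model-repository=/root/models \
    --http-port=8000 --grpc-port=8001 --metrics-port=8002 &
CUDA_VISIBLE_DEVICES=1 tritonserver --model-repository=/root/models \
    --http-port=9000 --grpc-port=9001 --metrics-port=9002 &

And then, inside the http block of nginx.conf, an upstream that balances requests across the two servers:

upstream triton_http {
    server 127.0.0.1:8000;
    server 127.0.0.1:9000;
}
server {
    listen 80;
    location / {
        proxy_pass http://triton_http;
    }
}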