When there are multiple GPUs, only one GPU is used
Description: When there are multiple GPUs, only one GPU is used.
Triton Information: Container nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
To Reproduce: Follow the instructions at https://github.com/triton-inference-server/tutorials/blob/main/Popular_Models_Guide/Llama2/trtllm_guide.md
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v /root/models/Meta-Llama-3.1-8B-Instruct:/root/.cache/huggingface \
-v /mnt/data/engines:/engines \
nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
pip install git+https://github.com/triton-inference-server/[email protected]
triton import -m llama-3.1-8b-instruct --backend tensorrtllm
triton start
The model configuration file (/root/models/llama-3.1-8b-instruct/config.pbtxt) is:
backend: "python"
max_batch_size: 256
model_transaction_policy {
decoupled: True
}
input [
{
name: "text_input"
data_type: TYPE_STRING
dims: [ 1 ]
},
{
name: "decoder_text_input"
data_type: TYPE_STRING
dims: [ 1 ]
optional: true
},
{
name: "image_input"
data_type: TYPE_FP16
dims: [ 3, -1, -1 ]
optional: true
},
{
name: "max_tokens"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "bad_words"
data_type: TYPE_STRING
dims: [ -1 ]
optional: true
},
{
name: "stop_words"
data_type: TYPE_STRING
dims: [ -1 ]
optional: true
},
{
name: "end_id"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "pad_id"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "top_k"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "top_p"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "length_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "min_length"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "presence_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "frequency_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
optional: true
},
{
name: "return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_context_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_generation_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_width"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "stream"
data_type: TYPE_BOOL
dims: [ 1 ]
optional: true
},
{
name: "prompt_embedding_table"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
},
{
name: "prompt_vocab_size"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "embedding_bias_words"
data_type: TYPE_STRING
dims: [ -1 ]
optional: true
},
{
name: "embedding_bias_weights"
data_type: TYPE_FP32
dims: [ -1 ]
optional: true
},
{
name: "num_draft_tokens",
data_type: TYPE_INT32,
dims: [ 1 ]
optional: true
},
{
name: "use_draft_logits",
data_type: TYPE_BOOL,
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
}
]
output [
{
name: "text_output"
data_type: TYPE_STRING
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "context_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "generation_logits"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
},
{
name: "batch_index"
data_type: TYPE_INT32
dims: [ 1 ]
}
]
parameters: {
key: "accumulate_tokens"
value: {
string_value: "${accumulate_tokens}"
}
}
parameters: {
key: "tensorrt_llm_model_name"
value: {
string_value: "tensorrt_llm"
}
}
parameters: {
key: "tensorrt_llm_draft_model_name"
value: {
string_value: ""
}
}
parameters: {
key: "multimodal_encoders_name"
value: {
string_value: "${multimodal_encoders_name}"
}
}
instance_group [
{
count: 1
kind : KIND_GPU
gpus: [ 0 ]
},
{
count: 1
kind: KIND_GPU
gpus: [ 1 ]
}
]
I have clearly specified that it should use GPU 0 and GPU 1.
The postprocessing, preprocessing, and tensorrt_llm model configurations are left unchanged.
Expected behavior
The model should be loaded on both GPU 0 and GPU 1, and requests should be distributed between them based on load.
Here is what I got:
The model is only loaded on GPU 0.
When I run a benchmark, only GPU 0 is used as well:
Here is the Triton server log:
root@ubuntu22:/opt/tritonserver# triton start
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Reading server output...
I0927 07:14:19.053887 3017 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x73e104000000' with size 268435456"
I0927 07:14:19.060743 3017 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0927 07:14:19.060759 3017 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0927 07:14:19.290249 3017 model_lifecycle.cc:472] "loading: llama-3.1-8b-instruct:1"
I0927 07:14:19.290321 3017 model_lifecycle.cc:472] "loading: postprocessing:1"
I0927 07:14:19.290360 3017 model_lifecycle.cc:472] "loading: preprocessing:1"
I0927 07:14:19.290413 3017 model_lifecycle.cc:472] "loading: tensorrt_llm:1"
I0927 07:14:19.511929 3017 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I0927 07:14:19.511965 3017 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I0927 07:14:19.511970 3017 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I0927 07:14:19.511973 3017 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
I0927 07:14:19.530974 3017 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1048576
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1048576
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
I0927 07:14:22.499985 3017 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: llama-3.1-8b-instruct_0_0 (GPU device 0)"
I0927 07:14:22.500067 3017 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: llama-3.1-8b-instruct_1_0 (GPU device 1)"
I0927 07:14:23.339154 3017 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I0927 07:14:23.406562 3017 python_be.cc:1923] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I0927 07:14:24.257499 3017 model_lifecycle.cc:839] "successfully loaded 'llama-3.1-8b-instruct'"
I0927 07:14:25.953954 3017 model_lifecycle.cc:839] "successfully loaded 'preprocessing'"
I0927 07:14:26.009815 3017 model_lifecycle.cc:839] "successfully loaded 'postprocessing'"
[TensorRT-LLM][INFO] Loaded engine size: 15387 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 800.00 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 15380 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.12 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.63 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 39.50 GiB, available: 12.40 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 1429
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][WARNING] maxAttentionWindow and maxSequenceLen are too large for at least one sequence to fit in kvCache. they are reduced to 91456
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 1429
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 11.16 GiB for max tokens in paged KV cache (91456).
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
I0927 07:14:41.151772 3017 libtensorrtllm.cc:184] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_0_0"
I0927 07:14:41.152077 3017 model_lifecycle.cc:839] "successfully loaded 'tensorrt_llm'"
I0927 07:14:41.152199 3017 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0927 07:14:41.152246 3017 server.cc:631]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0927 07:14:41.152284 3017 server.cc:674]
+-----------------------+---------+--------+
| Model | Version | Status |
+-----------------------+---------+--------+
| llama-3.1-8b-instruct | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
+-----------------------+---------+--------+
I0927 07:14:41.246701 3017 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA A100-PCIE-40GB"
I0927 07:14:41.246738 3017 metrics.cc:877] "Collecting metrics for GPU 1: NVIDIA A100-PCIE-40GB"
I0927 07:14:41.252099 3017 metrics.cc:770] "Collecting CPU metrics"
I0927 07:14:41.252238 3017 tritonserver.cc:2598]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.49.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /root/models |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| model_config_name | |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0927 07:14:41.254711 3017 grpc_server.cc:2463] "Started GRPCInferenceService at 0.0.0.0:8001"
I0927 07:14:41.254940 3017 http_server.cc:4694] "Started HTTPService at 0.0.0.0:8000"
I0927 07:14:41.296047 3017 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
Hi @gyr66, thanks for your question. I believe this is because the TRT-LLM engine is built for a single GPU by default in the Triton CLI; @rmccorm4 will correct me if I'm wrong.
Hi @gyr66, thanks for raising this issue and thanks for trying the Triton CLI!
As Olga mentioned, yes, the default configs produced are currently intended as a "quickstart" path and are pre-defined as a single Triton model instance of KIND_MODEL (it can be a multi-GPU model, but only a single Triton instance). KIND_MODEL tells Triton that the backend (TRT-LLM) will handle device placement/setup as needed, for example loading a TP=2 engine on 2 GPUs within a single Triton model instance.
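In config.pbtxt terms, that corresponds to an instance group roughly of the following form (a sketch only; the exact file the CLI generates may differ in details):

instance_group [
  {
    # a single Triton instance; TRT-LLM decides device placement itself
    # (e.g. a TP=2 engine spans 2 GPUs inside this one instance)
    count: 1
    kind: KIND_MODEL
  }
]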
Serving multiple model instances requires further knowledge of the TRT-LLM backend, and it may not work exactly the same as with other backends due to its use of MPI for communication in the current implementation.
There is a guide with more comprehensive details and documentation on the various components involved in serving multiple TRT-LLM model instances; please check it out: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama_multi_instance.md#running-multiple-instances-of-llama-model-on-multiple-gpus.
Hopefully the Triton CLI-generated configs give you a good functional starting point for a single instance, which can then be tweaked by following this guide to support multi-instance serving.
CC @Tabrizian for viz
@gyr66, let us know if there's anything else we can help you with. Feel free to close this issue.
Thank you so much for your patient and detailed responses! I am wondering: if I don't use TP, could I simply start an independent server process for each GPU and place an NGINX load balancer in front? Would this be consistent with leader mode?
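Concretely, I am imagining something like the following, where each Triton process is pinned to one GPU and NGINX round-robins HTTP traffic between them (the ports, paths, and upstream name below are placeholders just to illustrate the idea):

# one Triton process per GPU, pinned via CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0 tritonserver --model-repository=/root/models \
    --http-port=8000 --grpc-port=8001 --metrics-port=8002 &
CUDA_VISIBLE_DEVICES=1 tritonserver --model-repository=/root/models \
    --http-port=9000 --grpc-port=9001 --metrics-port=9002 &

And then, inside the http block of nginx.conf, an upstream that balances requests across the two servers:

upstream triton_http {
    server 127.0.0.1:8000;
    server 127.0.0.1:9000;
}
server {
    listen 80;
    location / {
        proxy_pass http://triton_http;
    }
}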