TensorRT-LLM
High memory consumption for `ModelRunnerCpp` combined with `gather_all_token_logits`
System Info
- H100 DGX
- CUDA 12.1
- TensorRT-LLM 0.10.0.dev2024041600
Who can help?
@byshiue
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
When upgrading from TensorRT-LLM version 0.7.1, we noticed that `ModelRunnerCpp` consumes a large amount of memory in combination with `gather_all_token_logits`. This was not the case in version 0.7.1, and furthermore it is not the case for the Python runner.
Have a look at the following steps to reproduce our findings:
python convert_checkpoint.py --model_dir ./falcon/7b-instruct --dtype bfloat16 --output_dir ./falcon/7b-instruct/trt_ckpt/bf16/1-gpu/
trtllm-build --checkpoint_dir ./falcon/7b-instruct/trt_ckpt/bf16/1-gpu/ --gemm_plugin bfloat16 --remove_input_padding enable --gpt_attention_plugin bfloat16 --output_dir ./falcon/7b-instruct/trt_engines/bf16/1-gpu --max_batch_size 40
python ../summarize.py --test_trt_llm --hf_model_dir ./falcon/7b-instruct --engine_dir ./falcon/7b-instruct/trt_engines/bf16/1-gpu --batch_size 40
In `summarize.py`, `free_gpu_memory_fraction` had to be reduced, and the script was additionally instrumented as follows to print device memory usage:
Index: examples/summarize.py
===================================================================
diff --git a/examples/summarize.py b/examples/summarize.py
--- a/examples/summarize.py (revision 71d8d4d3dc655671f32535d6d2b60cab87f36e87)
+++ b/examples/summarize.py (date 1713771305687)
@@ -199,6 +199,7 @@
                 output_sequence_lengths=True,
                 return_dict=True,
                 medusa_choices=args.medusa_choices)
+            profiler.print_device_memory_usage("After generate")
             torch.cuda.synchronize()
 
             # Extract a list of tensors of shape beam_width x output_ids.
@@ -363,8 +364,12 @@
             max_output_len=output_len,
             max_beam_width=num_beams,
             max_attention_window_size=max_attention_window_size,
-            sink_token_length=sink_token_length)
+            sink_token_length=sink_token_length,
+            free_gpu_memory_fraction=0.5)
         runner = runner_cls.from_dir(**runner_kwargs)
+        profiler.print_device_memory_usage("After runner init")
+        vocab_size = 65024
+        print("Expected logit tensor size:", (test_token_num+output_len) * max_batch_size * vocab_size * 4 / 1024**3, " GiB")
 
         assert not (args.eval_ppl and not (runner.gather_context_logits and runner.gather_generation_logits)), \
             "PPL evaluation requires engine built with gather_all_token_logits enabled"
Expected behavior
From the build output log (same in both scenarios):
- Weights: 13.4 GiB
- Activations: 7.3 GiB

Without `gather_all_token_logits`:
- KV cache (from log): 28 GiB
- After init: 50.3 GiB (~ weights + activations + KV cache = 48.7 GiB)
- After generate: 50.4 GiB
-> this adds up

With `gather_all_token_logits`:
- Expected additional memory due to the overall logits tensor size (see code above): 9.9 GiB
Actual behavior
With `gather_all_token_logits`, the actual memory consumption is:
- KV cache (from log): 28.8 GiB
- After runner init (our instrumentation): 60.3 GiB -> 9.8 GiB additional memory from the C++ runner
- After generate (our instrumentation): 70.5 GiB -> 10.2 GiB additional memory (torch logits?)
-> this is twice as much additional memory as expected
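Putting the deltas from the instrumentation together (a quick sanity check based only on the numbers reported above):

```python
# Sanity check of the reported overhead, all values in GiB.
extra_after_runner_init = 9.8    # additional memory attributed to the C++ runner
extra_after_generate = 10.2      # additional memory attributed to torch logits
expected_logits_size = 9.9       # single logits tensor (see estimate above)

total_extra = extra_after_runner_init + extra_after_generate
print(f"total extra: {total_extra:.1f} GiB "
      f"(~{total_extra / expected_logits_size:.1f}x the expected logits size)")
```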
Additional notes
Overall, our assessment is that the logits tensor seems to be allocated both in the C++ code and additionally in torch: https://github.com/NVIDIA/TensorRT-LLM/blob/71d8d4d3dc655671f32535d6d2b60cab87f36e87/tensorrt_llm/runtime/model_runner_cpp.py#L356-L364
For larger vocab sizes (128k, 256k), high batch sizes, or long sequence lengths, this memory footprint makes `ModelRunnerCpp` in combination with `gather_all_token_logits` almost impossible to use.
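To illustrate the suspected pattern, here is a minimal, hypothetical sketch (not the actual `ModelRunnerCpp` code): if a full-size logits buffer already lives on the C++ side and the Python wrapper then materializes a separate torch copy of it, the peak device memory for logits roughly doubles.

```python
import torch

# Hypothetical illustration of the suspected double allocation; the buffer names
# and shapes are stand-ins, not TensorRT-LLM internals.
batch_size, total_len, vocab_size = 40, 1023, 65024

# Stand-in for the logits buffer owned by the C++ runtime (~9.9 GiB in fp32).
cpp_side_logits = torch.empty(batch_size, total_len, vocab_size,
                              dtype=torch.float32, device="cuda")

# Stand-in for the additional torch tensor handed back to the caller.
torch_side_logits = cpp_side_logits.clone()

print(f"{torch.cuda.memory_allocated() / 1024**3:.1f} GiB allocated")  # ~19.8 GiB
```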
I'm seeing this issue as well.
Hi @Marks101 @vnkc1, thank you for your feedback.
This memory usage is expected. The reason logits take twice the amount of GPU memory is:
- The new logits generated in each step are not immediately copied to the output buffer, because that would result in frequent and fragmented memory copies.
- Instead, we temporarily store these logits and copy them to the output buffer at the end. The temporary logits and the final output buffer therefore amount to twice the device memory usage for logits. This memory usage can be optimized.
There are currently two sets of runtimes. The relevant optimizations have been made in the new runtime, so it does not have this additional device memory overhead. The Python bindings currently still use the old runtime, which is why you observe this problem.
Since we are migrating to the new runtime, the old runtime will not be optimized anymore.
It is recommended to use the HLAPI, or to refer to gptManagerBenchmark.cpp and call GptManager directly, which use the new runtime and leverage the optimized implementation.
I see that the High-Level API does not return context logits. @yweng0828