TensorRT-LLM
High memory consumption for `ModelRunnerCpp` combined with `gather_all_token_logits`
System Info
- H100 DGX
- CUDA 12.1
- TensorRT-LLM 0.10.0.dev2024041600
Who can help?
@byshiue
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
When upgrading from TensorRT-LLM version 0.7.1, we noticed that `ModelRunnerCpp` consumes a large amount of memory in combination with `gather_all_token_logits`. This was not the case in version 0.7.1, and furthermore it is not the case for the Python runner.
Have a look at the following steps to reproduce our findings:
python convert_checkpoint.py --model_dir ./falcon/7b-instruct --dtype bfloat16 --output_dir ./falcon/7b-instruct/trt_ckpt/bf16/1-gpu/
trtllm-build --checkpoint_dir ./falcon/7b-instruct/trt_ckpt/bf16/1-gpu/ --gemm_plugin bfloat16 --remove_input_padding enable --gpt_attention_plugin bfloat16 --output_dir ./falcon/7b-instruct/trt_engines/bf16/1-gpu --max_batch_size 40
python ../summarize.py --test_trt_llm --hf_model_dir ./falcon/7b-instruct --engine_dir ./falcon/7b-instruct/trt_engines/bf16/1-gpu --batch_size 40
In `summarize.py`, `free_gpu_memory_fraction` had to be reduced, and the script was additionally instrumented as follows to print device memory usage:
Index: examples/summarize.py
===================================================================
diff --git a/examples/summarize.py b/examples/summarize.py
--- a/examples/summarize.py (revision 71d8d4d3dc655671f32535d6d2b60cab87f36e87)
+++ b/examples/summarize.py (date 1713771305687)
@@ -199,6 +199,7 @@
                 output_sequence_lengths=True,
                 return_dict=True,
                 medusa_choices=args.medusa_choices)
+            profiler.print_device_memory_usage("After generate")
             torch.cuda.synchronize()
 
             # Extract a list of tensors of shape beam_width x output_ids.
@@ -363,8 +364,12 @@
             max_output_len=output_len,
             max_beam_width=num_beams,
             max_attention_window_size=max_attention_window_size,
-            sink_token_length=sink_token_length)
+            sink_token_length=sink_token_length,
+            free_gpu_memory_fraction=0.5)
         runner = runner_cls.from_dir(**runner_kwargs)
+        profiler.print_device_memory_usage("After runner init")
+        vocab_size = 65024
+        print("Expected logit tensor size:", (test_token_num+output_len) * max_batch_size * vocab_size * 4 / 1024**3, " GiB")
 
         assert not (args.eval_ppl and not (runner.gather_context_logits and runner.gather_generation_logits)), \
             "PPL evaluation requires engine built with gather_all_token_logits enabled"
Expected behavior
From the build output log (same in both scenarios):
- Weights: 13.4 GiB
- Activations: 7.3 GiB

Without `gather_all_token_logits`:
- KV cache (from log): 28 GiB
- After init: 50.3 GiB (~ weights + activations + KV cache = 48.7 GiB)
- After generate: 50.4 GiB
-> this adds up

With `gather_all_token_logits`:
- Expected additional memory due to the overall logits tensor size (see code above): 9.9 GiB
Actual behavior
With `gather_all_token_logits`, the actual memory consumption is:
- KV cache (from log): 28.8 GiB
- After runner init (our instrumentation): 60.3 GiB -> 9.8 GiB additional memory from the C++ runner
- After generate (our instrumentation): 70.5 GiB -> 10.2 GiB additional memory (torch logits?)
-> this is twice as much additional memory as expected
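Putting the deltas from the instrumentation together (a quick sanity check based only on the numbers reported above):

```python
# Sanity check of the reported overhead, all values in GiB.
extra_after_runner_init = 9.8    # additional memory attributed to the C++ runner
extra_after_generate = 10.2      # additional memory attributed to torch logits
expected_logits_size = 9.9       # single logits tensor (see estimate above)

total_extra = extra_after_runner_init + extra_after_generate
print(f"total extra: {total_extra:.1f} GiB "
      f"(~{total_extra / expected_logits_size:.1f}x the expected logits size)")
```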
Additional notes
Overall, our assessment is that the logits tensor seems to be allocated both in the C++ code and additionally in torch: https://github.com/NVIDIA/TensorRT-LLM/blob/71d8d4d3dc655671f32535d6d2b60cab87f36e87/tensorrt_llm/runtime/model_runner_cpp.py#L356-L364
For larger vocab sizes (128k, 256k), high batch sizes, or long sequence lengths, this memory footprint makes `ModelRunnerCpp` in combination with `gather_all_token_logits` almost impossible to use.
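To illustrate the suspected pattern, here is a minimal, hypothetical sketch (not the actual `ModelRunnerCpp` code): if a full-size logits buffer already lives on the C++ side and the Python wrapper then materializes a separate torch copy of it, the peak device memory for logits roughly doubles.

```python
import torch

# Hypothetical illustration of the suspected double allocation; the buffer names
# and shapes are stand-ins, not TensorRT-LLM internals.
batch_size, total_len, vocab_size = 40, 1023, 65024

# Stand-in for the logits buffer owned by the C++ runtime (~9.9 GiB in fp32).
cpp_side_logits = torch.empty(batch_size, total_len, vocab_size,
                              dtype=torch.float32, device="cuda")

# Stand-in for the additional torch tensor handed back to the caller.
torch_side_logits = cpp_side_logits.clone()

print(f"{torch.cuda.memory_allocated() / 1024**3:.1f} GiB allocated")  # ~19.8 GiB
```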
I'm seeing this issue as well.
Hi @Marks101 @vnkc1, thank you for your feedback.
This memory usage is expected. The reason logits take twice the amount of GPU memory is:
- The new logits generated in each step are not immediately copied to the output buffer, because that would result in frequent and fragmented memory copies.
- Instead, we temporarily store these logits and copy them to the output buffer at the end. The temporary logits and the final output buffer therefore amount to twice the device memory usage for logits. This memory usage can be optimized.
There are currently two sets of runtimes. The relevant optimizations have been made in the new runtime, so it does not have this additional device memory overhead. The Python bindings currently still use the old runtime, which is why you observe this problem.
Since we are migrating to the new runtime, the old runtime will not be optimized anymore.
It is recommended to use the HLAPI, or to refer to gptManagerBenchmark.cpp and call GptManager directly, which use the new runtime and leverage the optimized implementation.
I see that the High-Level API does not return context logits. @yweng0828