torch.OutOfMemoryError when trying to build TensorRT engines for Qwen2-72B(-Instruct)
Greetings, everyone.
- Our hardware configuration is a single GPU server with 4x A30 (24 GB each), running Ubuntu Server, with a general-purpose server CPU and 512 GB+ of system memory.
- We are attempting to convert the Qwen2-72B model by following the official tutorial at https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/qwen/README.md.
- The problem encountered is: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB. GPU 0 has a total capacity of 23.50 GiB of which 38.88 MiB is free. Process 2225470 has 17.96 GiB memory in use. Process 997300 has 2.92 GiB memory in use. Including non-PyTorch memory, this process has 2.54 GiB memory in use. Of the allocated memory 2.32 GiB is allocated by PyTorch, and 1.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
- My speculation: during the model compilation phase, in order to perform quantization and topology optimization (such as CBR fusion), TensorRT-LLM needs to load the entire model into a single GPU's VRAM to run an inference pass, collect model data, and perform global analysis. Obviously, an A30 with 24 GB cannot hold the Qwen2-72B weights, which results in the error above.
So could you please tell me whether my speculation is correct? If it is, is the only way to build a TensorRT-LLM engine for a large model like Qwen2-72B to find a GPU whose VRAM is large enough to hold the whole model? Or is there some approach that would let us build the engine on our 4x A30 host?
Thank you so much for any information or hint.
Hi @sdecoder, could you try to use --load_model_on_cpu?
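For example, a minimal sketch of passing the flag to the Qwen checkpoint conversion script (the model/output paths and the tensor-parallel size here are just placeholders, not taken from your setup):
# Sketch only: paths and tp_size are placeholders.
# --load_model_on_cpu keeps the Hugging Face weights in host RAM during
# checkpoint conversion instead of materializing the full model on one GPU.
python3 examples/qwen/convert_checkpoint.py \
    --model_dir ./Qwen2-72B-Instruct \
    --output_dir ./tmp/qwen2_72b/trt_ckpts/fp16/4-gpu/ \
    --dtype float16 \
    --tp_size 4 \
    --load_model_on_cpu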
Thank you so much for providing such a valuable hint! I will give it a try immediately. Hopefully it will work. :D
Thank you so much! It works! I am also really curious whether we can use the CPU to build the TensorRT plan/engine file. Consider the following example:
trtllm-build --checkpoint_dir ./tmp/baichuan_v1_13b/trt_ckpts/fp16/1-gpu/ \
--output_dir ./tmp/baichuan_v1_13b/trt_engines/fp16/1-gpu/ \
--gemm_plugin float16 \
--max_batch_size=32 \
--max_input_len=1024 \
--max_seq_len=1536
Is it possible to add an argument like "--use-cpu" so the build runs on the CPU? I have also tried trtllm-build --help, but found nothing related.
Thanks for any answer/help!
@sdecoder Do you mean the weights are too big to be stored on one GPU (26 GB > 24 GB), so you need to offload some (or all) weights to the CPU? If so, please try the weight streaming feature: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/weight-streaming.md
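A rough sketch based on the linked doc (paths are placeholders; please confirm the exact flags --weight_streaming and --gpu_weights_percent against the trtllm-build / run.py help of your TensorRT-LLM version):
# Sketch only: build the engine with weight streaming enabled (the doc
# notes this path uses TensorRT-managed weights, so the gemm plugin is
# left disabled here).
trtllm-build --checkpoint_dir ./tmp/qwen2_72b/trt_ckpts/fp16/4-gpu/ \
    --output_dir ./tmp/qwen2_72b/trt_engines/fp16/4-gpu/ \
    --gemm_plugin disable \
    --weight_streaming

# At run time, keep e.g. 50% of the weights resident in GPU memory and
# stream the rest from host memory on demand.
python3 examples/run.py \
    --engine_dir ./tmp/qwen2_72b/trt_engines/fp16/4-gpu/ \
    --tokenizer_dir ./Qwen2-72B-Instruct \
    --gpu_weights_percent 0.5 \
    --input_text "Hello" \
    --max_output_len 32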
That is exactly what I need. Thank you for pointing this out to me. :) This issue can be closed now.