GPU OOM Error When Quantizing Llama 3 8B
System Info
- CPU architecture: x86_64
- CPU/Host memory size: 64 GB (AWS g5.4xlarge)
- GPU properties:
  - GPU name: NVIDIA A10G
  - GPU memory size: 24 GB
- Libraries: TensorRT-LLM
- TensorRT-LLM version: v0.9 (tag)
- Container: nvidia/cuda:12.1.0-devel-ubuntu22.04
- NVIDIA driver version: 12.2
- OS: Ubuntu
- Additional information:
  - Instance type: AWS g5.4xlarge
  - Model being quantized: Llama-3-8b-chat-hf-instruct
  - Quantization format: w4a8_awq (4-bit weights, 8-bit activations using AWQ)
  - Max sequence length used: 128
Who can help?
@Tracin @byshiue
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Set up the environment:
  - Use an AWS g5.4xlarge instance with an NVIDIA A10G GPU (24 GB VRAM)
  - Install TensorRT-LLM v0.9 and its dependencies
- Prepare the model:
  - Download or prepare the Llama-3-8b-chat-hf model snapshots (see the download sketch after the command below)
- Run the quantization script:

      python3 ../quantization/quantize.py \
          --model_dir /path/to/Llama-3-8b-chat-hf/snapshots \
          --dtype float16 \
          --qformat w4a8_awq \
          --max_seq_length 128 \
          --output_dir /path/to/output/Llama-3-8b-chat-hf-quantised-int4-fp8-awq \
          --device cuda
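For the "Prepare the model" step, a minimal download sketch using the Hugging Face CLI is shown below. The repo id `meta-llama/Meta-Llama-3-8B-Instruct` and the local path are assumptions; substitute the checkpoint you actually quantize, and note that the Llama 3 repos on the Hub are gated and require an access token.

```bash
# Assumption: huggingface_hub with the CLI extra is installed and your account
# has been granted access to the gated Llama 3 repository.
pip install -U "huggingface_hub[cli]"
huggingface-cli login   # paste a Hugging Face token with access to the repo

# Download the full model snapshot to a local directory (path is illustrative).
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
    --local-dir /path/to/Llama-3-8b-chat-hf/snapshots
```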
Expected behavior
The model is quantized and exported successfully.
Actual behavior
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 21.99 GiB of which 17.38 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.64 GiB is allocated by PyTorch, and 29.40 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
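For reference, the allocator setting suggested in the traceback can be applied as in the sketch below (paths mirror the command above). This only mitigates fragmentation; it does not add VRAM, so it is unlikely to resolve the failure here given the discussion in the replies.

```bash
# Allocator hint from the error message; it reduces fragmentation only
# and does not create additional GPU memory.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

python3 ../quantization/quantize.py \
    --model_dir /path/to/Llama-3-8b-chat-hf/snapshots \
    --dtype float16 \
    --qformat w4a8_awq \
    --max_seq_length 128 \
    --output_dir /path/to/output/Llama-3-8b-chat-hf-quantised-int4-fp8-awq \
    --device cuda
```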
Additional notes
n/a
Can you try TensorRT-LLM v0.10.0?
How did you convert the model checkpoint to TensorRT format using convert_checkpoint.py?
Also, as far as I know, the NVIDIA A10G does not support FP8.
The reason I am sticking with v0.9.0 is that the latest Triton Inference Server image still uses that version: https://github.com/triton-inference-server/server/releases/tag/v2.47.0 @geraldstanje
Hi @ngockhanh5110,
The original HF Llama 3 8B checkpoint is about 16 GB and the w4a8_awq quantized checkpoint is about 4 GB, and quantization itself has additional intermediate memory consumption. So an A10 GPU with 24 GB is not enough to run the quantization.
You can quantize the model on an A100, then build and run the quantized checkpoint on the A10.
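A minimal sketch of that split workflow, assuming the standard TensorRT-LLM examples layout (all paths and the `--gemm_plugin`/`--max_batch_size` values are illustrative assumptions):

```bash
# On the A100: produce the w4a8_awq quantized checkpoint.
python3 ../quantization/quantize.py \
    --model_dir /path/to/Llama-3-8b-chat-hf/snapshots \
    --dtype float16 \
    --qformat w4a8_awq \
    --output_dir /path/to/quantized-ckpt

# On the A10G: build a TensorRT engine from the quantized checkpoint, then run it.
trtllm-build --checkpoint_dir /path/to/quantized-ckpt \
    --output_dir /path/to/engine \
    --gemm_plugin float16 \
    --max_batch_size 1

python3 ../run.py --engine_dir /path/to/engine \
    --tokenizer_dir /path/to/Llama-3-8b-chat-hf/snapshots \
    --max_output_len 64 \
    --input_text "Hello, how are you?"
```

Note the earlier comment in this thread that the A10G may not support FP8, so a w4a8_awq engine may still fail to build or run on that GPU even if quantization succeeds on the A100.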
Hi @QiJune, I will give it a try. Thank you.
I thought the whole process had to run on the same GPU.