GPU OOM Error When Quantizing Llama 3 8B
System Info
- CPU architecture: x86_64
- CPU/Host memory size: 64 GB (AWS g5.4xlarge)
- GPU properties:
  - GPU name: NVIDIA A10G
  - GPU memory size: 24 GB
- Libraries: TensorRT-LLM
- TensorRT-LLM version: v0.9 (tag)
- Container: nvidia/cuda:12.1.0-devel-ubuntu22.04
- NVIDIA driver version: 12.2
- OS: Ubuntu
- Additional information:
  - Instance type: AWS g5.4xlarge
  - Model being quantized: Llama-3-8b-chat-hf-instruct
  - Quantization format: w4a8_awq (4-bit weights, 8-bit activations using AWQ)
  - Max sequence length used: 128
Who can help?
@Tracin @byshiue
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Set up the environment:
  - Use an AWS g5.4xlarge instance with an NVIDIA A10G GPU (24 GB VRAM)
  - Install TensorRT-LLM v0.9 and its dependencies
- Prepare the model:
  - Download or prepare the Llama-3-8b-chat-hf model snapshots (see the download sketch after the command below)
- Run the quantization script:

      python3 ../quantization/quantize.py \
          --model_dir /path/to/Llama-3-8b-chat-hf/snapshots \
          --dtype float16 \
          --qformat w4a8_awq \
          --max_seq_length 128 \
          --output_dir /path/to/output/Llama-3-8b-chat-hf-quantised-int4-fp8-awq \
          --device cuda
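For the "Prepare the model" step, a minimal download sketch using the Hugging Face CLI is shown below. The repo id `meta-llama/Meta-Llama-3-8B-Instruct` and the local path are assumptions; substitute the checkpoint you actually quantize, and note that the Llama 3 repos on the Hub are gated and require an access token.

```bash
# Assumption: huggingface_hub with the CLI extra is installed and your account
# has been granted access to the gated Llama 3 repository.
pip install -U "huggingface_hub[cli]"
huggingface-cli login   # paste a Hugging Face token with access to the repo

# Download the full model snapshot to a local directory (path is illustrative).
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
    --local-dir /path/to/Llama-3-8b-chat-hf/snapshots
```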
Expected behavior
The model is quantized and exported successfully.
Actual behavior
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 21.99 GiB of which 17.38 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.64 GiB is allocated by PyTorch, and 29.40 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
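For reference, the allocator setting suggested in the traceback can be applied as in the sketch below (paths mirror the command above). This only mitigates fragmentation; it does not add VRAM, so it is unlikely to resolve the failure here given the discussion in the replies.

```bash
# Allocator hint from the error message; it reduces fragmentation only
# and does not create additional GPU memory.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

python3 ../quantization/quantize.py \
    --model_dir /path/to/Llama-3-8b-chat-hf/snapshots \
    --dtype float16 \
    --qformat w4a8_awq \
    --max_seq_length 128 \
    --output_dir /path/to/output/Llama-3-8b-chat-hf-quantised-int4-fp8-awq \
    --device cuda
```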
Additional notes
n/a
Can you try TensorRT-LLM v0.10.0?
How did you convert the model checkpoint to TensorRT format using convert_checkpoint.py?
Also, as far as I know, the NVIDIA A10G does not support FP8.
The reason I am sticking with v0.9.0 is that the latest Triton Inference Server image still uses that version: https://github.com/triton-inference-server/server/releases/tag/v2.47.0 @geraldstanje
Hi @ngockhanh5110,
The original HF Llama 3 8B checkpoint is about 16 GB and the w4a8_awq quantized checkpoint is about 4 GB, and quantization itself has additional intermediate memory consumption. So an A10 GPU with 24 GB is not enough to run the quantization.
You can quantize the model on an A100, then build and run the quantized checkpoint on the A10.
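A minimal sketch of that split workflow, assuming the standard TensorRT-LLM examples layout (all paths and the `--gemm_plugin`/`--max_batch_size` values are illustrative assumptions):

```bash
# On the A100: produce the w4a8_awq quantized checkpoint.
python3 ../quantization/quantize.py \
    --model_dir /path/to/Llama-3-8b-chat-hf/snapshots \
    --dtype float16 \
    --qformat w4a8_awq \
    --output_dir /path/to/quantized-ckpt

# On the A10G: build a TensorRT engine from the quantized checkpoint, then run it.
trtllm-build --checkpoint_dir /path/to/quantized-ckpt \
    --output_dir /path/to/engine \
    --gemm_plugin float16 \
    --max_batch_size 1

python3 ../run.py --engine_dir /path/to/engine \
    --tokenizer_dir /path/to/Llama-3-8b-chat-hf/snapshots \
    --max_output_len 64 \
    --input_text "Hello, how are you?"
```

Note the earlier comment in this thread that the A10G may not support FP8, so a w4a8_awq engine may still fail to build or run on that GPU even if quantization succeeds on the A100.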
Hi @QiJune, I will give it a try. Thank you.
I thought the whole process had to run on the same GPU.