
How to control out of memory error with PYTORCH_CUDA_ALLOC_CONF?

Open mahmoodn opened this issue 1 year ago • 3 comments

I am using quantize.py according to the GPT-J inference guide, and the command is:

python examples/quantization/quantize.py \
    --dtype=float16  \
    --output_dir=./model/GPTJ-6B/fp8-quantized-ammo/GPTJ-FP8-quantized \
    --model_dir=./model/GPTJ-6B/checkpoint-final/ \
    --qformat=fp8 --kv_cache_dtype=fp8

That command, however, fails with an out-of-memory error:

Calibrating batch 510
Calibrating batch 511
Quantization done. Total time used: 1068.72 s.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
Cannot export model to the model_config. The AMMO optimized model state_dict (including the quantization factors) is saved to model/GPTJ-6B/fp8-quantized-ammo/GPTJ-FP8-quantized/ammo_model.0.pth using torch.save for further inspection.
Detailed export error: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 0 has a total capacity of 9.77 GiB of which 36.06 MiB is free. Process 230131 has 9.56 GiB memory in use. Of the allocated memory 9.29 GiB is allocated by PyTorch, and 18.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/mnaderan/.local/lib/python3.10/site-packages/ammo/torch/export/model_config_export.py", line 307, in export_model_config
    for model_config in torch_to_model_config(
  File "/home/mnaderan/.local/lib/python3.10/site-packages/ammo/torch/export/model_config_export.py", line 185, in torch_to_model_config
    build_decoder_config(layer, model_metadata_config, decoder_type, dtype)
  File "/home/mnaderan/.local/lib/python3.10/site-packages/ammo/torch/export/layer_utils.py", line 944, in build_decoder_config
    config.mlp = build_mlp_config(layer, decoder_type, dtype)
  File "/home/mnaderan/.local/lib/python3.10/site-packages/ammo/torch/export/layer_utils.py", line 764, in build_mlp_config
    config.fc = build_linear_config(layer, LINEAR_COLUMN, dtype)
  File "/home/mnaderan/.local/lib/python3.10/site-packages/ammo/torch/export/layer_utils.py", line 591, in build_linear_config
    weight = torch_weight.type(dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 0 has a total capacity of 9.77 GiB of which 36.06 MiB is free. Process 230131 has 9.56 GiB memory in use. Of the allocated memory 9.29 GiB is allocated by PyTorch, and 18.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Quantized model exported to ./model/GPTJ-6B/fp8-quantized-ammo/GPTJ-FP8-quantized 

I searched for PYTORCH_CUDA_ALLOC_CONF to see how to use it. I tried different values, and even with 32 (the documented minimum is 20), set via export 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32' before running the Python command, I still get the same error.
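
For reference, a sketch of what I ran (the expandable_segments line is another documented PYTORCH_CUDA_ALLOC_CONF option from the PyTorch memory-management docs; I have not verified whether it helps in this case):

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32
# alternatively, let the allocator grow segments instead of splitting fixed-size blocks:
# export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python examples/quantization/quantize.py \
    --dtype=float16 \
    --output_dir=./model/GPTJ-6B/fp8-quantized-ammo/GPTJ-FP8-quantized \
    --model_dir=./model/GPTJ-6B/checkpoint-final/ \
    --qformat=fp8 --kv_cache_dtype=fp8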

I have a single RTX 3080 with 10 GB of memory. Any idea how to fix this? I don't know whether this is a TensorRT-LLM issue or a PyTorch issue, so any pointer would greatly help.

mahmoodn avatar Jul 17 '24 06:07 mahmoodn

Hi @mahmoodn, 10 GB of device memory is not enough to quantize the GPT-J model. Please refer to the answer to a similar issue: https://github.com/NVIDIA/TensorRT-LLM/issues/1932#issuecomment-2227560712
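
As a rough check (assuming the export step materializes the full fp16 weights on the GPU, which the traceback's weight = torch_weight.type(dtype) line suggests), GPT-J's ~6B parameters alone take roughly 11-12 GiB in fp16, which already exceeds the 10 GB on an RTX 3080, so the OOM during export is expected regardless of allocator settings:

python -c "params = 6.05e9; print(f'fp16 weights: {params * 2 / 2**30:.1f} GiB')"
# fp16 weights: 11.3 GiB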

QiJune avatar Jul 17 '24 08:07 QiJune

Thanks for the reply. Unfortunately, I don't have access to an A100 (Ampere). If there is no option to process the model in smaller chunks so as to reduce the peak GPU memory usage, that is unfortunate...

mahmoodn avatar Jul 17 '24 11:07 mahmoodn

Alternatively, are pre-quantized files publicly available for those who don't have the compute resources?

mahmoodn avatar Jul 17 '24 12:07 mahmoodn

Hi @mahmoodn, we do have plans to upload pre-quantized weights to the HF model hub in the future.

QiJune avatar Aug 04 '24 13:08 QiJune