TensorRT-LLM
Mixtral conversion OOM fix
System Info
- CPU Architecture - x86_64
- CPU/Host memory size - 330 GB (from /proc/meminfo)
- GPU name - NVIDIA H100 80GB HBM3
- GPU memory size - 81559MiB
- TensorRT-LLM branch - v0.9.0
- Container - built the official container.
Who can help?
@byshiue
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
tl;dr - Mixtral quantization fails on 2x H100 80 GB GPUs, and I propose a small fix for it in nvidia-ammo.
Hi! I've been trying to convert Mixtral to FP8 using the latest version of TensorRT-LLM, but I hit an OOM error:
Cannot export model to the model_config. The AMMO optimized model state_dict (including the quantization factors) is saved to tllm_checkpoint_mixtral_2gpu/ammo_model.0.pth using torch.save for further inspection.
Detailed export error: CUDA out of memory. Tried to allocate 1.75 GiB. GPU 0 has a total capacity of 79.11 GiB of which 166.62 MiB is free. Process 970747 has 78.93 GiB memory in use. Of the allocated memory 78.15 GiB is allocated by PyTorch, and 132.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 332, in export_tensorrt_llm_checkpoint
for tensorrt_llm_config, weights in torch_to_tensorrt_llm_checkpoint(
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 204, in torch_to_tensorrt_llm_checkpoint
build_decoder_config(layer, model_metadata_config, decoder_type, dtype)
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 1149, in build_decoder_config
config.mlp = build_moe_config(layer, decoder_type, dtype)
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 978, in build_moe_config
experts.fc, experts.proj = build_stacked_experts(module.experts, dtype)
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 892, in build_stacked_experts
experts_weight_1.weight = torch.concat(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB. GPU 0 has a total capacity of 79.11 GiB of which 166.62 MiB is free. Process 970747 has 78.93 GiB memory in use. Of the allocated memory 78.15 GiB is allocated by PyTorch, and 132.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
File "/app/tensorrt_llm/examples/quantization/quantize.py", line 52, in <module>
quantize_and_export(model_dir=args.model_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_ammo.py", line 335, in quantize_and_export
with open(f"{export_path}/config.json", "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: './tllm_checkpoint_mixtral_2gpu/config.json'
I'm launching the conversion using the official script from here:
# Quantize HF Mixtral into FP8 and export trtllm checkpoint
python ../quantization/quantize.py --model_dir ./Mixtral-8x7B-v0.1 \
--dtype float16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir ./tllm_checkpoint_mixtral_2gpu \
--calib_size 512 \
--tp_size 2
# Build trtllm engines from the trtllm checkpoint
# Enable fp8 context fmha to get further acceleration by setting `--use_fp8_context_fmha enable`
trtllm-build --checkpoint_dir ./tllm_checkpoint_mixtral_2gpu \
--output_dir ./engine_outputs \
--gemm_plugin float16 \
--strongly_typed \
--workers 2
I had 2 H100 GPUs with 81 GB of memory each; during the conversion the first GPU had 80 GB of memory allocated while the second had only 40 GB, and the conversion failed.
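As a side note, the per-GPU imbalance is easy to watch while the export runs by polling nvidia-smi from a second terminal (illustrative only, not part of the repro steps):
# Print used/total memory for every GPU once per second
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1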
Expected behavior
Successful model conversion
Actual behavior
OOM Failure
Additional notes
I've managed to fix it by going deep into nvidia-ammo. My version is:
Name: nvidia-ammo
Version: 0.9.3
I've found that in the file ammo/torch/export/layer_utils.py, in the function _build_stacked_linear, the tensor concatenation causes the OOM, so I fixed it by moving the tensors to the CPU first:
def _build_stacked_linear(experts: nn.Module, module_name, linear_type, dtype):
    config = LinearConfig(linear_type=linear_type)
    first_module = getattr(experts[0], module_name)

    # weights
    config.weight = torch.stack(
        [getattr(e, module_name).weight.detach().type(dtype).cpu() for e in experts]  # <-- added `.cpu()` here
    )
This fixed the problem, and both of my GPUs ended up using about 46 GB.
I do not have access to the nvidia-ammo codebase; hopefully this helps everyone running into this issue, and maybe some sort of .cpu() fix can be merged into nvidia-ammo for the next release?
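For anyone who wants to see the effect of the .cpu() call without digging into ammo, here is a minimal standalone sketch of the same technique; the expert count and weight shapes are made up (roughly Mixtral-sized), and this is not the actual ammo code:
import torch
import torch.nn as nn

# Hypothetical stand-ins for the per-expert linear layers (Mixtral-like shapes).
experts = nn.ModuleList(nn.Linear(4096, 14336, bias=False) for _ in range(8))

# GPU-side stacking (what OOMs during export): the stacked copy needs an extra
# num_experts * per-expert-weight-size of GPU memory on top of the model itself.
# stacked = torch.stack([e.weight.detach().half() for e in experts])

# CPU-side stacking (the workaround): each expert weight is copied to host
# memory first, so the stacked export tensor lives in RAM instead of on the GPU.
stacked = torch.stack([e.weight.detach().half().cpu() for e in experts])
print(stacked.shape, stacked.device)  # torch.Size([8, 14336, 4096]) cpu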
I am doing the same FP8 quantization, but on a Llama-2 34B model using 4x H100, and I am facing the same issue as well.
Saw a similar issue with A10G + Llama 3 8B. Also solved it with a similar trick by manually editing the source code in ammo to move tensors from GPU to CPU.
trt-llm will add the --device knob in a coming release; then you can specify --device cpu to avoid such OOM issues.
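If that lands as described, the quantization command from the repro would presumably gain one extra flag, roughly like this (sketch only; the exact flag name and behavior depend on the actual release):
# Sketch, assuming the announced --device knob is exposed by quantize.py
python ../quantization/quantize.py --model_dir ./Mixtral-8x7B-v0.1 \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ./tllm_checkpoint_mixtral_2gpu \
    --calib_size 512 \
    --tp_size 2 \
    --device cpu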