TensorRT-LLM Mixtral engine build gives CUDA OOM on 8 40GB GPUs (0.8.0 release)

System Info

p4d with 8 GPUs : NVIDIA A100 40GB x 8

Package version: tensorrt-9.2.0.post12.dev5-cp310-none-linux_x86_64.whl
[TensorRT-LLM] TensorRT-LLM version: 0.8.0

Who can help?

@byshiue

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

Installation: python -m pip install tensorrt_llm==0.8.0 --extra-index-url https://pypi.nvidia.com

Create checkpoint: python ./examples/llama/convert_checkpoint.py --model_dir ~/Mixtral-8x7B-Instruct-v0.1 --output_dir ~/checkpoints/Mixtral-8x7B-Instruct-v0.1/bf16-tp8 --dtype bfloat16 --tp_size 8 --workers 8

Expected behavior

Successful checkpoint creation

Actual behavior

CUDA Out-Of-Memory on 1 out of 8 GPUs

Additional notes

I believe a 56B model should comfortably fit on 8 x 40GB GPUs during checkpoint conversion and engine build. Why is this OOM occurring, and how can I estimate the GPU memory required to build a Mixtral engine?
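For reference, the weights-only arithmetic behind that expectation can be sketched as below. This is a rough lower bound, not a measurement: it assumes an even tensor-parallel split, uses the 56B parameter count quoted above, and ignores any temporaries that the conversion script or the engine builder allocate. The function name is hypothetical.

```python
def weights_gib_per_gpu(num_params: float, bytes_per_param: int, tp_size: int) -> float:
    """Weights-only footprint per GPU in GiB, assuming an even tensor-parallel split."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / tp_size / 2**30


# 56e9 is the parameter count quoted in this post (an assumption, not a measured value).
# bfloat16 stores 2 bytes per parameter; tp_size matches --tp_size 8.
per_gpu = weights_gib_per_gpu(56e9, 2, 8)
print(f"{per_gpu:.2f} GiB per GPU")
```

Under these assumptions the sharded weights come to roughly 13 GiB per GPU, well under 40GB, which is why the OOM during conversion is surprising.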

Mar 14 '24 19:03 vnkc1

OOM occurs at: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.8.0/examples/llama/convert_checkpoint.py#L960

torch.concat([w3, w1], dim=-2)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 1 has a total capacty of 39.39 GiB of which 98.38 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 38.75 GiB is allocated by PyTorch, and 61.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
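The 224.00 MiB request appears to be consistent with the size of one layer's concatenated w3/w1 expert weights for a single tensor-parallel rank. The sketch below checks that arithmetic. It assumes the dimensions from the public Mixtral-8x7B config (hidden_size 4096, intermediate_size 14336, 8 experts) and assumes the intermediate dimension is split evenly across the 8 ranks; the function name is hypothetical and this is not code from convert_checkpoint.py.

```python
def w3w1_concat_mib(intermediate_size: int, hidden_size: int, num_experts: int,
                    bytes_per_elem: int, tp_size: int) -> float:
    """Size in MiB of concat([w3, w1]) for all experts of one layer on one TP rank."""
    rows_per_rank = intermediate_size // tp_size  # each rank keeps a slice of the rows
    # Factor 2 because w3 and w1 have the same shape and are stacked together.
    elems = 2 * num_experts * rows_per_rank * hidden_size
    return elems * bytes_per_elem / 2**20


# Assumed Mixtral-8x7B dimensions; bfloat16 is 2 bytes per element; --tp_size 8.
size_mib = w3w1_concat_mib(14336, 4096, 8, 2, 8)
print(size_mib)  # 224.0
```

The request itself is small. The message also reports 38.75 GiB already allocated by PyTorch on that GPU, so the concat is where the device ran out rather than what filled it.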

Mar 15 '24 18:03 ghost

see #1156

Mar 17 '24 11:03 nivibilla

@mickaelseznec, here's the reproduction on 8xA100 (40GB) using the 0.9.0 release

$ python examples/llama/convert_checkpoint.py --model_dir ./Mixtral-8x22B-Instruct-v0.1 --output_dir ./ckpt --dtype float16 --tp_size 8

[TensorRT-LLM] TensorRT-LLM version: 0.9.0
Loading checkpoint shards: 100%|████████████████████| 59/59 [12:56<00:00, 13.16s/it]
Traceback (most recent call last):
in main()
in main convert_and_save_hf(args)
in convert_and_save_hf execute(args.workers, [convert_and_save_rank] * world_size, args)
in execute f(args, rank)
in convert_and_save_rank llama = LLaMAForCausalLM.from_hugging_face(
in from_hugging_face llama = convert.from_hugging_face(
in from_hugging_face weights = load_weights_from_hf(config=config,
in load_weights_from_hf weights = convert_hf_llama(
in convert_hf_llama convert_layer(l)
in convert_layer f'model.layers.{l}.block_sparse_moe.experts.w3w1.weight'] = torch.concat(

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 134.81 MiB is free. Process 33539 has 0 bytes memory in use. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 38.31 GiB is allocated by PyTorch, and 61.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expand

May 02 '24 21:05 ghost

@vnkc1 can you try switching the device map from auto to cpu, as suggested in https://github.com/NVIDIA/TensorRT-LLM/issues/1440?
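As a minimal sketch of what that change looks like at the Hugging Face level, assuming the conversion script loads the model through `transformers` (the exact call inside convert_checkpoint.py is not reproduced here, and the helper name is hypothetical):

```python
def load_hf_model_on_cpu(model_dir: str):
    """Load a Hugging Face causal LM entirely into host memory instead of onto GPUs."""
    # Lazy import so this helper can be defined without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_dir,
        device_map="cpu",    # instead of "auto", which spreads the weights over the GPUs
        torch_dtype="auto",  # keep the dtype stored in the checkpoint
    )
```

With the source weights held in host RAM, GPU memory is no longer occupied by the full Hugging Face model during conversion. The host needs enough RAM to hold the entire checkpoint.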

May 09 '24 06:05 djns99

@djns99 I cannot load the model onto the CPU because I will be running quantization calibration.

May 09 '24 16:05 ghost

I'm not sure I understand how that prevents you from loading on the CPU. If you are quantizing to FP8 (Hopper only), you should be using quantize.py. If you are quantizing to int8, only symmetric weight-only quantization is currently supported, and that quantization runs on the CPU and does not require calibration.
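To illustrate why symmetric weight-only quantization needs no calibration data, here is a minimal pure-Python sketch of the general technique for one weight row. It is not TensorRT-LLM's implementation; the function names and the per-row scale choice are assumptions for illustration. The scale is derived from the weights alone, so no activations are involved.

```python
def quantize_row_symmetric_int8(row):
    """Quantize one weight row to int8 using a single symmetric scale."""
    max_abs = max(abs(x) for x in row)
    # Map the largest magnitude to 127; guard against an all-zero row.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_row_symmetric_int8(weights)
restored = dequantize_row(q, scale)
print(q, scale)
```

Each restored value differs from the original by at most half a quantization step (scale / 2), up to floating-point rounding.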

May 09 '24 21:05 djns99