Meta-Llama-3-70B-Instruct running out of memory on 8 A100-40GB
Describe the bug
Out of memory. Tried to allocate X.XX GiB .....
Minimal reproducible example
I would guess any A100-40GB system with 8+ GPUs reproduces this:
python example_chat_completion.py
Output
```
Out of memory. Tried to allocate X.XX GiB .....
```
Runtime Environment
- Model: Meta-Llama-3-70B-Instruct
- Using via huggingface?: no
- OS: Linux
- GPU VRAM: 40 GB
- Number of GPUs: 8
- GPU Make: Nvidia
Additional context
Is there a way to reduce the memory requirement? The most obvious trick, reducing the batch size, did not prevent the OOM.
What is the best way to adapt the 70B model's 8 checkpoint shards (built for A100-80GB/H100) to, say, 16 A100-40GB GPUs?
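For context, the main memory knobs the example script exposes go through `Llama.build`; a minimal sketch of them is below (placeholder paths, illustrative values, and it still has to be launched under torchrun with 8 ranks to match the checkpoint's sharding). The per-rank KV cache scales with `max_batch_size * max_seq_len`, so the sequence length matters as much as the batch size.

```python
# Sketch of the memory-related arguments to the repo's Llama.build
# (placeholder paths, illustrative values -- not a verified 40 GB config).
# The per-rank KV cache is proportional to max_batch_size * max_seq_len.
from llama import Llama

generator = Llama.build(
    ckpt_dir="Meta-Llama-3-70B-Instruct/",                       # placeholder
    tokenizer_path="Meta-Llama-3-70B-Instruct/tokenizer.model",  # placeholder
    max_seq_len=512,     # shorter context -> smaller KV cache
    max_batch_size=1,    # already tried reducing this
)
```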
Please see this thread: https://github.com/meta-llama/llama3/issues/157#issuecomment-2110497041
https://build.nvidia.com/meta/llama3-70b?snippet_tab=Python
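The linked NVIDIA page serves the 70B model behind a hosted, OpenAI-compatible endpoint, so it can be called without fitting it locally. A rough sketch of what its Python snippet looks like (the base URL, model id, and key handling here are assumptions; copy the exact snippet from the page):

```python
# Rough sketch of the OpenAI-compatible client shown on the linked NVIDIA page;
# base_url and model id are assumptions -- use the exact values from the page.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="$NVIDIA_API_KEY",  # placeholder
)

completion = client.chat.completions.create(
    model="meta/llama3-70b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=256,
)
print(completion.choices[0].message.content)
```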
@subramen, it looks like there are more fundamental issues in adapting the 8-GPU checkpoint to any GPU count higher than 8. See the following:
```python
self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
model_parallel_size = fs_init.get_model_parallel_world_size()
self.n_local_heads = args.n_heads // model_parallel_size
self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
self.n_rep = self.n_local_heads // self.n_local_kv_heads
self.head_dim = args.dim // args.n_heads
```
https://github.com/meta-llama/llama3/blob/11817d47e1ba7a4959b025eb1ca308572e0e3963/llama/model.py#L93C1-L98C49
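To make that concrete, plugging the released 70B attention config (64 query heads, 8 KV heads) into those lines with a hypothetical model-parallel size of 16 leaves zero KV heads per rank, so the `n_rep` division fails:

```python
# Arithmetic from the lines above with the 70B config (n_heads=64, n_kv_heads=8)
# and a hypothetical model_parallel_size of 16.
n_heads, n_kv_heads = 64, 8

for model_parallel_size in (8, 16):
    n_local_heads = n_heads // model_parallel_size
    n_local_kv_heads = n_kv_heads // model_parallel_size
    print(f"MP={model_parallel_size}: "
          f"n_local_heads={n_local_heads}, n_local_kv_heads={n_local_kv_heads}")
    # MP=8  -> 8 local heads, 1 local KV head per rank (works)
    # MP=16 -> 4 local heads, 0 local KV heads, so
    #          n_rep = n_local_heads // n_local_kv_heads raises ZeroDivisionError
```

So going beyond 8 GPUs seems to require resharding with a different strategy (or replicating the KV heads across ranks), not just launching with a larger world size.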