
Meta-Llama-3-70B-Instruct running out of memory on 8 A100-40GB

Open whatdhack opened this issue 9 months ago • 4 comments

Describe the bug

Out of memory. Tried to allocate X.XX GiB .....

Minimal reproducible example

I guess any system with 8+ A100-40GB GPUs:

python example_chat_completion.py

Output


```
Out of memory. Tried to allocate X.XX GiB .....
```

Runtime Environment

  • Model: Meta-Llama-3-70B-Instruct
  • Using via huggingface?: no
  • OS: Linux
  • GPU VRAM: 40 GB
  • Number of GPUs: 8
  • GPU Make: Nvidia

Additional context

Is there a way to reduce the memory requirement? The most obvious trick, reducing the batch size, did not prevent the OOM.

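For reference, the two inference-time memory knobs the example script exposes are max_seq_len and max_batch_size; below is a minimal sketch, assuming the Llama.build signature from this repo's llama/generation.py (paths are placeholders):

```python
# Minimal sketch, assuming Llama.build as defined in llama/generation.py.
# Lowering max_seq_len and max_batch_size shrinks the KV cache and activations;
# it does not shrink the sharded model weights themselves.
from llama import Llama

generator = Llama.build(
    ckpt_dir="Meta-Llama-3-70B-Instruct/",                       # 8-way sharded checkpoint
    tokenizer_path="Meta-Llama-3-70B-Instruct/tokenizer.model",
    max_seq_len=512,     # KV-cache memory grows linearly with this
    max_batch_size=1,    # and with this
)
```

The example is normally launched with torchrun so that each of the 8 ranks loads one checkpoint shard, e.g. `torchrun --nproc_per_node 8 example_chat_completion.py --ckpt_dir Meta-Llama-3-70B-Instruct/ --tokenizer_path Meta-Llama-3-70B-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 1`.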
whatdhack · May 03 '24 00:05

What is the best way to adapt the 70B model's 8 checkpoint shards, sized for A100-80GB/H100, to a larger GPU count, say 16 A100-40GB?

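Not an answer, but to make the question concrete: a hypothetical sketch of what re-sharding one 8-way tensor-parallel shard into two 16-way shards would involve. Nothing below exists in the repo; split_shard and split_dim_for are made up, and the split axis per weight has to match the ColumnParallelLinear/RowParallelLinear layout in llama/model.py, which is the part not shown here.

```python
# Hypothetical sketch only: split one rank's state_dict into two half-size shards.
# split_dim_for maps each weight name to the dimension it is sharded on
# (output dim for column-parallel weights, input dim for row-parallel weights);
# building that mapping correctly for every tensor is the hard part.
import torch

def split_shard(
    shard: dict[str, torch.Tensor],
    split_dim_for: dict[str, int],
) -> tuple[dict[str, torch.Tensor], dict[str, torch.Tensor]]:
    first, second = {}, {}
    for name, tensor in shard.items():
        dim = split_dim_for.get(name)  # None => replicated weight (e.g. norms)
        if dim is None:
            first[name], second[name] = tensor, tensor.clone()
        else:
            a, b = tensor.chunk(2, dim=dim)  # halve along the model-parallel axis
            first[name], second[name] = a.contiguous(), b.contiguous()
    return first, second
```

Even with the right axes, the KV-head count of the 70B config limits how far this split can go (see the comment further down).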
whatdhack · May 11 '24 17:05

Please see this thread: https://github.com/meta-llama/llama3/issues/157#issuecomment-2110497041

subramen · May 15 '24 16:05

https://build.nvidia.com/meta/llama3-70b?snippet_tab=Python

dirtycomputer · Jul 19 '24 02:07

@subramen, it looks like there are more fundamental issues in adapting the 8-GPU checkpoint to any GPU count higher than 8. See the following:

```python
self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
model_parallel_size = fs_init.get_model_parallel_world_size()
self.n_local_heads = args.n_heads // model_parallel_size
self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
self.n_rep = self.n_local_heads // self.n_local_kv_heads
self.head_dim = args.dim // args.n_heads
```

https://github.com/meta-llama/llama3/blob/11817d47e1ba7a4959b025eb1ca308572e0e3963/llama/model.py#L93C1-L98C49

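For what it's worth, this is exactly where a 16-way split breaks, assuming the published 70B params (64 query heads, 8 KV heads):

```python
# Assumes the 70B config values n_heads=64, n_kv_heads=8 (grouped-query attention).
n_heads, n_kv_heads = 64, 8

for model_parallel_size in (8, 16):
    n_local_heads = n_heads // model_parallel_size        # 8, then 4
    n_local_kv_heads = n_kv_heads // model_parallel_size  # 1, then 0
    print(model_parallel_size, n_local_heads, n_local_kv_heads)

# At 16 ranks n_local_kv_heads is 0, so n_rep = n_local_heads // n_local_kv_heads
# divides by zero: the sharding assumes model_parallel_size <= n_kv_heads, i.e.
# at most 8-way model parallelism for this checkpoint unless KV heads are
# replicated across ranks (which the current code does not do).
```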
whatdhack · Aug 12 '24 16:08