wanda
GPU memory size recommended for pruning the llama2-7b-chat-hf model
Great work team!
Currently, I am pruning the Llama-2-7b-chat-hf model from Hugging Face with:

python main.py \
    --model NousResearch/Llama-2-7b-chat-hf \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type 2:4 \
    --save out/llama_7b-chat-hf/structured/wanda/

and got this error message:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 21.99 GiB of which 11.69 MiB is free. Including non-PyTorch memory, this process has 21.98 GiB memory in use. Of the allocated memory 20.84 GiB is allocated by PyTorch, and 61.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
My GPU specs are below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17    Driver Version: 525.105.17    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA L4            On  | 00000000:00:03.0 Off |                    0 |
| N/A   52C    P8    17W /  72W |      0MiB / 23034MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
I think you need at least 14 GB of GPU memory just to load the 7B model in fp16.
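A quick back-of-the-envelope check of that 14 GB figure (weights only, fp16 at 2 bytes per parameter; ~6.7B is the usual parameter count quoted for Llama-2-7B):

    # Rough weight-only memory for Llama-2-7B in fp16 (2 bytes per parameter).
    # Ignores activations, the KV cache, and any pruning buffers.
    params = 6.74e9                    # approximate Llama-2-7B parameter count
    weight_bytes = params * 2          # fp16 = 2 bytes per parameter
    print(f"{weight_bytes / 2**30:.1f} GiB")   # ~12.6 GiB (~13.5 GB), consistent with the 14 GB figure above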
@Eric-mingjie Thanks Eric. Mine has 24 GB of GPU memory. Given that at least 14 GB is used to load the model, I still have ~10 GB left on the NVIDIA L4. Are there any extra allocations taking up more memory, and can they be avoided via the arguments?
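Not a full answer, but the allocator hint from the error message itself is easy to try; a minimal sketch, assuming the variable is set before torch initializes CUDA (i.e. before importing torch, or exported in the shell before launching main.py):

    import os

    # Allocator hint suggested by the OOM message; it helps with fragmentation,
    # not with a genuine memory shortfall. Must be set before CUDA is initialized.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported after setting the variable, on purpose

If main.py exposes an --nsamples flag (check its argparse setup), lowering it from the default should also shrink the calibration buffers; see the buffer-size sketch at the end of the thread.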
Mine has 80 GB of GPU RAM ("NVIDIA A100 (and H100) GPUs in Stanage have 80GB of GPU RAM"), and I still got this error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU
complete error for reference:
torch 2.3.0
transformers 4.41.0.dev0
accelerate 0.31.0.dev0
# of gpus: 1
loading llm model mistralai/Mistral-7B-Instruct-v0.2
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:29<00:15,
use device cuda:0
pruning starts
loading calibdation data
dataset loading complete
Traceback (most recent call last):
File "/mnt/parscratch/users/acq22stk/teamproject/wanda/main.py", line 110, in
I have the same error with the Mixtral 8x7B model using 4 A6000 GPUs (48GiB memory per device).
Excuse me, have you solved this problem? I encountered the same issue.😭
I need an absurd 120 GB! torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 GiB. GPU 3 has a total capacity of 79.11 GiB of which 61.58 GiB is free. Including non-PyTorch memory, this process has 518.00 MiB memory in use. Process 10685 has 17.01 GiB memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I am also facing the same error when trying to prune Llama-3.2-1B. I have ~48 GB of VRAM on an A6000. This model only has 1B parameters, yet I still get:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 GiB. GPU 0 has a total capacity of 47.43 GiB of which 44.77 GiB is free. Including non-PyTorch memory, this process has 2.56 GiB memory in use. Of the allocated memory 2.30 GiB is allocated by PyTorch, and 1.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Any update on this OOM problem?
From my run, the failure seems to stem from prepare_calibration_input. The error I get is as follows:
Traceback (most recent call last):
File "/home/ec2-user/wanda/main.py", line 110, in <module>
main()
File "/home/ec2-user/wanda/main.py", line 69, in main
prune_wanda(args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m)
File "/home/ec2-user/wanda/lib/prune.py", line 135, in prune_wanda
inps, outs, attention_mask, position_ids = prepare_calibration_input(model, dataloader, device)
File "/home/ec2-user/wanda/lib/prune.py", line 90, in prepare_calibration_input
outs = torch.zeros_like(inps)
RuntimeError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 21.98 GiB total capacity; 16.71 GiB already allocated; 3.56 GiB free; 16.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Considering that magnitude sparsification worked just fine (but sparsegpt failed), I think the issue stems from how the C4 calibration data is materialized. Unsure what the specific GPU requirements are at this point.
Update: The allenai/c4 maintainers updated their Hugging Face dataset, and I can no longer load it using these lines in lib/data.py:
traindata = load_dataset('allenai/c4', 'allenai--c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
valdata = load_dataset('allenai/c4', 'allenai--c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')
I altered this to load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train'), which may be why I was facing this issue: it no longer loads a subset but the entirety of C4, which is over 300 GB.
I'm going to try to retrieve a subset instead.
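In case it helps, one way to pull a small calibration subset without materializing the corpus is the datasets streaming mode; a sketch (not the repo's code, and the 128-document cut-off is just an example):

    from itertools import islice
    from datasets import load_dataset

    # Stream English C4 instead of downloading the full ~300 GB corpus;
    # only the documents we actually iterate over are fetched.
    stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
    calib_docs = list(islice(stream, 128))   # take the first 128 documents
    print(calib_docs[0]["text"][:200])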
Seems like the issue had more to do with materializing a zeros_like object than with downloading the C4 dataset. Still unsure why the error pops up, but it might just be a valid OOM error.
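For what it's worth, the sizes in these OOM messages line up with the calibration buffers. A rough sketch, assuming prepare_calibration_input preallocates inps and outs of shape (nsamples, seqlen, hidden_size) in the model's dtype, with nsamples defaulting to 128 and seqlen taken from the model config (that is how I read lib/prune.py, so treat it as an assumption):

    def calib_buffer_gib(nsamples, seqlen, hidden_size, bytes_per_elem=2):
        # Size of ONE (nsamples, seqlen, hidden_size) buffer in GiB (fp16 by default).
        return nsamples * seqlen * hidden_size * bytes_per_elem / 2**30

    # Llama-2-7B style config: seqlen 4096, hidden 4096 -> 4 GiB per buffer,
    # matching the "Tried to allocate 4.00 GiB" in the traceback above.
    print(calib_buffer_gib(128, 4096, 4096))      # 4.0

    # Llama-3.2-1B with its 128K context (seqlen 131072, hidden 2048) -> 64 GiB,
    # matching the 64 GiB allocation reported earlier in the thread.
    print(calib_buffer_gib(128, 131072, 2048))    # 64.0

If that reading is right, capping model.seqlen to something like 2048, or lowering the number of calibration samples, should shrink these allocations proportionally rather than requiring a bigger GPU.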