
gpu memory size recommended for pruning the llama2-7b-chat-hf model

Open rsong0606 opened this issue 1 year ago • 11 comments

Great work team!

Currently, I am pruning on the llama2-7b-chat-hf model from hugging face.

    python main.py \
        --model NousResearch/Llama-2-7b-chat-hf \
        --prune_method wanda \
        --sparsity_ratio 0.5 \
        --sparsity_type 2:4 \
        --save out/llama_7b-chat-hf/structured/wanda/

got this error message:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 21.99 GiB of which 11.69 MiB is free. Including non-PyTorch memory, this process has 21.98 GiB memory in use. Of the allocated memory 20.84 GiB is allocated by PyTorch, and 61.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

My GPU specs are below:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.105.17    Driver Version: 525.105.17    CUDA Version: 12.0   |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA L4           On   | 00000000:00:03.0 Off |                    0 |
    | N/A   52C    P8    17W /  72W |      0MiB / 23034MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

rsong0606 avatar Apr 29 '24 19:04 rsong0606

I think you need at least 14 GB of GPU memory just to load the 7B model in fp16.
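
As a rough sanity check on that number (back-of-the-envelope only, assuming fp16 weights at 2 bytes per parameter):

    # Back-of-the-envelope: fp16 weights cost 2 bytes per parameter.
    n_params = 7e9          # ~7B parameters
    bytes_per_param = 2     # fp16
    print(f"{n_params * bytes_per_param / 1024**3:.1f} GiB")  # ~13 GiB, before any buffers or activations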

Eric-mingjie avatar Apr 30 '24 12:04 Eric-mingjie

@Eric-mingjie Thanks Eric, my GPU has 24 GB of memory. Given that at least 14 GB is used to load the model, I should still have ~10 GB left on the NVIDIA L4. Are there any extra activities taking more memory, and can they be avoided through the arguments?
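
One plausible place the remaining ~10 GB goes (a guess, assuming the pruning code keeps fp16 hidden-state buffers of shape [nsamples, seqlen, hidden] for the calibration set; the exact shapes in lib/prune.py may differ):

    # Hypothetical sizing of the calibration buffers for Llama-2-7b-chat-hf.
    # nsamples=128 is the usual wanda default; seqlen/hidden are Llama-2-7b values.
    nsamples, seqlen, hidden, bytes_fp16 = 128, 4096, 4096, 2
    buffer_gib = nsamples * seqlen * hidden * bytes_fp16 / 1024**3
    print(f"inps: {buffer_gib:.1f} GiB, inps + outs: {2 * buffer_gib:.1f} GiB")
    # ~4 GiB each, ~8 GiB for both, on top of ~13 GiB of fp16 weights

If main.py exposes an --nsamples argument (check its argparse section), lowering it should shrink these buffers proportionally; that is a guess based on the shapes above, not a confirmed fix.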

rsong0606 avatar Apr 30 '24 14:04 rsong0606

Mine has 80 GB of GPU RAM (the NVIDIA A100 and H100 GPUs in Stanage have 80 GB of GPU RAM) and I still got this error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU

complete error for reference:

    torch 2.3.0
    transformers 4.41.0.dev0
    accelerate 0.31.0.dev0

    # of gpus: 1

    loading llm model mistralai/Mistral-7B-Instruct-v0.2
    Loading checkpoint shards:  67%|██████▋   | 2/3 [00:29<00:15, ...]
    use device cuda:0
    pruning starts
    loading calibdation data
    dataset loading complete
    Traceback (most recent call last):
      File "/mnt/parscratch/users/acq22stk/teamproject/wanda/main.py", line 110, in <module>
        main()
      File "/mnt/parscratch/users/acq22stk/teamproject/wanda/main.py", line 69, in main
        prune_wanda(args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m)
      File "/mnt/parscratch/users/acq22stk/teamproject/wanda/lib/prune.py", line 160, in prune_wanda
        outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask, position_ids=position_ids)[0]
      File "/users/acq22stk/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/users/acq22stk/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
        return forward_call(*args, **kwargs)
      File "/users/acq22stk/.conda/envs/prune_llm/lib/python3.9/site-packages/transformers/models/mistral/modeling_mistral.py", line 754, in forward
        hidden_states = self.input_layernorm(hidden_states)
      File "/users/acq22stk/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/users/acq22stk/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
        return forward_call(*args, **kwargs)
      File "/users/acq22stk/.conda/envs/prune_llm/lib/python3.9/site-packages/transformers/models/mistral/modeling_mistral.py", line 85, in forward
        hidden_states = hidden_states.to(torch.float32)
    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU
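
For what it's worth, the numbers here are consistent with the calibration sequence length being taken from Mistral's 32768-token config (an assumption about this run, not something verified from the logs):

    # Plausible accounting for Mistral-7B-Instruct-v0.2, assuming
    # seqlen = 32768 (config.max_position_embeddings) and hidden_size = 4096.
    nsamples, seqlen, hidden = 128, 32768, 4096
    fp32_cast_mib = 1 * seqlen * hidden * 4 / 1024**2            # one sample upcast to fp32 in RMSNorm
    buffers_gib = 2 * nsamples * seqlen * hidden * 2 / 1024**3   # inps + outs buffers in fp16
    print(f"fp32 upcast per layer call: {fp32_cast_mib:.0f} MiB")  # 512 MiB, the failed allocation
    print(f"calibration buffers:        {buffers_gib:.0f} GiB")    # ~64 GiB

Those buffers plus ~13.5 GiB of fp16 weights would already roughly fill an 80 GiB card, which would explain why even the final 512 MiB upcast fails.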

kast424 avatar May 08 '24 19:05 kast424

I have the same error with the Mixtral 8x7B model using 4 A6000 GPUs (48GiB memory per device).

nehaprakriya avatar Aug 06 '24 22:08 nehaprakriya

Excuse me, have you solved this problem? I encountered the same issue.😭

wrsIt avatar Dec 15 '24 07:12 wrsIt

I apparently need an exaggerated 120 GB!

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 GiB. GPU 3 has a total capacity of 79.11 GiB of which 61.58 GiB is free. Including non-PyTorch memory, this process has 518.00 MiB memory in use. Process 10685 has 17.01 GiB memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

KangkangStu avatar Jan 08 '25 02:01 KangkangStu

I am also facing the same error when trying to prune Llama-3.2-1B. I have ~48 GB of VRAM on an A6000. The model has only 1B parameters, yet I still get:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 GiB. GPU 0 has a total capacity of 47.43 GiB of which 44.77 GiB is free. Including non-PyTorch memory, this process has 2.56 GiB memory in use. Of the allocated memory 2.30 GiB is allocated by PyTorch, and 1.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
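
The 64 GiB figure lines up with a calibration buffer sized from Llama-3.2's 131072-token context window (this assumes the pruning code takes seqlen from config.max_position_embeddings, which I have not verified for this run):

    # Hypothetical arithmetic for Llama-3.2-1B: hidden_size = 2048,
    # max_position_embeddings = 131072, 128 calibration samples, fp16.
    nsamples, seqlen, hidden, bytes_fp16 = 128, 131072, 2048, 2
    print(f"{nsamples * seqlen * hidden * bytes_fp16 / 1024**3:.0f} GiB")  # 64 GiB

The same formula with hidden_size = 4096 gives 128 GiB, which matches the earlier comment above. If this is what is happening, capping the calibration sequence length (e.g. setting model.seqlen to something like 2048 after loading, if the loader reads that attribute) should shrink the buffers dramatically; treat that as a guess rather than a confirmed fix.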

shubham-maindola avatar Feb 19 '25 17:02 shubham-maindola

Any update on this OOM problem?

junzhang-zj avatar Feb 27 '25 13:02 junzhang-zj

From my run, the failure seems to stem from prepare_calibration_input. The error I see is as follows:

Traceback (most recent call last):
  File "/home/ec2-user/wanda/main.py", line 110, in <module>
    main()
  File "/home/ec2-user/wanda/main.py", line 69, in main
    prune_wanda(args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m)
  File "/home/ec2-user/wanda/lib/prune.py", line 135, in prune_wanda
    inps, outs, attention_mask, position_ids = prepare_calibration_input(model, dataloader, device)
  File "/home/ec2-user/wanda/lib/prune.py", line 90, in prepare_calibration_input
    outs = torch.zeros_like(inps)
RuntimeError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 21.98 GiB total capacity; 16.71 GiB already allocated; 3.56 GiB free; 16.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Considering that magnitude sparsification worked just fine (but sparsegpt failed), I think the issue stems from how the "c4" calibration data is materialized. Unsure what the specific GPU requirements are at this point.
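
The specific numbers in that traceback also fit a simple accounting (assumptions: ~6.7B fp16 weights already on the GPU and an inps buffer of shape [128, 4096, 4096] in fp16):

    # Rough accounting consistent with the traceback above (assumed shapes).
    weights_gib = 6.7e9 * 2 / 1024**3              # ~12.5 GiB of fp16 weights
    inps_gib = 128 * 4096 * 4096 * 2 / 1024**3     # 4.0 GiB, already allocated
    print(f"already allocated: ~{weights_gib + inps_gib:.1f} GiB")  # close to the reported 16.71 GiB
    print(f"zeros_like(inps):   {inps_gib:.1f} GiB")                # the 4.00 GiB request that fails

On a ~22 GiB card that leaves only ~3.5 GiB free, so the second 4 GiB buffer cannot be placed.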

dat-adi avatar Apr 20 '25 16:04 dat-adi

Update: The C4 team updated their Hugging Face dataset, and I can't seem to load it using these lines in lib/data.py:

    traindata = load_dataset('allenai/c4', 'allenai--c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', 'allenai--c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

I altered this to load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train'), which is why I was facing this issue: it no longer loads just a subset but the entirety of C4, which is over 300 GB.

I'm going to try to retrieve a subset instead.
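
In case it is useful, one way to pull only a small calibration slice without downloading all of C4 (a sketch using the datasets streaming API; not the repo's exact loader):

    from datasets import load_dataset

    # Stream the English C4 config and take just enough documents for calibration.
    stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
    calib_texts = [example["text"] for example, _ in zip(stream, range(128))]
    print(len(calib_texts))  # 128 documents, no multi-hundred-GB download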

dat-adi avatar Apr 20 '25 16:04 dat-adi

Seems like the issue had more to do with materializing the zeros_like tensor than with downloading the C4 dataset. Still unsure why the error pops up, but it might just be a legitimate OOM.

dat-adi avatar Apr 27 '25 06:04 dat-adi