SpQR
CUDA out of memory for falcon-40b when using 40GB A100 GPUs
I've been trying to run quantization for falcon-40b on a box with 8× 40GB A100s, but I keep getting CUDA out-of-memory errors. The README states that this should be possible, unless I'm misreading this line:
It may successfully run on GPUs with 32 - 40GB for perplexity evaluation of up to LLaMA-65B and Falcon-40B models.
Here's the command I'm running:
python main.py falcon_model/models--tiiuae--falcon-40b/snapshots/c47b371b31a68349c233104050ac76680b8485db custom \
--custom_data_path=data/refined_web_n=128.pth \
--wbits 4 \
--groupsize 16 \
--perchannel \
--qq_scale_bits 3 \
--qq_zero_bits 3 \
--qq_groupsize 16 \
--outlier_threshold=0.2 \
--permutation_order act_order \
--percdamp 1e0 \
--nsamples 128
Here's the full command output:
/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/computation/expressions.py:20: UserWarning: Pandas requires version '2.7.3' or newer of 'numexpr' (version '2.7.1' currently installed).
from pandas.core.computation.check import NUMEXPR_INSTALLED
============ Loading model... ============
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: ubuntu
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4122
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: ubuntu
Local device: mlx5_0
Local port: 1
CPCs attempted: udcm
--------------------------------------------------------------------------
Loading checkpoint shards: 100%|██████████| 9/9 [00:47<00:00, 5.23s/it]
============ Quantizing model... ============
Loading data ...
Starting SPQR quantization ...
catching inputs from data
---------------- Layer 0 of 60 ----------------
layer_dev_original=device(type='cpu')
Quantizing module self_attention.query_key_value of layer 0
Quantizing module self_attention.dense of layer 0
Quantizing module mlp.dense_h_to_4h of layer 0
Quantizing module mlp.dense_4h_to_h of layer 0
Traceback (most recent call last):
  File "main.py", line 549, in <module>
    quantize_model(model, args, device)
  File "main.py", line 73, in quantize_model
    results = quantize_spqr(model, dataloader, args, device)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "main.py", line 217, in quantize_spqr
    quantized = spqr_handlers[sublayer_name].quantize(
  File "/home/ubuntu/SpQR/spqr_engine.py", line 84, in quantize
    H = H[perm][:, perm]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 39.56 GiB total capacity; 33.54 GiB already allocated; 2.80 GiB free; 35.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
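Incidentally, the error message itself suggests tuning the allocator via PYTORCH_CUDA_ALLOC_CONF. As far as I understand it, that would be set in the shell before launching, roughly like this (the 512 MB split size is just an example value I picked, not something from the SpQR README, and I don't know whether it's enough for a 40B model):

# Optional: ask PyTorch's caching allocator to limit block splitting,
# as hinted by the OOM message; value below is an arbitrary example.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512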
Is there something I'm doing wrong when launching the command?
Hello @caleb-artifact, and thank you for your interest in SpQR quantization!
Most likely you hit an excessive memory usage issue that has since been fixed; I re-tested it today. With PR #25 merged, the script works with the 40B model. Make sure you are on the latest main branch.
Please try again and see if it works on your machine. Try adding the arguments --offload_activations --skip_out_loss to further reduce memory usage.
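For example, your full command with those two flags added would look like this (everything else left exactly as you had it):

python main.py falcon_model/models--tiiuae--falcon-40b/snapshots/c47b371b31a68349c233104050ac76680b8485db custom \
--custom_data_path=data/refined_web_n=128.pth \
--wbits 4 \
--groupsize 16 \
--perchannel \
--qq_scale_bits 3 \
--qq_zero_bits 3 \
--qq_groupsize 16 \
--outlier_threshold=0.2 \
--permutation_order act_order \
--percdamp 1e0 \
--nsamples 128 \
--offload_activations \
--skip_out_loss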