AQLM
How long should quantizing a 70B model take? Mine has been running for 2 days.
Is that too long to quantize a model?
python main.py $MODEL_PATH $DATASET_PATH --nsamples=1024 \
  --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 \
  --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 \
  --finetune_batch_size=32 --local_batch_size=1 --offload_activations \
  --wandb --save $SAVE_PATH
Hello! Thank you for your interest in the project. Yes, AQLM quantization does take considerably longer to calibrate than simpler quantization methods such as GPTQ. This only affects quantization time, not inference time. Quantization time depends on your model size, hardware (the number and model of GPUs, etc.), and quantization parameters. I added more details on quantization time to the README. Hope this helps. If you have any additional questions, please feel free to ask.
Could you share an example script for quantizing a 70B model on 8×A100?
Hi!
Hope this helps:
WANDB_PROJECT="wandb_project" WANDB_NAME="wandb_name" HF_HOME="/mnt/LLM" \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OMP_NUM_THREADS=16 MKL_NUM_THREADS=16 \
python main.py meta-llama/Llama-2-70b-hf "pajama" \
  --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 \
  --nsamples=2048 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 \
  --finetune_batch_size=32 --local_batch_size=2 \
  --wandb --save="path_to_save"
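For context, this configuration (one codebook at 16 bits per codebook over groups of 8 weights) works out to roughly 16 / 8 = 2 bits per weight, plus a small overhead for scales, i.e. the usual "1x16" 2-bit AQLM setup; --nsamples and the fine-tuning settings mainly trade extra calibration time for quality.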
If you want to further improve perplexity, you can additionally run global fine-tuning after you have obtained the quantized model; see https://github.com/Vahe1994/AQLM/pull/50 for the code and https://github.com/Vahe1994/AQLM/issues/49 for an example of how to run it.
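For orientation only, a global fine-tuning launch with the script from that PR would look roughly like the sketch below. The script name (finetune.py) and the flag names are assumptions based on the repository layout, not a verified command, so please defer to issue #49 for the exact arguments.

# Hedged sketch: script and flag names are assumptions, not a verified CLI.
# See https://github.com/Vahe1994/AQLM/issues/49 for a working invocation.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python finetune.py \
  --base_model meta-llama/Llama-2-70b-hf \
  --quant_model "path_to_save" \
  --dataset "pajama" \
  --nsamples=1024 \
  --lr=1e-5 \
  --batch_size=8 --microbatch_size=1 \
  --gradient_checkpointing \
  --save "path_to_finetuned"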