Quantization Time
How long is the expected time to quantize a 7B Mistral model?
Hello! On 2 GPUs it would take approximately 6–10 hours; it depends on the hyperparameters.
I tried to quantize another variant of Mistral last week, but it was still working on layer 0 after about 4 hours on an 8 x 4090 NVLinked EPYC-class server, so I aborted it due to the projected costs.
Is it OK for it to be this slow on that configuration? Just trying to understand the optimal way of doing this.
Thanks for the fast response. I use the exact same hyperparameters as your example:

```bash
export CUDA_VISIBLE_DEVICES=0   # or e.g. 0,1,2,3
export MODEL_PATH=<PATH_TO_MODEL_ON_HUB>
export DATASET_PATH=<INSERT DATASET NAME OR PATH TO CUSTOM DATA>
export SAVE_PATH=/path/to/save/quantized/model/
export WANDB_PROJECT=MY_AQ_EXPS
export WANDB_NAME=COOL_EXP_NAME

python main.py $MODEL_PATH $DATASET_PATH --nsamples=1024 \
 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 \
 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 \
 --finetune_batch_size=32 --local_batch_size=1 --offload_activations \
 --wandb --save $SAVE_PATH
```
And after almost 24 hours on 2 A40 48 GB GPUs the script was only on layer 8.
@Vahe1994, please take a look.
Thank you for reporting this! I haven't experimented with quantization on the 4090 (by the way, eight 4090s might be overkill) or on the A40, but the processing time appears to be unusually slow. It's possible that recent code changes have caused this slowdown in the quantization process, though I'm not certain. I'll have a look at this and provide an update if I discover anything.
If you need more details or a test after code updates, I will be happy to help.
+1
Hi! Unfortunately, I could not obtain access to either a 4090 or an A40, so I conducted several experiments on A100s. I tried to quantize Llama-2 7B with the provided parameters using both the most recent commit and a commit from a month ago; both gave a full quantization time of 14.5 hours on 2 A100s, including ppl evaluation. Then I tried quantizing the Mistral-7B model with the recent commit on 2 A100s and got a slightly longer but acceptable time of around 17 hours. Note that quantization with 16-bit codebooks is much slower than with smaller codebooks, and the quantization time also depends on the relative tolerance.
Can you please try quantizing with the same config, but on one GPU, reducing the number of samples proportionally to fit in memory? This is necessary to understand whether the problem is related to inter-GPU communication or to local computation. For instance, I once encountered such a problem with a faulty GPU-to-GPU PCIe bus.
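If it helps, a rough way to sanity-check the inter-GPU link (a generic sketch using standard CUDA/PyTorch tooling, not something specific to this repo) could look like:

```bash
# Generic sanity checks for a suspect GPU-to-GPU link (illustrative only, not part of the repo).
# Show how the GPUs are wired up (PCIe vs. NVLink) and whether peer access is possible:
nvidia-smi topo -m

# Time a large cross-GPU copy with PyTorch (which should already be installed for main.py):
python - <<'EOF'
import time
import torch

assert torch.cuda.device_count() >= 2, "need at least 2 visible GPUs"
x = torch.randn(256, 1024, 1024, device="cuda:0")  # ~1 GiB in fp32
torch.cuda.synchronize("cuda:0")
start = time.perf_counter()
for _ in range(10):
    x.to("cuda:1")
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
gib = 10 * x.numel() * x.element_size() / 2**30
print(f"cuda:0 -> cuda:1 copy: {gib / (time.perf_counter() - start):.1f} GiB/s")
EOF
```

A healthy PCIe Gen4 x16 or NVLink connection should report tens of GiB/s; a number far below that would point at the bus rather than at local computation.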
Now I'm trying with a single H100 / Intel Xeon Platinum 8468 (192) / 2 TB RAM, and getting:

```
Saving layer 0... to model-name-masked/0.pth
{'layer_time': 3053.6494579315186, 'out_loss': 0.04229441657662392, 'Step': 0}
```
or about 50 min per layer.
So, if I extrapolate, I predict ~26.5 hours of quantizing, and that's... not what I expected :(
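For reference, the back-of-the-envelope math behind that estimate (assuming Mistral-7B's 32 decoder layers):

```
3053.6 s/layer × 32 layers ≈ 97,700 s ≈ 27.1 h total
3053.6 s/layer × 31 remaining layers ≈ 26.3 h still to go after layer 0
```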
Launch flags:

```bash
python main.py $MODEL_NAME $DATASET_NAME --nsamples=1024 \
 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 \
 --local_batch_size=1 --save ${MODEL_NAME}-AQLM --dtype bfloat16 \
 --beam_size 1 --max_epochs 100 --relative_mse_tolerance=0.01 \
 --finetune_max_epochs 0 --offload_activations
```
I was able to complete the quantization of the Mistral 7B variant mentioned previously on a top-tier GH200 in a shocking 22 hours. That can't be right, can it?
P.S. It seems to struggle the most in the 'initializing with kmeans:' stage, spending about 5 minutes on the 'mlp.*' sublayers.
Hello, @iamwavecut! As I mentioned earlier, using two A100 GPUs, the quantization time for the Mistral-7B model is approximately 17 hours. In light of this, the ~26 hours you report on a single H100 seems reasonable. Although quantization with AQLM is relatively time-consuming, it should be noted that this is a one-time process.
To speed up the quantization process, consider using multiple GPUs in parallel (the provided code supports using multiple GPUs for a single model). If you want to reduce the quantization time further and are willing to make a potential compromise on perplexity (ppl), you can adjust quantization parameters such as nsamples, relative_mse_tolerance, finetune_relative_mse_tolerance, nbits_per_codebook, init_max_iter, init_max_points_per_centroid, etc. Hope this helps. If you have any additional questions, please feel free to ask.
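For illustration only, a cheaper run could look something like the sketch below; the values are arbitrary placeholders meant to show which knobs to turn, not tuned recommendations:

```bash
# Illustrative sketch of a faster (but potentially lower-quality) configuration.
# All values below are placeholders, not recommendations; tune them for your model and quality target.
python main.py $MODEL_PATH $DATASET_PATH --nsamples=512 \
 --num_codebooks=1 --nbits_per_codebook=15 --in_group_size=8 \
 --relative_mse_tolerance=0.02 --finetune_relative_mse_tolerance=0.002 \
 --finetune_batch_size=32 --local_batch_size=1 --offload_activations \
 --save $SAVE_PATH
```

Fewer calibration samples, a smaller codebook, and looser tolerances all shorten the run at some cost in final ppl.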
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.