
Quantization Time

DRXD1000 opened this issue 1 year ago • 11 comments

How long is the expected time to quantize a 7B Mistral model?

DRXD1000 avatar Feb 24 '24 17:02 DRXD1000

Hello! On 2 GPUs it would take approximately 6–10 hours; it depends on the hyperparameters.

Vahe1994 avatar Feb 26 '24 11:02 Vahe1994

I tried to quantize another Mistral variant last week, but it was still working on layer 0 after about 4 hours on an 8×4090 NVLinked EPYC-class server, so I aborted it due to the projected costs.

Is it expected to be this slow with the configuration mentioned above? I'm just trying to understand the optimal way of doing this.

iamwavecut avatar Feb 26 '24 13:02 iamwavecut

> Hello! On 2 GPUs it would take approximately 6–10 hours; it depends on the hyperparameters.

Thanks for the fast response. I use the exact same hyperparameters as your example:

```bash
export CUDA_VISIBLE_DEVICES=0  # or e.g. 0,1,2,3
export MODEL_PATH=<PATH_TO_MODEL_ON_HUB>
export DATASET_PATH=<INSERT DATASET NAME OR PATH TO CUSTOM DATA>
export SAVE_PATH=/path/to/save/quantized/model/
export WANDB_PROJECT=MY_AQ_EXPS
export WANDB_NAME=COOL_EXP_NAME

python main.py $MODEL_PATH $DATASET_PATH --nsamples=1024 \
    --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 \
    --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 \
    --finetune_batch_size=32 --local_batch_size=1 --offload_activations \
    --wandb --save $SAVE_PATH
```

And after almost 24 hours on two A40 48 GB GPUs, the script was only on layer 8.

DRXD1000 avatar Feb 26 '24 16:02 DRXD1000

@Vahe1994, please take a look.

iamwavecut avatar Feb 26 '24 16:02 iamwavecut

Thank you for reporting this! I haven't experimented with quantization on a 4090 (by the way, 8×4090 might be overkill) or on an A40, but the processing time appears to be unusually slow. It's possible that recent code changes have caused this slowdown in the quantization process, though I'm not certain. I'll have a look at this and provide an update if I discover anything.

Vahe1994 avatar Feb 26 '24 17:02 Vahe1994

If you need more details or a test after code updates, I will be happy to help.

DRXD1000 avatar Feb 26 '24 17:02 DRXD1000

+1

iamwavecut avatar Feb 26 '24 20:02 iamwavecut

Hi! Unfortunately, I could not get access to either a 4090 or an A40, so I conducted several experiments on A100s. I tried to quantize Llama-2 7B with the provided parameters, on both the recent commit and a commit from a month ago; both gave a full quantization time of 14.5 hours on 2 A100s, including ppl evaluation. Then I quantized the Mistral-7B model with the recent commit on 2 A100s and got a slightly longer but still acceptable time of around 17 hours. Note that quantization with 16-bit codebooks is much slower than with smaller codebooks, and the quantization time also depends on the relative tolerance.

Can you please try quantizing with the same config, but on one GPU, reducing the number of samples proportionally to fit in memory? This is necessary to understand whether the problem is related to inter-GPU communication or to local computation. For instance, I once encountered such a problem with a faulty GPU-to-GPU PCIe bus.
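
For example, something along these lines could serve as the debugging run (a sketch only: nsamples is halved for a single card, and all other flags are taken from the earlier example in this thread):

```bash
# Hypothetical single-GPU debugging run: halve nsamples (1024 -> 512) so the
# calibration data fits on one card; the remaining flags mirror the example above.
export CUDA_VISIBLE_DEVICES=0
python main.py $MODEL_PATH $DATASET_PATH --nsamples=512 \
    --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 \
    --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 \
    --finetune_batch_size=32 --local_batch_size=1 --offload_activations \
    --save $SAVE_PATH
```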

> If you need more details or a test after code updates, I will be happy to help.

Vahe1994 avatar Feb 29 '24 18:02 Vahe1994

Now I'm trying with a single H100 / Intel Xeon Platinum 8468 (192) / 2 TB RAM, and getting:

```
Saving layer 0... to model-name-masked/0.pth
{'layer_time': 3053.6494579315186, 'out_loss': 0.04229441657662392, 'Step': 0}
```

or about 50 min per layer.

So, if I extrapolate it, I predict: ~26.5 hours of quantizing, and that's... not expected :(
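
For reference, that extrapolation assumes the remaining 31 of the model's 32 layers take roughly as long as layer 0, which may not hold exactly:

```bash
# Rough extrapolation from the layer-0 time above: 31 remaining layers at ~3053.65 s each.
awk 'BEGIN { printf "%.1f hours remaining\n", 31 * 3053.65 / 3600 }'   # ~26.3 hours
```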

Launch flags:

```bash
python main.py $MODEL_NAME $DATASET_NAME --nsamples=1024 \
    --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 \
    --local_batch_size=1 --save ${MODEL_NAME}-AQLM --dtype bfloat16 \
    --beam_size 1 --max_epochs 100 --relative_mse_tolerance=0.01 \
    --finetune_max_epochs 0 --offload_activations
```

iamwavecut avatar Feb 29 '24 20:02 iamwavecut

I was able to complete the quantization of the Mistral 7B variant mentioned previously using a top-tier GH200 within a shocking 22 hours. That can't be possible, right?

P.S. It seems to struggle the most on the 'initializing with kmeans:' stage, spending about 5 minutes at the 'mlp.*' sublayer stages.

iamwavecut avatar Mar 02 '24 22:03 iamwavecut

Hello, @iamwavecut! As I mentioned earlier, using two A100 GPUs, the quantization time for the Mistral-7B model is approximately 17 hours. In light of this, the numbers you reported for one H100 (~26 hours) seem to be OK. Although quantization with AQLM is relatively time-consuming, it should be noted that this is a one-time process.

To speed up model quantization, consider using multiple GPUs in parallel (the provided code supports the use of multiple GPUs for a single model). If you're looking to reduce the quantization time further and are willing to make a potential compromise on perplexity (ppl), you can adjust quantization parameters such as nsamples, relative_mse_tolerance, finetune_relative_mse_tolerance, nbits_per_codebook, init_max_iter, init_max_points_per_centroid, etc. (a rough sketch follows below). Hope this helps. If you have any additional questions, please feel free to ask.
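
For illustration only, here is a sketch of a faster (and potentially lower-quality) configuration along those lines. The specific values are arbitrary examples rather than recommendations, and it is assumed that init_max_iter and init_max_points_per_centroid are exposed as command-line flags under the same names as the other parameters:

```bash
# Illustrative speed-oriented sketch (not a recommendation): fewer calibration
# samples, looser tolerances, and smaller codebooks than the 1x16 setup above.
# The init_* flag names are assumed to match the parameter names listed above.
python main.py $MODEL_PATH $DATASET_PATH --nsamples=512 \
    --num_codebooks=2 --nbits_per_codebook=8 --in_group_size=8 \
    --relative_mse_tolerance=0.03 --finetune_relative_mse_tolerance=0.003 \
    --init_max_iter=50 --init_max_points_per_centroid=500 \
    --finetune_batch_size=32 --local_batch_size=1 --offload_activations \
    --save $SAVE_PATH
```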

Vahe1994 avatar Mar 03 '24 14:03 Vahe1994

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Apr 03 '24 01:04 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Apr 17 '24 01:04 github-actions[bot]