Egiazarian Vage
Hello! Hard to tell. Can you please send the logs of the run? What was the error message?
Hello! On 2 GPUs it would take approximately 6–10 hours; it depends on the hyperparameters.
Thank you for reporting this! I haven't experimented with quantization on a 4090 (by the way, 8× 4090 might be overkill) or on an A40, but the processing time appears to be...
Hi! Unfortunately, I could not obtain access to either a 4090 or an A40, so I conducted several experiments on an A100. I tried to quantize Llama-2 7B with the provided parameters with recent...
Hello, @iamwavecut! As I mentioned earlier, using two A100 GPUs, the quantization time for the Mistral-7b model is approximately 17 hours. In light of this, the reported numbers for one...
Hello! Thank you for your interest in the project. Yes indeed, AQLM quantization takes considerably longer to calibrate than simpler quantization methods such as GPTQ. This only impacts quantization time,...
Hi! Hope this helps:

```
WANDB_PROJECT="wandb_project" WANDB_NAME="wandb_name" HF_HOME="/mnt/LLM" \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OMP_NUM_THREADS=16 MKL_NUM_THREADS=16 \
python main.py meta-llama/Llama-2-70b-hf "pajama" \
  --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 \
  --nsamples=2048 \
  --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 \
  --finetune_batch_size=32 --local_batch_size=2 \
  --wandb --save="path_to_save"
```
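For reference, this configuration (`--num_codebooks=1`, `--nbits_per_codebook=16`, `--in_group_size=8`) encodes each group of 8 weights with a single 16-bit code, i.e. roughly 2 bits per weight, not counting the storage for the codebooks themselves.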
If you want to further improve perplexity, you can additionally run global fine-tuning after you have obtained the quantized model; see https://github.com/Vahe1994/AQLM/pull/50 for the code and https://github.com/Vahe1994/AQLM/issues/49 for an example of how to run...
Hi! It is, in fact, quite stuck :) I believe what you're observing is sampling degeneration: the model performs "greedy" inference, which makes it inclined to repeat...
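For illustration, here is a minimal sketch of switching from greedy decoding to sampling with the Hugging Face `transformers` `generate()` API; the model name, prompt, and sampling values are placeholders, not taken from this thread:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model name: substitute the (quantized) checkpoint you are testing.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Tell me about additive quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding (do_sample=False) tends to fall into repetition loops;
# enabling sampling usually breaks the loop.
output = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,          # sample instead of taking the argmax token
    temperature=0.7,         # illustrative value
    top_p=0.9,               # nucleus sampling
    repetition_penalty=1.1,  # optional extra guard against repetition
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```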
Hello! I believe the issue may not be related to quantization. I ran your prompt on the non-quantized Llama-2 7B and got repetition too (see images below). This is a known problem....