GPTQ-for-LLaMa
Extraneous data point
LLaMa-13B-GPTQ-4-128 is listed with a C4 score of 7.60. That seems out of place compared to the 16-, 8-, and 3-bit results. Was that a typo, perhaps intended to be 6.60 or 6.70?
I think it's probably a typo. I will test again within 24 hours and fix it.
I was extrapolating scaling laws to predict the performance of 65B-3bit, as well as HumanEval and MBPP benchmarks for the models in general, and I found the typo: you swapped the positions of 13B-4-128 and 13B-3-128. 4-bit should be 6.70 and 3-bit should be 7.60.
Or I could be entirely wrong: 7.60 is exactly the same value that appears in the 7B table. Either way, you should probably re-check 13B-3bit as well.
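For what it's worth, this is roughly how I'd do the scaling-law extrapolation. The perplexity values below are placeholders only (substitute the actual C4 numbers from the table), and fitting a power law with `np.polyfit` in log-log space is just one simple way to do it:

```python
import numpy as np

# Placeholder perplexities for illustration -- NOT the table's real numbers.
params = np.array([7e9, 13e9, 33e9])   # model sizes with published results
ppl    = np.array([8.0, 7.6, 7.0])     # hypothetical C4 perplexities

# Fit ppl ~ a * params^b in log-log space, then extrapolate to 65B.
b, log_a = np.polyfit(np.log(params), np.log(ppl), 1)
predict_65b = np.exp(log_a) * 65e9 ** b
print(f"extrapolated 65B perplexity: {predict_65b:.2f}")
```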
@qwopqwop200 Based on my projections, 65B-3bit-128 will consume 29-30 GB of RAM, which is just enough to fit on my GPU (32 GB). I'd like to compare it against 33B-4bit. Would you mind benchmarking that specific config?
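For reference, here's a back-of-envelope sketch of where a figure in that range could come from. The per-group scale/zero layout and the fp16 embedding assumption are my own guesses, not necessarily how this repo actually packs things:

```python
def gptq_weight_gb(n_params, bits, group_size, scale_bytes=2.0, zero_bits=None):
    """Rough size of GPTQ-packed weights plus per-group scales/zero-points."""
    if zero_bits is None:
        zero_bits = bits                      # assume zeros packed at the quant width
    packed = n_params * bits / 8              # packed weight bytes
    groups = n_params / group_size
    meta = groups * (scale_bytes + zero_bits / 8)
    return (packed + meta) / 1e9

weights = gptq_weight_gb(65e9, bits=3, group_size=128)   # about 25.6 GB
# fp16 token embeddings + lm_head (vocab 32000 x hidden 8192, twice) add ~1 GB,
# and the KV cache plus CUDA/runtime overhead push the total toward 29-30 GB.
print(f"packed weights + group metadata: {weights:.1f} GB")
```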
Also, since you're quantizing models with GPTQ yourself, would you mind hosting the quantized models on HuggingFace, or sending them to me some other way? I have a few specific configs in mind that don't exist on HF (not from decapoda, etc.). I could theoretically port your quantization code from CUDA to Metal, but it would take a bit of time.