GPTQ-for-LLaMa

Questions about group size

Open DanielWe2 opened this issue 1 year ago • 6 comments

From the research paper and the tables in the README it looks like group size 64 is very effective at improving the quality of the models, most noticeably in the smaller models or in the 3-bit version.

The tables suggest that group size is usable somehow, but the README also states that group size cannot be used with CUDA? But this whole project needs CUDA? I built a group-size 64 model, but I cannot run the benchmark or inference with it.

Is group size usable? If so, how?
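
For context, here is my rough mental model of what group size means, as a hedged sketch in plain PyTorch (the function name quant_dequant is mine, this is not the repo's code): instead of one quantization scale per output row, every group of e.g. 64 input weights gets its own scale and zero point, so the quantized values can follow the weight distribution more closely.

```python
import torch

def quant_dequant(W: torch.Tensor, wbits: int = 4, groupsize: int = 64) -> torch.Tensor:
    """Round-to-nearest quantize + dequantize with one scale/zero per group.

    Illustration only: GPTQ is smarter than plain rounding, but the
    group-size bookkeeping is the same idea.
    """
    maxq = 2 ** wbits - 1
    out = torch.empty_like(W)
    for s in range(0, W.shape[1], groupsize):
        g = W[:, s:s + groupsize]                                 # [out_features, group]
        wmin = g.min(dim=1, keepdim=True).values.clamp(max=0.0)
        wmax = g.max(dim=1, keepdim=True).values.clamp(min=0.0)
        scale = (wmax - wmin).clamp(min=1e-8) / maxq              # per-row scale for this group
        zero = torch.round(-wmin / scale)
        q = torch.clamp(torch.round(g / scale) + zero, 0, maxq)   # the stored 4-bit integers
        out[:, s:s + groupsize] = (q - zero) * scale              # dequantize back to float
    return out

W = torch.randn(512, 512)
for gs in (512, 128, 64):  # 512 here means one scale for the whole row
    err = (quant_dequant(W, 4, gs) - W).pow(2).mean().item()
    print(f"groupsize {gs:>3}: mse {err:.6f}")
```

On random weights the error drops as the groups get smaller, which matches what the tables in the README show.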

DanielWe2 avatar Mar 10 '23 15:03 DanielWe2

Implementing group size with CUDA seems very difficult. And if you don't use CUDA, you won't get the speed benefit. Therefore, inference and the benchmark are implemented based on CUDA.
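
Roughly what the kernel has to do, written out in slow Python (my sketch; matmul_dequant is not the actual quant_cuda code): with group size, the scale and zero point change every groupsize input rows, and fusing that extra per-group lookup into an efficient 4-bit matmul kernel is the hard part.

```python
import torch

def matmul_dequant(x, qweight, scales, zeros, groupsize):
    """Emulate the work of a 4-bit matmul kernel, slowly, in Python.

    x:        [in_features]                    activations
    qweight:  [in_features, out_features]      4-bit values (unpacked to int here)
    scales:   [in_features // groupsize, out_features]
    zeros:    [in_features // groupsize, out_features]
    """
    out = torch.zeros(qweight.shape[1], dtype=x.dtype)
    for i in range(qweight.shape[0]):
        g = i // groupsize                                # which group this input row is in;
                                                          # without group size this is always 0
        w = (qweight[i].float() - zeros[g]) * scales[g]   # dequantize one row of weights
        out += x[i] * w
    return out

in_f, out_f, gs = 128, 64, 64
y = matmul_dequant(torch.randn(in_f),
                   torch.randint(0, 16, (in_f, out_f)),
                   torch.rand(in_f // gs, out_f),
                   torch.full((in_f // gs, out_f), 8.0),
                   gs)
```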

qwopqwop200 avatar Mar 10 '23 16:03 qwopqwop200

OK, if it cannot be used with group size, how did you generate the tables with the benchmark results for group size?

DanielWe2 avatar Mar 10 '23 16:03 DanielWe2

I got the results by running: python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --groupsize 64. This path is only implemented in PyTorch.
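
To spell out the idea (a hedged sketch, not the exact llama_eval code; perplexity is my own helper name): with the PyTorch-only path the quantized weights are simply de-quantized back to fp16, and the numbers in the tables come from a teacher-forced forward pass over the evaluation text, so no custom CUDA kernel is needed.

```python
import torch

@torch.no_grad()
def perplexity(model, input_ids, seqlen=2048):
    """Run the (fake-)quantized model over the eval text with teacher forcing
    and exponentiate the mean next-token loss. No generation involved."""
    model.eval()
    nlls = []
    for start in range(0, input_ids.shape[1] - seqlen + 1, seqlen):
        batch = input_ids[:, start:start + seqlen]
        # Hugging Face causal-LM models shift the labels internally, so passing
        # the same ids as labels yields the next-token loss directly.
        loss = model(batch, labels=batch).loss
        nlls.append(loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
```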

qwopqwop200 avatar Mar 10 '23 16:03 qwopqwop200

OK, so the only change is that it is not using "CUDA_VISIBLE_DEVICES=0"? And the downside is that it is slow that way?

DanielWe2 avatar Mar 10 '23 16:03 DanielWe2

"CUDA_VISIBLE_DEVICES=0" just makes it run on one GPU.

qwopqwop200 avatar Mar 10 '23 16:03 qwopqwop200

OK, I see. Without the benchmark or inference it will just call llama_eval for each dataset. And that can be used without CUDA, but actual inference or the benchmark needs CUDA, and that cannot be used with group size?

I have a lot to learn here, but to me that seems confusing. I mean, eval also seems to use the model and compare the model's output with the expected result from the dataset. What is the issue with using that function for normal inference?
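
If I understand it correctly, eval pushes known text through the model in one forward pass per chunk and compares the logits against the next tokens, while inference has to produce new tokens one at a time, roughly like this (my sketch; greedy_generate is not a function from the repo):

```python
import torch

@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens=50):
    """Autoregressive loop: one full model call per generated token."""
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits                        # [batch, seq, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids
```

Is the problem just that this loop would be far too slow on the PyTorch-only path?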

DanielWe2 avatar Mar 10 '23 16:03 DanielWe2

Benchmarks may work, but inference doesn't, because inference doesn't have llama_sequential.

qwopqwop200 avatar Mar 11 '23 02:03 qwopqwop200