GPTQ-for-LLaMa

Questions about group size

Open DanielWe2 opened this issue 1 year ago • 6 comments

From the research paper and the tables in the README it looks like group size 64 is very effective at improving the quality of the models, most noticeably in the smaller models or in the 3-bit version.

The tables suggest that group size is usable somehow, but the README also states that group size cannot be used with CUDA? But this whole project needs CUDA? I built a group-size 64 model, but I cannot run the benchmark or inference with it.

Is group size usable? If so, how?
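
For context, here is my rough mental model of what group size means, as a hedged sketch in plain PyTorch (the function name quant_dequant is mine, this is not the repo's code): instead of one quantization scale per output row, every group of e.g. 64 input weights gets its own scale and zero point, so the quantized values can follow the weight distribution more closely.

```python
import torch

def quant_dequant(W: torch.Tensor, wbits: int = 4, groupsize: int = 64) -> torch.Tensor:
    """Round-to-nearest quantize + dequantize with one scale/zero per group.

    Illustration only: GPTQ is smarter than plain rounding, but the
    group-size bookkeeping is the same idea.
    """
    maxq = 2 ** wbits - 1
    out = torch.empty_like(W)
    for s in range(0, W.shape[1], groupsize):
        g = W[:, s:s + groupsize]                                 # [out_features, group]
        wmin = g.min(dim=1, keepdim=True).values.clamp(max=0.0)
        wmax = g.max(dim=1, keepdim=True).values.clamp(min=0.0)
        scale = (wmax - wmin).clamp(min=1e-8) / maxq              # per-row scale for this group
        zero = torch.round(-wmin / scale)
        q = torch.clamp(torch.round(g / scale) + zero, 0, maxq)   # the stored 4-bit integers
        out[:, s:s + groupsize] = (q - zero) * scale              # dequantize back to float
    return out

W = torch.randn(512, 512)
for gs in (512, 128, 64):  # 512 here means one scale for the whole row
    err = (quant_dequant(W, 4, gs) - W).pow(2).mean().item()
    print(f"groupsize {gs:>3}: mse {err:.6f}")
```

On random weights the error drops as the groups get smaller, which matches what the tables in the README show.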

DanielWe2 avatar Mar 10 '23 15:03 DanielWe2

Implementing group size with CUDA seems very difficult. And if you don't use CUDA, you won't get the speed benefit. Therefore, inference and the benchmark are implemented based on CUDA.
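
Roughly what the kernel has to do, written out in slow Python (my sketch; matmul_dequant is not the actual quant_cuda code): with group size, the scale and zero point change every groupsize input rows, and fusing that extra per-group lookup into an efficient 4-bit matmul kernel is the hard part.

```python
import torch

def matmul_dequant(x, qweight, scales, zeros, groupsize):
    """Emulate the work of a 4-bit matmul kernel, slowly, in Python.

    x:        [in_features]                    activations
    qweight:  [in_features, out_features]      4-bit values (unpacked to int here)
    scales:   [in_features // groupsize, out_features]
    zeros:    [in_features // groupsize, out_features]
    """
    out = torch.zeros(qweight.shape[1], dtype=x.dtype)
    for i in range(qweight.shape[0]):
        g = i // groupsize                                # which group this input row is in;
                                                          # without group size this is always 0
        w = (qweight[i].float() - zeros[g]) * scales[g]   # dequantize one row of weights
        out += x[i] * w
    return out

in_f, out_f, gs = 128, 64, 64
y = matmul_dequant(torch.randn(in_f),
                   torch.randint(0, 16, (in_f, out_f)),
                   torch.rand(in_f // gs, out_f),
                   torch.full((in_f // gs, out_f), 8.0),
                   gs)
```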

qwopqwop200 avatar Mar 10 '23 16:03 qwopqwop200

OK, if it cannot be used with group size, how did you generate the tables with the benchmark results for group size?

DanielWe2 avatar Mar 10 '23 16:03 DanielWe2

I got the results by running: python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --groupsize 64. This path is only implemented in PyTorch.
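
To spell out the idea (a hedged sketch, not the exact llama_eval code; perplexity is my own helper name): with the PyTorch-only path the quantized weights are simply de-quantized back to fp16, and the numbers in the tables come from a teacher-forced forward pass over the evaluation text, so no custom CUDA kernel is needed.

```python
import torch

@torch.no_grad()
def perplexity(model, input_ids, seqlen=2048):
    """Run the (fake-)quantized model over the eval text with teacher forcing
    and exponentiate the mean next-token loss. No generation involved."""
    model.eval()
    nlls = []
    for start in range(0, input_ids.shape[1] - seqlen + 1, seqlen):
        batch = input_ids[:, start:start + seqlen]
        # Hugging Face causal-LM models shift the labels internally, so passing
        # the same ids as labels yields the next-token loss directly.
        loss = model(batch, labels=batch).loss
        nlls.append(loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
```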

qwopqwop200 avatar Mar 10 '23 16:03 qwopqwop200

OK, so the only change is that it is not using "CUDA_VISIBLE_DEVICES=0"? And the downside is that it is slow that way?

DanielWe2 avatar Mar 10 '23 16:03 DanielWe2

"CUDA_VISIBLE_DEVICES=0" just makes it run on one GPU.

qwopqwop200 avatar Mar 10 '23 16:03 qwopqwop200

OK, I see. Without the benchmark or inference it will just call llama_eval for each dataset. And that can be used without CUDA, but actual inference or the benchmark needs CUDA, and that cannot be used with group size?

I have a lot to learn here, but to me that seems confusing. I mean, eval also seems to use the model and compare the model's output with the expected result from the dataset. What is the issue with using that function for normal inference?
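
If I understand it correctly, eval pushes known text through the model in one forward pass per chunk and compares the logits against the next tokens, while inference has to produce new tokens one at a time, roughly like this (my sketch; greedy_generate is not a function from the repo):

```python
import torch

@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens=50):
    """Autoregressive loop: one full model call per generated token."""
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits                        # [batch, seq, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids
```

Is the problem just that this loop would be far too slow on the PyTorch-only path?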

DanielWe2 avatar Mar 10 '23 16:03 DanielWe2

Benchmarks may work, but inference doesn't, because inference doesn't have llama_sequential.

qwopqwop200 avatar Mar 11 '23 02:03 qwopqwop200