text-generation-webui
Allow users to load 2, 3 and 4 bit LLaMA models
This replaces `--load-in-4bit` with a more flexible `--llama-bits` argument, allowing 2, 3 and 4 bit models to be loaded as supported by the GPTQ implementation. I also fixed a loading issue with `.pt` files that are not in the root of the models folder: the loader still tried to load them from the root even when the file did not exist there.
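For readers following along, here is a minimal sketch of how a flag like this and the subfolder-aware `.pt` lookup could fit together. The helper name `find_quantized_checkpoint`, the checkpoint naming pattern, and the hard-coded model name are illustrative assumptions, not the PR's actual code:

```python
# Rough sketch, not the PR's actual code: the helper name and the
# checkpoint-naming pattern are assumptions made for illustration only.
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--llama-bits", type=int, default=0, choices=[0, 2, 3, 4],
                    help="Bit width of the pre-quantized GPTQ LLaMA model (0 = disabled).")
args = parser.parse_args()


def find_quantized_checkpoint(models_dir: Path, model_name: str, bits: int) -> Path:
    """Look for the quantized .pt file in the model's own folder first,
    then fall back to the models root, instead of assuming the root."""
    candidates = [
        models_dir / model_name / f"{model_name}-{bits}bit.pt",
        models_dir / f"{model_name}-{bits}bit.pt",
    ]
    for path in candidates:
        if path.exists():
            return path
    raise FileNotFoundError(f"No {bits}-bit checkpoint found for {model_name}")


if args.llama_bits > 0:
    checkpoint = find_quantized_checkpoint(Path("models"), "llama-7b", args.llama_bits)
    print(f"Would hand {checkpoint} to the GPTQ loader")
```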
2-bit LLaMA!? What's the loss in quality at that point? That's phenomenal if it's not significant!
> 2-bit LLaMA!? What's the loss in quality at that point? That's phenomenal if it's not significant!
From the benchmarks provided in the GPTQ repo, the loss in quality for 2-bit is huge; it's basically pointless to use a 2-bit model. The 3-bit quality loss is huge for 7B but significantly more reasonable for 13B. I'm planning to test 3-bit 30B (in an hour or so, when it converts) to check the loss in quality myself, as that hasn't been benchmarked.
I think that removing `--load-in-4bit` would lead to confusion now that it is already implemented and being used. Maybe let `--load-in-4bit` and `--llama-bits` coexist?

Also, for now this is restricted to LLaMA, but I assume that in the future other models will be quantized to 4-bit as well. So the name `--llama-bits` is not a good one.
> I think that removing `--load-in-4bit` would lead to confusion now that it is already implemented and being used. Maybe let `--load-in-4bit` and `--llama-bits` coexist?
Yeah, I can see it causing some confusion; we could certainly have them coexist to solve that. I was concerned about code mess and obsolete args, but I can edit the function to allow both to be used interchangeably.
> Also, for now this is restricted to LLaMA, but I assume that in the future other models will be quantized to 4-bit as well. So the name `--llama-bits` is not a good one.
Confusion was actually my reason for naming it `--llama-bits` instead of just `--bits`, as the latter would imply it could be used as a replacement for the standard `--load-in-8bit`, which it can't. I do understand why you wouldn't want it tied to a certain model family, though, as that could cause extra confusion down the line when it becomes more generalized.

Maybe `--quant`, to specify that it's for pre-quantized models? I'm open to ideas for a name replacement, unless you just want to have `--bits` and the confusion that may arise.
Both `--load-in-4bit` and `--llama-bits` now coexist. I changed the description of `--llama-bits` to better describe the functionality (including the support for pre-quantized 8-bit), but I'll leave the name until you respond.
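As a rough illustration of how the two flags might coexist: the flag names come from the discussion above, but the reconciliation logic below is an assumption, not the PR's actual implementation.

```python
# Illustrative sketch of letting the legacy flag and the new flag coexist by
# resolving both to a single bit-width value early on.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--load-in-4bit", action="store_true",
                    help="Legacy flag, kept for compatibility; equivalent to --llama-bits 4.")
parser.add_argument("--llama-bits", type=int, default=0,
                    help="Bit width of the pre-quantized GPTQ model (2, 3 or 4).")
args = parser.parse_args()

# Resolve both flags to a single value so the rest of the code path
# only has to check one variable.
gptq_bits = args.llama_bits or (4 if args.load_in_4bit else 0)
print(f"Resolved bit width: {gptq_bits or 'not quantized'}")
```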
> 2-bit LLaMA!? What's the loss in quality at that point? That's phenomenal if it's not significant!

> From the benchmarks provided in the GPTQ repo, the loss in quality for 2-bit is huge; it's basically pointless to use a 2-bit model. The 3-bit quality loss is huge for 7B but significantly more reasonable for 13B. I'm planning to test 3-bit 30B (in an hour or so, when it converts) to check the loss in quality myself, as that hasn't been benchmarked.
Did you get any results there, in terms of performance/hardware usage?
> Did you get any results there, in terms of performance/hardware usage?
I actually just made a pull request to the GPTQ repo with my results for 3 and 4 bit 30B. Since it's not merged yet, you will need to look here: https://github.com/ItsLogic/GPTQ-for-LLaMa#memory-usage
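For anyone wanting to reproduce numbers like those, a simple way to capture peak VRAM around a generation call with PyTorch (model and tokenizer setup omitted; exact figures will depend on the checkpoint and sequence length):

```python
# Minimal peak-VRAM measurement around a generation call; everything outside
# the measurement itself is omitted here.
import torch

torch.cuda.reset_peak_memory_stats()

# ... run model.generate(...) or a forward pass here ...

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM: {peak_gib:.2f} GiB")
```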
Some small changes:

- Since 4-bit LLaMA has become its own field of study, I have moved it into a separate file: `modules/quantized_LLaMA.py`
- Rename `--llama-bits` to `--gptq-bits`, since in practice GPTQ is what we are going to continue using for quantization.
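For orientation, a rough sketch of the kind of interface such a module could expose. Only the file name and the `--gptq-bits` flag come from the comment above; the function name, signature, and checkpoint layout are assumptions.

```python
# modules/quantized_LLaMA.py -- hypothetical sketch, not the repository's code.
from pathlib import Path

import torch


def load_quantized(model_name: str, gptq_bits: int, models_dir: str = "models"):
    """Locate and load a pre-quantized GPTQ checkpoint with the requested bit width."""
    checkpoint = Path(models_dir) / model_name / f"{model_name}-{gptq_bits}bit.pt"
    if not checkpoint.exists():
        raise FileNotFoundError(f"Expected quantized checkpoint at {checkpoint}")
    # In practice the GPTQ-for-LLaMa code builds the model skeleton and loads
    # these weights into it; here we only show where the checkpoint path and
    # the bit width would be handed over.
    return torch.load(checkpoint, map_location="cpu")
```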