text-generation-webui
Allow users to load 2, 3 and 4 bit LLaMA models
This replaces `--load-in-4bit` with a more flexible `--llama-bits` argument, allowing 2, 3 and 4 bit models to be loaded as supported by the GPTQ implementation. I also fixed a loading issue with `.pt` files that are not in the root of the models folder: the loader still tried to load them from the root even when the file did not exist there.
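For readers following along, here is a minimal sketch of how a flag like this and the subfolder-aware `.pt` lookup could fit together. The helper name `find_quantized_checkpoint`, the checkpoint naming pattern, and the hard-coded model name are illustrative assumptions, not the PR's actual code:

```python
# Rough sketch, not the PR's actual code: the helper name and the
# checkpoint-naming pattern are assumptions made for illustration only.
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--llama-bits", type=int, default=0, choices=[0, 2, 3, 4],
                    help="Bit width of the pre-quantized GPTQ LLaMA model (0 = disabled).")
args = parser.parse_args()


def find_quantized_checkpoint(models_dir: Path, model_name: str, bits: int) -> Path:
    """Look for the quantized .pt file in the model's own folder first,
    then fall back to the models root, instead of assuming the root."""
    candidates = [
        models_dir / model_name / f"{model_name}-{bits}bit.pt",
        models_dir / f"{model_name}-{bits}bit.pt",
    ]
    for path in candidates:
        if path.exists():
            return path
    raise FileNotFoundError(f"No {bits}-bit checkpoint found for {model_name}")


if args.llama_bits > 0:
    checkpoint = find_quantized_checkpoint(Path("models"), "llama-7b", args.llama_bits)
    print(f"Would hand {checkpoint} to the GPTQ loader")
```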
2-bit LLaMA!? What's the loss in quality at that point? That's phenomenal if it's not significant!
> 2-bit LLaMA!? What's the loss in quality at that point? That's phenomenal if it's not significant!
From the benchmarks provided in the GPTQ repo, the loss in quality for 2-bit is huge; it's basically pointless to use a 2-bit model. The 3-bit quality loss is huge for 7B but significantly more reasonable for 13B. I'm planning to test 3-bit 30B (in an hour or so, when it converts) to check the loss in quality myself, as that hasn't been benchmarked.
I think that removing `--load-in-4bit` would lead to confusion now that it is already implemented and being used. Maybe let `--load-in-4bit` and `--llama-bits` coexist?

Also, for now this is restricted to LLaMA, but I assume that in the future other models will be quantized to 4-bit as well. So the name `--llama-bits` is not a good one.
> I think that removing `--load-in-4bit` would lead to confusion now that it is already implemented and being used. Maybe let `--load-in-4bit` and `--llama-bits` coexist?
Yeah, I can see it causing some confusion; we could certainly have them coexist to solve that. I was concerned about code mess and obsolete args, but I can edit the function to allow both to be used interchangeably.
> Also, for now this is restricted to LLaMA, but I assume that in the future other models will be quantized to 4-bit as well. So the name `--llama-bits` is not a good one.
Confusion was actually my reason for naming it `--llama-bits` instead of just `--bits`, as the latter would imply it could be used as a replacement for the standard `--load-in-8bit`, which it can't. I do understand why you wouldn't want it tied to a certain model family, though, as that could cause extra confusion down the line when it becomes more generalized.

Maybe `--quant`, to specify that it's for pre-quantized models? I'm open to ideas for a name replacement, unless you just want to have `--bits` and the confusion that may arise.
Both `--load-in-4bit` and `--llama-bits` now coexist. I changed the description of `--llama-bits` to better describe the functionality (including the support for pre-quantized 8-bit), but I'll leave the name until you respond.
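As a rough illustration of how the two flags might coexist: the flag names come from the discussion above, but the reconciliation logic below is an assumption, not the PR's actual implementation.

```python
# Illustrative sketch of letting the legacy flag and the new flag coexist by
# resolving both to a single bit-width value early on.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--load-in-4bit", action="store_true",
                    help="Legacy flag, kept for compatibility; equivalent to --llama-bits 4.")
parser.add_argument("--llama-bits", type=int, default=0,
                    help="Bit width of the pre-quantized GPTQ model (2, 3 or 4).")
args = parser.parse_args()

# Resolve both flags to a single value so the rest of the code path
# only has to check one variable.
gptq_bits = args.llama_bits or (4 if args.load_in_4bit else 0)
print(f"Resolved bit width: {gptq_bits or 'not quantized'}")
```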
> 2-bit LLaMA!? What's the loss in quality at that point? That's phenomenal if it's not significant!

> From the benchmarks provided in the GPTQ repo, the loss in quality for 2-bit is huge; it's basically pointless to use a 2-bit model. The 3-bit quality loss is huge for 7B but significantly more reasonable for 13B. I'm planning to test 3-bit 30B (in an hour or so, when it converts) to check the loss in quality myself, as that hasn't been benchmarked.
Did you get any results there, in terms of performance/hardware usage?
> Did you get any results there, in terms of performance/hardware usage?
I actually just made a pull request to the GPTQ repo with my results for 3 and 4 bit 30B. Since it's not merged yet, you will need to look here: https://github.com/ItsLogic/GPTQ-for-LLaMa#memory-usage
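For anyone wanting to reproduce numbers like those, a simple way to capture peak VRAM around a generation call with PyTorch (model and tokenizer setup omitted; exact figures will depend on the checkpoint and sequence length):

```python
# Minimal peak-VRAM measurement around a generation call; everything outside
# the measurement itself is omitted here.
import torch

torch.cuda.reset_peak_memory_stats()

# ... run model.generate(...) or a forward pass here ...

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM: {peak_gib:.2f} GiB")
```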
Some small changes:

- Since 4-bit LLaMA has become its own field of study, I have moved it into a separate file: `modules/quantized_LLaMA.py`
- Rename `--llama-bits` to `--gptq-bits`, since in practice GPTQ is what we are going to continue using for quantization.
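For orientation, a rough sketch of the kind of interface such a module could expose. Only the file name and the `--gptq-bits` flag come from the comment above; the function name, signature, and checkpoint layout are assumptions.

```python
# modules/quantized_LLaMA.py -- hypothetical sketch, not the repository's code.
from pathlib import Path

import torch


def load_quantized(model_name: str, gptq_bits: int, models_dir: str = "models"):
    """Locate and load a pre-quantized GPTQ checkpoint with the requested bit width."""
    checkpoint = Path(models_dir) / model_name / f"{model_name}-{gptq_bits}bit.pt"
    if not checkpoint.exists():
        raise FileNotFoundError(f"Expected quantized checkpoint at {checkpoint}")
    # In practice the GPTQ-for-LLaMa code builds the model skeleton and loads
    # these weights into it; here we only show where the checkpoint path and
    # the bit width would be handed over.
    return torch.load(checkpoint, map_location="cpu")
```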