text-generation-webui
Add MPT quantized model support
I have tested it with:
Model: https://huggingface.co/4bit/mpt-7b-storywriter-4bit-128g
GPTQ: https://github.com/qwopqwop200/GPTQ-for-LLaMa/commit/5731aa11de56affe6e8c88cea66a171045ad1dce
It is usable with the following command:
python3 server.py --notebook --api --model 4bit_mpt-7b-storywriter-4bit-128g --trust-remote-code --wbits 4 --groupsize 128 --model_type mpt
Can you check if this also works for MOSS 4-bit?
https://github.com/oobabooga/text-generation-webui/blob/34970ea3af8f88c501e58fef2fc5c489c8df2743/modules/GPTQ_loader.py#L100
There is a hardcoded sequence length in _load_quant. Does it work with context sizes over 2048?
MPT-Storywriter should support contexts of up to 65k tokens.
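For reference, the context length a model declares can be read straight from its config before touching GPTQ_loader. A minimal sketch, assuming transformers is installed and the quantized repo ships MPT's custom configuration code:

```python
# Sketch: print the context length the model itself declares.
from transformers import AutoConfig

# trust_remote_code is needed because MPT ships a custom configuration class.
config = AutoConfig.from_pretrained("4bit/mpt-7b-storywriter-4bit-128g", trust_remote_code=True)

# MPT configs expose max_seq_len; other architectures use max_position_embeddings.
declared = getattr(config, "max_seq_len", getattr(config, "max_position_embeddings", None))
print(declared)  # the storywriter config declares 65536
```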
IMHO this is an important update. We need models with 65k context!
I find that this model loads fine when setting model_type to llama:
python server.py --model 4bit_mpt-7b-storywriter-4bit-128g --trust-remote-code --model_type llama
@mayaeary if I increase seqlen to 36000 and raise "Truncate the prompt up to this length" to 8192 under "Parameters", the model does generate, but this is hacky and I have no idea if it is the right way to do it (what even is seqlen?).
@oobabooga https://huggingface.co/4bit/mpt-7b-storywriter-4bit-128g/blob/main/config.json the model does define "max_seq_len": 65536, so you could stick with that.
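One way to avoid both the hardcoded value and hand-editing seqlen would be to derive it from the config. A rough sketch, not the actual GPTQ_loader code; get_seqlen and the 2048 fallback are illustrative assumptions:

```python
# Rough sketch: derive seqlen from the model's own config instead of hardcoding it.
from transformers import AutoConfig

def get_seqlen(model_dir, default=2048):
    # MPT uses "max_seq_len"; LLaMA-style configs use "max_position_embeddings".
    config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
    for attr in ("max_seq_len", "max_position_embeddings"):
        value = getattr(config, attr, None)
        if value:
            return value
    return default

# Something along these lines could then replace the hardcoded assignment in _load_quant:
# model.seqlen = get_seqlen(path_to_model)
```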
I'll close this PR because:
- MPT loads fine with model_type = llama
- MPT is not officially supported by GPTQ-for-LLaMa, so defining an "mpt" model_type is undefined behavior
It should soon be added in a more proper way to https://github.com/PanQiWei/AutoGPTQ, and it can already be loaded with --load-in-4bit starting from the 16-bit weights.
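For completeness, a minimal sketch of that 16-bit route, assuming transformers with bitsandbytes and accelerate installed (roughly what --load-in-4bit corresponds to; the model id and generation settings are just placeholders):

```python
# Sketch: quantize the 16-bit MPT weights to 4-bit on the fly with bitsandbytes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mosaicml/mpt-7b-storywriter"  # 16-bit source weights

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # MPT ships custom modeling code
    load_in_4bit=True,       # bitsandbytes 4-bit quantization at load time
    device_map="auto",
)

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```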