
Support for MPT 4bit

Open Nixellion opened this issue 1 year ago • 16 comments

Description

Support for loading 4bit quantized MPT models

Additional Context

Occam released it and added support for loading it in his GPTQ fork and his KoboldAI fork, which may be useful as a reference for the changes that need to be made.

https://huggingface.co/OccamRazor/mpt-7b-storywriter-4bit-128g

Nixellion avatar May 07 '23 06:05 Nixellion

I am also interested. The extremely long context opens up new possibilities. I think this would be a really attractive feature to have.

FrederikAbitz avatar May 07 '23 11:05 FrederikAbitz

This was working last night. It broke today right around the "superbooga" commit, so I reset to 8c06eeaf8... I don't know if that's exactly where it broke, but it does work on that commit.

But it won't load the quantized https://huggingface.co/OccamRazor/mpt-7b-storywriter-4bit-128g

Traceback (most recent call last):
  File "~/text-generation-webui/server.py", line 59, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "~/text-generation-webui/modules/models.py", line 159, in load_model
    model = load_quantized(model_name)
  File "~/text-generation-webui/modules/GPTQ_loader.py", line 149, in load_quantized
    exit()
  File "~/anaconda3/envs/text/lib/python3.10/_sitebuiltins.py", line 26, in __call__
    raise SystemExit(code)
SystemExit: None

Or, on the latest 9754d6a (with high CPU usage on one core)...

~/.cache/huggingface/modules/transformers_modules/OccamRazor_mpt-7b-storywriter-4bit-128g/attention.py:148: UserWarning: Using attn_impl: torch. If your model does not use alibi or prefix_lm we recommend using attn_impl: flash otherwise we recommend using attn_impl: triton.
  warnings.warn('Using attn_impl: torch. If your model does not use alibi or ' + 'prefix_lm we recommend using attn_impl: flash otherwise ' + 'we recommend using attn_impl: triton.')
You are using config.init_device='cpu', but you can also use config.init_device="meta" with Composer + FSDP for fast initialization.

Both commits load https://huggingface.co/mosaicml/mpt-7b-storywriter and santacoder.

cornpo avatar May 07 '23 20:05 cornpo

Are you using the --trust-remote-code flag?
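
MPT ships its own modelling code on the Hugging Face Hub, and that flag is what allows the code to be executed; under the hood it boils down to something like this (untested sketch, shown with the unquantized checkpoint for illustration):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mosaicml/mpt-7b-storywriter"  # unquantized variant, just for illustration

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,  # without this, MPT's custom model classes can't be loaded
)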

ClayShoaf avatar May 08 '23 04:05 ClayShoaf

Is it okay to use llama for this model? When I tried to set the model type to mpt I got this error:

ERROR:Unknown pre-quantized model type specified. Only 'llama', 'opt' and 'gptj' are supported
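
That error looks like it comes from modules/GPTQ_loader.py, which only knows how to set up those three architectures. Simplified (not the actual code), the guard is roughly:

KNOWN_TYPES = {"llama", "opt", "gptj"}

def check_model_type(model_type: str) -> None:
    # Anything outside the known set is rejected; the real loader then calls
    # exit(), which is the SystemExit shown in the traceback earlier in this thread.
    if model_type.lower() not in KNOWN_TYPES:
        print("Unknown pre-quantized model type specified. "
              "Only 'llama', 'opt' and 'gptj' are supported")
        raise SystemExit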

When using model type llama it does work, but the output is nonsensical:

Common sense questions and answers

Question: Who's the president of the United States?

Factual answer: The President is a man named George Bush, but I'm not sure what he looks like. What do you think?"—and so on," says my friend David Egan (a former colleague from Stony Brook University) in his book The Last Days Of America, which was published by Random House Canada last year."I donned an orange-colored baseball cap with white stripes running down each side; it had been given to me as part-payment for services rendered at one time or another over several years ago – if anyone knows where that guy has gone now would be appreciated! It seems unlikely we'll ever see him again!" In fact there are two things about this story — neither will die out completely until after midnight tonight...but then who cares?! This article may well become obsolete before long anyway because even though they're both dead already - "It isn't going anywhere,' said Darth Vader when asked how much money does she want?'" If your name wasn't

I also get 1.12 tokens/s on an RTX 3060 12GB. The speed is much better using Occam's fork, but the quality is the same.

On top of that, it takes 9 minutes to load the model, both with Ooba's webui and Occam's fork of TavernAI.

silvestron avatar May 08 '23 05:05 silvestron

@silvestron that's pretty much what I would expect this model to do. It didn't make sense for everyone to get all excited about 65K context size with no knowledge about whether or not the model would actually be coherent. Given the track record of non-llama models, it was unlikely that it would be up to par in that department. As for the slow speeds, I would guess that's a byproduct of the huge context window, even if you're not using the full context, but I could be wrong.

ClayShoaf avatar May 08 '23 12:05 ClayShoaf

@jpturcotte Can you run git show inside the text-generation-webui folder to see what commit you're on? How do you get it to run, though? I have to manually specify the model type as llama.

silvestron avatar May 08 '23 18:05 silvestron

Is that with the cache set to true or false in the model's config file?

NicolasMejiaPetit avatar May 08 '23 18:05 NicolasMejiaPetit

@NickWithBotronics It's "use_cache": false in the config file; I haven't touched it. How much does the webui actually respect the config file, though? It doesn't seem to care about the model type; it always ignores it.

silvestron avatar May 08 '23 19:05 silvestron

I replaced git pull with git checkout 85238de in webui.py, but it looks like going back to an older commit breaks things. Maybe running a clean install from that commit would be better.

silvestron avatar May 08 '23 22:05 silvestron

I'd also add that on a working, up-to-date installation, I tried using llama, gptj, and opt as the model type, and all gave the same results.

silvestron avatar May 08 '23 22:05 silvestron

Are we talking about the 4bit model? That doesn't work if you don't specify a model type. See #1894; I get the same error if I don't give it a model type.

silvestron avatar May 09 '23 01:05 silvestron

Regarding the config file: I was testing out the WizardLM model when it first came out; its cache was set to false, I set it to true, and got 5x faster responses.

NicolasMejiaPetit avatar May 09 '23 10:05 NicolasMejiaPetit

That actually made token generation faster; however, the initialization time, which takes 9 minutes on my hardware, didn't change. The config has "init_device": "cpu", and the console says you can change it to meta for faster initialization, but that didn't work for me. Changing it to cuda didn't work either because it runs out of VRAM (12GB in my case). Maybe with more VRAM the initialization would be faster on the GPU.
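
For what it's worth, both settings can also be overridden at load time instead of editing config.json; with plain transformers (unquantized checkpoint, VRAM permitting) something like this should work, though I haven't verified it:

import torch
from transformers import AutoConfig, AutoModelForCausalLM

name = "mosaicml/mpt-7b-storywriter"
config = AutoConfig.from_pretrained(name, trust_remote_code=True)
config.use_cache = True      # the setting behind the 5x generation speedup above
config.init_device = "cuda"  # or "meta"; "cpu" is the slow default

model = AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)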

silvestron avatar May 09 '23 18:05 silvestron

Great, I'm also on 12GB of VRAM, so unless I somehow get MPT-7B 4bit working, it's never running on my GPU. I read somewhere it can use up to 20GB when doing full inference, and it's got a context window of essentially a book. It's even rougher because, according to my terminal, this model doesn't support auto-devices.

NicolasMejiaPetit avatar May 09 '23 19:05 NicolasMejiaPetit

@silvestron Darnit, wrong thread! Sorry for getting your hopes up...

jpturcotte avatar May 10 '23 18:05 jpturcotte

@jpturcotte All good, I couldn't do much with this model without more VRAM anyway. I guess multi-GPU is going to be the only way to run models that can handle this many tokens, at least for now.

silvestron avatar May 10 '23 21:05 silvestron

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.

github-actions[bot] avatar Aug 19 '23 23:08 github-actions[bot]