text-generation-webui
"Transformers bump" commit ruins gpt4-x-alpaca if using an RTX3090: model loads, but talks gibberish
Describe the bug
Multiple people in the Reddit thread linked below mention that the model is not working properly on high-end RTX cards (3090, 3090 Ti, 4090). This is seemingly caused by this commit: https://github.com/oobabooga/text-generation-webui/commit/113f94b61ee0e85bd791992da024cb5fc6beac93
I can confirm that using my 3060 and the main branch there is no issue at all; this seemingly only affects higher-end cards. It is also worth mentioning that in the case of the 3060 I'm using --auto-devices and --gpu-memory 8.
My case:
3090 24GB 8-bit, model:gpt4-x-alpaca, preset:llama-creative:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a poem about the transformers Python library.
Mention the word "large language models" in that poem.
### Response:
, large language models
I'm learning to speak English
I'm trying to learn Python
But I don't know how to code
[...]
There's an active discussion of this on Reddit where I reported it first: https://www.reddit.com/r/Oobabooga/comments/12ez276/3060_vs_3090_same_model_and_presets_but_very/
Seemingly, running git checkout 5f4f38ca5d11bd1739c0b99e26bb644637a04e0a followed by pip install protobuf==3.20 resolves the problem.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
Git clone the newest version of the repo and try to run gpt4-x-alpaca using a 3090, 3090 Ti, or 4090.
Screenshot
Logs
No errors.
System Info
Ubuntu 22.04.2
ASUS TUF RTX 3090 24GB
You may need to re-download and overwrite the tokenizer files. The transformers library changed the format in a way that requires reconversion.
https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#option-1-pre-converted-weights
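For anyone who wants to script the overwrite, here is a minimal sketch using huggingface_hub. The repo id and local model directory are placeholder assumptions (substitute whichever reconverted repo and local folder apply to you), and you can skip any file the repo does not provide.

# Sketch: overwrite a local model's tokenizer files with freshly converted ones.
# REPO_ID and MODEL_DIR below are placeholders, not values prescribed in this thread.
import shutil
from huggingface_hub import hf_hub_download

REPO_ID = "chavinlo/gpt4-x-alpaca"   # assumption: any repo with post-bump tokenizer files
MODEL_DIR = "models/gpt4-x-alpaca"   # assumption: your local model folder used by the webui

for fname in (
    "tokenizer_config.json",
    "tokenizer.model",
    "special_tokens_map.json",
    "generation_config.json",
):
    cached = hf_hub_download(repo_id=REPO_ID, filename=fname)  # downloads into the HF cache
    shutil.copy2(cached, f"{MODEL_DIR}/{fname}")               # overwrite the local copy
    print(f"replaced {fname}")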
Replacing these 4 files with those from the updated llama-13b conversion fixes the incoherent generation that starts with a comma:
- tokenizer_config.json
- tokenizer.model
- special_tokens_map.json
- generation_config.json
But then it seems to ignore the EOS token and starts generating random text after it's finished. Not sure why.
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a poem about the transformers Python library.
Mention the word "large language models" in that poem.
### Response:
There once was a library named Transformers,
Whose power could not be ignored,
It allowed for manipulation and molding,
Of text data, both large and small.
With the help of this mighty tool,
Developers wrote with gusto,
Creating tasks and scripts so clever,
That even the largest language models would bow. # Hydroptila luctuosa
Hydroptila luctuosa är en nattsländeart som beskrevs av Banks 1904. Hydroptila luctuosa ingår i släktet Hydroptila och familjen ryssjenattsländerna. Inga underarter finns
I have the same issue, in which the model starts talking nonsense after a successful answer. It seems to happen with different models (tested with llama-30b-4bit-128g, llama-13b-4bit-128g, and Alpaca-30b-4bit-128g). In chat mode it gives a couple of normal answers and then starts spewing random info (sometimes in Polish or French, weirdly).
Feels related to #900 and #860
I am relatively new to this, so I haven't played much with the tool yet, but when I first tried it some days ago (less than a week), it did not seem to have this issue at all, or at least I cannot recall it.
System Info
OS: Windows 10
GPU: NVIDIA RTX4090
CPU: AMD 1920x
RAM: 64 GB
I'm not using text-generation-webui (I'm writing code which imports directly from the transformers git repo), but I noticed the same issue. The way I fixed it was to use LlamaTokenizerFast instead of LlamaTokenizer.
LlamaTokenizerFast uses Hugging Face's Rust tokenizers library of the same name, while LlamaTokenizer should use sentencepiece. There is probably some discrepancy there.
(For the record, I got the model from https://huggingface.co/chavinlo/gpt4-x-alpaca, so I think my tokenizers are up to date. I'm running on a 4090.)
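If you want to check whether the two tokenizer classes actually disagree for your checkout, a quick sketch along these lines should show it (the model path is a placeholder assumption):

# Sketch: compare the sentencepiece-backed LlamaTokenizer with LlamaTokenizerFast
# for the same local model folder.
from transformers import LlamaTokenizer, LlamaTokenizerFast

MODEL_DIR = "models/gpt4-x-alpaca"   # assumption: local model folder

slow = LlamaTokenizer.from_pretrained(MODEL_DIR)      # sentencepiece-backed
fast = LlamaTokenizerFast.from_pretrained(MODEL_DIR)  # Rust tokenizers-backed

text = "Write a poem about the transformers Python library."
print("slow ids:", slow(text)["input_ids"])
print("fast ids:", fast(text)["input_ids"])
print("slow eos_token_id:", slow.eos_token_id)
print("fast eos_token_id:", fast.eos_token_id)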
I find that replacing gpt4-x-alpaca's tokenizer.model, tokenizer_config.json, and special_tokens_map.json with the newly converted llama tokenizers fixes the problem for me.
Using llama-30b-4bit-128g downloaded here: https://huggingface.co/Neko-Institute-of-Science/LLaMA-30B-4bit-128g
This is getting ridiculous...

Using gpt4-x-alpaca-13b-native-4bit-128g
from https://huggingface.co/anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g/tree/main
Response to "hi":
. when- during on in on but... while............................................... trust.........................................................................................................................h..............
Or this, when I replace these files with the llama-13b ones:
- tokenizer_config.json
- tokenizer.model
- special_tokens_map.json
- generation_config.json
Response to "hi":
-21(42-°--2-22-2--2-2 (--22---2-22121212-22-3-1-2-0---2--2----------------2-2---2-2-2---2-----------------------------------------------------------
Installed with the automatic installer
Specs:
OS: Windows 11 2262.1413
GPU: NVIDIA RTX4090
CPU: AMD 5900X
RAM: 128 GB
vicuna-13b-GPTQ-4bit-128g
works fine
Is it just this model? Because I gave in, downloaded both versions, and then saw it do this. Maybe it's related to act-order + true-sequential + group size together, and triton vs CUDA. I have not seen it happen with any other models, and I didn't update or change anything related to tokenizers, at least when I used the "cuda" version. But it does hallucinate a lot.
I did get it working after changing the tokenizer files, but now, after responding correctly to the prompt, it keeps generating random text. With that many problems, wouldn't it be better to return to the older version of transformers?
The tokenizer is broken in the old version. It adds extra spaces to generations and breaks the stopping_criteria in chat mode.
Got it. Is there a way to stop the model from generating random text after it has finished responding to the prompt? That is the only problem I'm having at the moment, and it's not only with this model; it also happens with the Llama 13B model when using the Alpaca LoRA.
I'm using an RTX 3060 12GB with the gpt-x-alpaca-13b-native-4bit-128g-cuda.pt model and I get almost the exact same output. Has anyone found a working tokenizer that solves this?
Assistant Hello there!
You Hi!
Assistant . when- during onon in. but. while............................................. trust................... (.................................................................................
Good news: I got the CUDA version of gpt4-x-alpaca working by removing the gpt-x-alpaca-13b-native-4bit-128g.pt file from the directory and keeping only the one cuda .pt file.
Edit: While it doesn't spit out gibberish, it often completely misunderstands the prompt and replies with nonsense answers.
This seems to have fixed the derailing at the end of generations: https://github.com/oobabooga/text-generation-webui/commit/a3085dba073fe8bdcfb5120729a84560f5d024c3
The question is why setting this manually is necessary. It could be:
- A bug in the transformers library
- A bug in the converted tokenizer files
- Me using the transformers library incorrectly
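For reference, this is roughly what passing the EOS id explicitly to generate() looks like with plain transformers. It is only a sketch of the general pattern with placeholder paths and prompt, not the exact code from the commit; device_map="auto" also assumes accelerate is installed.

# Sketch: pass eos_token_id (and pad_token_id) explicitly to generate(),
# instead of relying on the values baked into the converted tokenizer/config files.
from transformers import AutoModelForCausalLM, LlamaTokenizer

MODEL_DIR = "models/gpt4-x-alpaca"   # assumption: local model folder

tokenizer = LlamaTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, device_map="auto")

prompt = "### Instruction:\nWrite a haiku about tokenizers.\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=200,
    eos_token_id=2,   # LLaMA's </s>; the broken tokenizer files report 0 instead
    pad_token_id=2,   # avoid the missing pad_token_id warning in open-ended generation
)
print(tokenizer.decode(output[0], skip_special_tokens=True))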
Just pulled this change, and gpt4-x-alpaca seems to be working much better now. No gibberish, and it's coherent and actually listens to the prompt.
Sadly the issue persists with gpt4-x-alpaca-13b-native-4bit-128g
in the https://github.com/oobabooga/text-generation-webui/commit/a3085dba073fe8bdcfb5120729a84560f5d024c3 commit
I have reconverted llama-7b and compared the resulting tokenizer files to the ones in Safe-LLaMA-HF-v2 (4-04-23) by @USBHost. Most files are identical, except for two: special_tokens_map.json and tokenizer_config.json.
Here is a comparison between the two conversions:
>>> from transformers import LlamaTokenizer
# My conversion
>>> tokenizer = LlamaTokenizer.from_pretrained('/tmp/converted/', clean_up_tokenization_spaces=True)
>>> print(tokenizer.eos_token_id)
2
# USBHost
>>> tokenizer = LlamaTokenizer.from_pretrained('/tmp/Safe-LLaMA-HF-v2 (4-04-23)/llama-7b', clean_up_tokenization_spaces=True)
>>> print(tokenizer.eos_token_id)
0
And here are the contents of the files:
special_tokens_map.json
Mine:
{
"bos_token": {
"content": "<s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "</s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
}
}
USBHost:
{}
tokenizer_config.json
Mine:
{
"add_bos_token": true,
"add_eos_token": false,
"bos_token": {
"__type": "AddedToken",
"content": "<s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"clean_up_tokenization_spaces": false,
"eos_token": {
"__type": "AddedToken",
"content": "</s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"model_max_length": 1000000000000000019884624838656,
"pad_token": null,
"sp_model_kwargs": {},
"tokenizer_class": "LlamaTokenizer",
"unk_token": {
"__type": "AddedToken",
"content": "<unk>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
}
}
USBHost:
{"bos_token": "", "eos_token": "", "model_max_length": 1000000000000000019884624838656, "tokenizer_class": "LlamaTokenizer", "unk_token": ""}
Cc @USBHost @Ph0rk0z
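For anyone who would rather patch a model folder in place than re-download it, here is a minimal sketch that writes the two corrected files shown above; the model directory is a placeholder assumption.

# Sketch: write the special-token definitions shown above into a model folder,
# which is what "replace special_tokens_map.json and tokenizer_config.json" amounts to.
import json
import os

MODEL_DIR = "models/llama-7b"   # assumption: whichever converted model folder you are fixing


def added_token(content):
    # Token entry in the format produced by the new conversion script
    return {"content": content, "lstrip": False, "normalized": True,
            "rstrip": False, "single_word": False}


special_tokens_map = {
    "bos_token": added_token("<s>"),
    "eos_token": added_token("</s>"),
    "unk_token": added_token("<unk>"),
}

tokenizer_config = {
    "add_bos_token": True,
    "add_eos_token": False,
    "bos_token": {"__type": "AddedToken", **added_token("<s>")},
    "clean_up_tokenization_spaces": False,
    "eos_token": {"__type": "AddedToken", **added_token("</s>")},
    "model_max_length": 1000000000000000019884624838656,
    "pad_token": None,
    "sp_model_kwargs": {},
    "tokenizer_class": "LlamaTokenizer",
    "unk_token": {"__type": "AddedToken", **added_token("<unk>")},
}

with open(os.path.join(MODEL_DIR, "special_tokens_map.json"), "w") as f:
    json.dump(special_tokens_map, f, indent=2)
with open(os.path.join(MODEL_DIR, "tokenizer_config.json"), "w") as f:
    json.dump(tokenizer_config, f, indent=2)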
Sadly the issue persists with
gpt4-x-alpaca-13b-native-4bit-128g
in the a3085db commit
Have you tried removing the non-cuda .pt file out of the directory and only having the cuda version? That solved the gibberish for me.
Oh my.. I thought I did, but I deleted the cuda .pt instead. Well, I'm downloading it again. Thank you.
@oobabooga I had the same issue (generating random text after finishing the prompt) using decapoda-research/llama-7b-hf with mmosiolek/polpaca-lora-7b on a 3080 Ti. I assumed it was an issue with the LoRA, but it stopped happening with your special_tokens_map.json and tokenizer_config.json from an hour ago. I also tried it just now with the tloen/alpaca-lora-7b LoRA: same issue with the original JSONs, and yours fix it.
The patch seems to be working. Thank you so much. I'm getting sense out of gpt4-x-alpaca.. woot
So... tl;dr: the new transformers breaks quants, and the patch is to change the contents of special_tokens_map.json and tokenizer_config.json to match ooba's content here: https://github.com/oobabooga/text-generation-webui/issues/931#issuecomment-1501259027 ?
You may need to re-download and overwrite the tokenizer files. The transformers library changed the format in a way that requires reconversion.
https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#option-1-pre-converted-weights
I used them from here https://huggingface.co/chavinlo/gpt4-x-alpaca and everything works fine.
So I guess just paste in those 2 files over all my tokenizers and call it a day :)
~~... I'm still getting gibberish~~
I got it by:
- downloading the model from https://huggingface.co/Neko-Institute-of-Science/LLaMA-65B-HF/tree/main
- replacing special_tokens_map.json and tokenizer_config.json with the ones here: https://huggingface.co/chavinlo/gpt4-x-alpaca
I replaced all the tokenizers on my llama models, including alpaca-native and it all seems to be working now. No gibberish or too much hallucination.. at least in chat mode.
Relevant: https://huggingface.co/hf-internal-testing/llama-tokenizer/tree/main Some stuff changed over there in the last few days.
There is also a LlamaTokenizerFast now (no idea what it does) https://huggingface.co/docs/transformers/main/model_doc/llama#transformers.LlamaTokenizerFast
OK, so I'm trying to gather all the info I can about this gibberish issue, as it appears to persist for me regardless of tokenizer config, as per this comment in #1029, with @CryptoRUSHGav mentioning in a follow-up comment that using the triton branch of GPTQ resolved the problem for him. While this seems like a possible workaround, this comment on #734 seems to indicate that triton is much slower than CUDA, so I don't consider that to be a good solution (I'm also too lazy to install WSL, though I will do so at some point to do exhaustive testing).
I am testing on W10, EPYC 7542 x GV100+GTX1080Ti, 64f5c90, fresh install w/ install.bat
The following models perform as expected:
- anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g
- anon8231489123/vicuna-13b-GPTQ-4bit-128g
- MetaIX/Alpaca-30B-Int4
- elinas/llama-30b-int4
while these return gibberish or blank gens, which leads me to believe there's an issue specifically with the Neko model quantization (are these triton-only? should I swap branches in GPTQ? guidance/thoughts appreciated):
- Neko-Institute-of-Science/LLaMA-65B-4bit-128g (gibberish)
- Neko-Institute-of-Science/LLaMA-30B-4bit-128g (blank)
If anyone has a known-good 65B/30B LLaMA for use with the latest commit, please point me in the right direction; otherwise I will check out the following as time permits, and once I get some extra RAM I will do the conversions myself if I can't get a pre-converted one running by then. Cheers folks!
- hayooucom (download error)
- maderix (fail, has a size mismatch torch.Size([22016, 1]) vs torch.Size([1, 22016]), same as referenced in #668; will use the torrent from that issue next, as the other repos I linked here are of a similar vintage)
- TianXxx
- kuleshov
@thot-experiment well currently ooba is broken for whatever reason.
while these return gibberish or blank gens
This seems to be more and more of a weird issue. I used to get this exact thing: on one load I would get gibberish... a few loads later I would get blank generations. Then I'd reload and get gibberish again.
The only way I fixed this issue on my end was to nuke everything. I only kept the model folder.
Also, in my tests Neko's models work on both triton and CUDA from qwopqwop200.
The only way I fix this issue on my end was to nuke everything. I only kept the model folder.
I have done this and the issue persists with the Neko models. Some other models work fine (as listed here) on the current commit, so I wouldn't quite characterize ooba as "broken", but there's definitely something going on. I would be interested in a known working commit to roll back to if anyone knows of one (either of ooba or GPTQ).
So does
Also, in my tests Neko's models work on both triton and CUDA from qwopqwop200.
mean that your previous comment is no longer the case? What is your system/commit?