text-generation-webui
"Transformers bump" commit ruins gpt4-x-alpaca if using an RTX3090: model loads, but talks gibberish
Describe the bug
Multiple people in the Reddit thread linked below mention that the model is not working properly on high-end RTX cards (3090, 3090 Ti, 4090). This is seemingly caused by this commit: https://github.com/oobabooga/text-generation-webui/commit/113f94b61ee0e85bd791992da024cb5fc6beac93
I can confirm that using my 3060 and the main branch there is no issue at all; this seemingly only affects higher-end cards. It is also worth mentioning that in the case of the 3060 I'm using --auto-devices and --gpu-memory 8.
My case:
3090 24GB 8-bit, model:gpt4-x-alpaca, preset:llama-creative:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a poem about the transformers Python library.
Mention the word "large language models" in that poem.
### Response:
, large language models
I'm learning to speak English
I'm trying to learn Python
But I don't know how to code
[...]
There's an active discussion of this on Reddit where I reported it first: https://www.reddit.com/r/Oobabooga/comments/12ez276/3060_vs_3090_same_model_and_presets_but_very/
Seemingly, running git checkout 5f4f38ca5d11bd1739c0b99e26bb644637a04e0a followed by pip install protobuf==3.20 resolves the problem.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
Git clone the newest version of the repo and try to run gpt4-x-alpaca using a 3090, 3090 Ti, or 4090.
Screenshot
Logs
No errors.
System Info
Ubuntu 22.04.2
ASUS TUF RTX 3090 24GB
You may need to re-download and overwrite the tokenizer files. The transformers library changed the format in a way that requires reconversion.
https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#option-1-pre-converted-weights
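For anyone who wants to script the overwrite, here is a minimal sketch using huggingface_hub. The repo id and local model directory are placeholder assumptions (substitute whichever reconverted repo and local folder apply to you), and you can skip any file the repo does not provide.

# Sketch: overwrite a local model's tokenizer files with freshly converted ones.
# REPO_ID and MODEL_DIR below are placeholders, not values prescribed in this thread.
import shutil
from huggingface_hub import hf_hub_download

REPO_ID = "chavinlo/gpt4-x-alpaca"   # assumption: any repo with post-bump tokenizer files
MODEL_DIR = "models/gpt4-x-alpaca"   # assumption: your local model folder used by the webui

for fname in (
    "tokenizer_config.json",
    "tokenizer.model",
    "special_tokens_map.json",
    "generation_config.json",
):
    cached = hf_hub_download(repo_id=REPO_ID, filename=fname)  # downloads into the HF cache
    shutil.copy2(cached, f"{MODEL_DIR}/{fname}")               # overwrite the local copy
    print(f"replaced {fname}")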
Replacing these 4 files with those from the updated llama-13b conversion fixes the incoherent generation that starts with a comma:
- tokenizer_config.json
- tokenizer.model
- special_tokens_map.json
- generation_config.json
But then it seems to ignore the EOS token and starts generating random text after it's finished. Not sure why.
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a poem about the transformers Python library.
Mention the word "large language models" in that poem.
### Response:
There once was a library named Transformers,
Whose power could not be ignored,
It allowed for manipulation and molding,
Of text data, both large and small.
With the help of this mighty tool,
Developers wrote with gusto,
Creating tasks and scripts so clever,
That even the largest language models would bow. # Hydroptila luctuosa
Hydroptila luctuosa är en nattsländeart som beskrevs av Banks 1904. Hydroptila luctuosa ingår i släktet Hydroptila och familjen ryssjenattsländerna. Inga underarter finns
I have the same issue, in which the model starts talking nonsense after a successful answer. It seems to happen with different models (tested with llama-30b-4bit-128g, llama-13b-4bit-128g, and Alpaca-30b-4bit-128g). In chat mode it gives a couple of normal answers and then starts spewing random info (sometimes in Polish or French, weirdly).
Feels related to #900 and #860
I am relatively new to this, so I haven't played much with the tool yet, but when I first tried it some days ago (less than a week), it did not seem to have this issue at all, or at least I cannot recall it.
System Info
OS: Windows 10
GPU: NVIDIA RTX4090
CPU: AMD 1920x
RAM: 64 GB
I'm not using text-generation-webui (I'm writing code which imports directly from the transformers git repo), but I noticed the same issue. The way I fixed it was to use LlamaTokenizerFast instead of LlamaTokenizer.
LlamaTokenizerFast uses Hugging Face's Rust tokenizers library of the same name, while LlamaTokenizer should use sentencepiece. There is probably some discrepancy there.
(For the record, I got the model from https://huggingface.co/chavinlo/gpt4-x-alpaca, so I think my tokenizers are up to date. I'm running on a 4090.)
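If you want to check whether the two tokenizer classes actually disagree for your checkout, a quick sketch along these lines should show it (the model path is a placeholder assumption):

# Sketch: compare the sentencepiece-backed LlamaTokenizer with LlamaTokenizerFast
# for the same local model folder.
from transformers import LlamaTokenizer, LlamaTokenizerFast

MODEL_DIR = "models/gpt4-x-alpaca"   # assumption: local model folder

slow = LlamaTokenizer.from_pretrained(MODEL_DIR)      # sentencepiece-backed
fast = LlamaTokenizerFast.from_pretrained(MODEL_DIR)  # Rust tokenizers-backed

text = "Write a poem about the transformers Python library."
print("slow ids:", slow(text)["input_ids"])
print("fast ids:", fast(text)["input_ids"])
print("slow eos_token_id:", slow.eos_token_id)
print("fast eos_token_id:", fast.eos_token_id)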
I find that replacing gpt4-x-alpaca's tokenizer.model, tokenizer_config.json, and special_tokens_map.json with the newly converted llama tokenizers fixes the problem for me.
Using llama-30b-4bit-128g downloaded here: https://huggingface.co/Neko-Institute-of-Science/LLaMA-30B-4bit-128g
This is getting ridiculous...

Using gpt4-x-alpaca-13b-native-4bit-128g
from https://huggingface.co/anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g/tree/main
Response to "hi":
. when- during on in on but... while............................................... trust.........................................................................................................................h..............
Or this, when I replace these files with the llama-13b ones:
- tokenizer_config.json
- tokenizer.model
- special_tokens_map.json
- generation_config.json
Response to "hi":
-21(42-°--2-22-2--2-2 (--22---2-22121212-22-3-1-2-0---2--2----------------2-2---2-2-2---2-----------------------------------------------------------
Installed with the automatic installer
Specs:
OS: Windows 11 2262.1413
GPU: NVIDIA RTX4090
CPU: AMD 5900X
RAM: 128 GB
vicuna-13b-GPTQ-4bit-128g
works fine
Is it just this model? Because I gave in, downloaded both versions, and then saw it do this. Maybe it's related to act-order + true-sequential + group size together, and triton vs CUDA. I have not seen it happen with any other models, and I didn't update or change anything related to tokenizers, at least when I used the "cuda" version. But it does hallucinate a lot.
I did get it working after changing the tokenizer files, but now, after responding correctly to the prompt, it keeps generating random text. With that many problems, wouldn't it be better to return to the older version of transformers?
The tokenizer is broken in the old version. It adds extra spaces to generations and breaks the stopping_criteria in chat mode.
Got it. Is there a way to stop the model from generating random text after it has finished responding to the prompt? That is the only problem I'm having at the moment, and it's not only with this model; it also happens with the Llama 13B model when using the Alpaca LoRA.
I'm using an RTX 3060 12GB with the gpt-x-alpaca-13b-native-4bit-128g-cuda.pt model and I get almost the exact same output. Has anyone found a working tokenizer that solves this?
Assistant Hello there!
You Hi!
Assistant . when- during onon in. but. while............................................. trust................... (.................................................................................
Good news: I got the CUDA version of gpt4-x-alpaca working by removing the gpt-x-alpaca-13b-native-4bit-128g.pt file from the directory and keeping only the one cuda .pt file.
Edit: While it doesn't spit out gibberish, it often completely misunderstands the prompt and replies with nonsense answers.
This seems to have fixed the derailing at the end of generations: https://github.com/oobabooga/text-generation-webui/commit/a3085dba073fe8bdcfb5120729a84560f5d024c3
The question is why setting this manually is necessary. It could be:
- A bug in the transformers library
- A bug in the converted tokenizer files
- Me using the transformers library incorrectly
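For reference, this is roughly what passing the EOS id explicitly to generate() looks like with plain transformers. It is only a sketch of the general pattern with placeholder paths and prompt, not the exact code from the commit; device_map="auto" also assumes accelerate is installed.

# Sketch: pass eos_token_id (and pad_token_id) explicitly to generate(),
# instead of relying on the values baked into the converted tokenizer/config files.
from transformers import AutoModelForCausalLM, LlamaTokenizer

MODEL_DIR = "models/gpt4-x-alpaca"   # assumption: local model folder

tokenizer = LlamaTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, device_map="auto")

prompt = "### Instruction:\nWrite a haiku about tokenizers.\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=200,
    eos_token_id=2,   # LLaMA's </s>; the broken tokenizer files report 0 instead
    pad_token_id=2,   # avoid the missing pad_token_id warning in open-ended generation
)
print(tokenizer.decode(output[0], skip_special_tokens=True))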
Just pulled this change, and gpt4-x-alpaca seems to be working much better now. No gibberish, and it's coherent and actually listens to the prompt.
Sadly the issue persists with gpt4-x-alpaca-13b-native-4bit-128g
in the https://github.com/oobabooga/text-generation-webui/commit/a3085dba073fe8bdcfb5120729a84560f5d024c3 commit
I have reconverted llama-7b and compared the resulting tokenizer files to the ones in Safe-LLaMA-HF-v2 (4-04-23) by @USBHost. Most files are identical, except for two: special_tokens_map.json and tokenizer_config.json.
Here is a comparison between the two conversions:
>>> from transformers import LlamaTokenizer
# My conversion
>>> tokenizer = LlamaTokenizer.from_pretrained('/tmp/converted/', clean_up_tokenization_spaces=True)
>>> print(tokenizer.eos_token_id)
2
# USBHost
>>> tokenizer = LlamaTokenizer.from_pretrained('/tmp/Safe-LLaMA-HF-v2 (4-04-23)/llama-7b', clean_up_tokenization_spaces=True)
>>> print(tokenizer.eos_token_id)
0
And here are the contents of the files:
special_tokens_map.json
Mine:
{
"bos_token": {
"content": "<s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "</s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
}
}
USBHost:
{}
tokenizer_config.json
Mine:
{
"add_bos_token": true,
"add_eos_token": false,
"bos_token": {
"__type": "AddedToken",
"content": "<s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"clean_up_tokenization_spaces": false,
"eos_token": {
"__type": "AddedToken",
"content": "</s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"model_max_length": 1000000000000000019884624838656,
"pad_token": null,
"sp_model_kwargs": {},
"tokenizer_class": "LlamaTokenizer",
"unk_token": {
"__type": "AddedToken",
"content": "<unk>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
}
}
USBHost:
{"bos_token": "", "eos_token": "", "model_max_length": 1000000000000000019884624838656, "tokenizer_class": "LlamaTokenizer", "unk_token": ""}
Cc @USBHost @Ph0rk0z
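For anyone who would rather patch a model folder in place than re-download it, here is a minimal sketch that writes the two corrected files shown above; the model directory is a placeholder assumption.

# Sketch: write the special-token definitions shown above into a model folder,
# which is what "replace special_tokens_map.json and tokenizer_config.json" amounts to.
import json
import os

MODEL_DIR = "models/llama-7b"   # assumption: whichever converted model folder you are fixing


def added_token(content):
    # Token entry in the format produced by the new conversion script
    return {"content": content, "lstrip": False, "normalized": True,
            "rstrip": False, "single_word": False}


special_tokens_map = {
    "bos_token": added_token("<s>"),
    "eos_token": added_token("</s>"),
    "unk_token": added_token("<unk>"),
}

tokenizer_config = {
    "add_bos_token": True,
    "add_eos_token": False,
    "bos_token": {"__type": "AddedToken", **added_token("<s>")},
    "clean_up_tokenization_spaces": False,
    "eos_token": {"__type": "AddedToken", **added_token("</s>")},
    "model_max_length": 1000000000000000019884624838656,
    "pad_token": None,
    "sp_model_kwargs": {},
    "tokenizer_class": "LlamaTokenizer",
    "unk_token": {"__type": "AddedToken", **added_token("<unk>")},
}

with open(os.path.join(MODEL_DIR, "special_tokens_map.json"), "w") as f:
    json.dump(special_tokens_map, f, indent=2)
with open(os.path.join(MODEL_DIR, "tokenizer_config.json"), "w") as f:
    json.dump(tokenizer_config, f, indent=2)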
Sadly the issue persists with
gpt4-x-alpaca-13b-native-4bit-128g
in the a3085db commit
Have you tried removing the non-cuda .pt file out of the directory and only having the cuda version? That solved the gibberish for me.
Oh my.. I thought I did, but I deleted the cuda .pt instead. Well, I'm downloading it again. Thank you.
@oobabooga I had the same issue (generating random text after finishing the prompt) using decapoda-research/llama-7b-hf with mmosiolek/polpaca-lora-7b on a 3080 Ti. I assumed it was an issue with the LoRA, but it stopped happening with your special_tokens_map.json and tokenizer_config.json from an hour ago. I also tried it just now with the tloen/alpaca-lora-7b LoRA: same issue with the original JSONs, and yours fix it.
The patch seems to be working. Thank you so much. I'm getting sense out of gpt4-x-alpaca.. woot
So... tl;dr: the new transformers breaks quants, and the patch is to change the contents of special_tokens_map.json and tokenizer_config.json to match ooba's content here: https://github.com/oobabooga/text-generation-webui/issues/931#issuecomment-1501259027 ?
You may need to re-download and overwrite the tokenizer files. The transformers library changed the format in a way that requires reconversion.
https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#option-1-pre-converted-weights
I used them from here https://huggingface.co/chavinlo/gpt4-x-alpaca and everything works fine.
So I guess just paste in those 2 files over all my tokenizers and call it a day :)
~~... I'm still getting gibberish~~
I got it by:
- downloading the model from https://huggingface.co/Neko-Institute-of-Science/LLaMA-65B-HF/tree/main
- replacing special_tokens_map.json and tokenizer_config.json with the ones here: https://huggingface.co/chavinlo/gpt4-x-alpaca
I replaced all the tokenizers on my llama models, including alpaca-native and it all seems to be working now. No gibberish or too much hallucination.. at least in chat mode.
Relevant: https://huggingface.co/hf-internal-testing/llama-tokenizer/tree/main Some stuff changed over there in the last few days.
There is also a LlamaTokenizerFast now (no idea what it does) https://huggingface.co/docs/transformers/main/model_doc/llama#transformers.LlamaTokenizerFast
OK, so I'm trying to gather all the info I can about this gibberish issue, as it appears to persist for me regardless of tokenizer config, as per this comment in #1029, with @CryptoRUSHGav mentioning in a follow-up comment that using the triton branch of GPTQ resolved the problem for him. While this seems like a possible workaround, this comment on #734 seems to indicate that triton is much slower than CUDA, so I don't consider that to be a good solution (I'm also too lazy to install WSL, though I will do so at some point to do exhaustive testing).
I am testing on W10, EPYC 7542 x GV100+GTX1080Ti, 64f5c90, fresh install w/ install.bat
The following models perform as expected:
- anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g
- anon8231489123/vicuna-13b-GPTQ-4bit-128g
- MetaIX/Alpaca-30B-Int4
- elinas/llama-30b-int4
while these return gibberish or blank gens, which leads me to believe there's an issue specifically with the Neko model quantization (are these triton-only? should I swap branches in GPTQ? guidance/thoughts appreciated):
- Neko-Institute-of-Science/LLaMA-65B-4bit-128g (gibberish)
- Neko-Institute-of-Science/LLaMA-30B-4bit-128g (blank)
If anyone has a known-good 65B/30B LLaMA for use with the latest commit, please point me in the right direction; otherwise I will check out the following as time permits, and once I get some extra RAM I will do the conversions myself if I can't get a pre-converted one running by then. Cheers folks!
- hayooucom (download error)
- maderix (fail, has a size mismatch torch.Size([22016, 1]) vs torch.Size([1, 22016]), same as referenced in #668; will use the torrent from that issue next, as the other repos I linked here are of a similar vintage)
- TianXxx
- kuleshov
@thot-experiment well currently ooba is broken for whatever reason.
while these return gibberish or blank gens
This seems to be more and more of a weird issue. I used to get this exact thing: on one load I would get gibberish... a few loads later I would get blank generations. Then I'd reload and get gibberish again.
The only way I fixed this issue on my end was to nuke everything. I only kept the model folder.
Also, in my tests Neko's models work on both triton and CUDA from qwopqwop200.
The only way I fix this issue on my end was to nuke everything. I only kept the model folder.
I have done this and the issue persists with the Neko models. Some other models work fine (as listed here) on the current commit, so I wouldn't quite characterize ooba as "broken", but there's definitely something going on. I would be interested in a known working commit to roll back to if anyone knows of one (either of ooba or GPTQ).
So does
Also, in my tests Neko's models work on both triton and CUDA from qwopqwop200.
mean that your previous comment is no longer the case? What is your system/commit?