text-generation-webui
Message token length issue with llama-4bit (7B)
Describe the bug
Almost every message is marked as 200 tokens, regardless of whether it is one word or multiple words/sentences.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
conda activate textgen
python3.10 server.py --gptq-bits 4 --model llama-7b-hf --cai-chat --no-stream
Screenshot
Logs
Loading llama-7b-hf...
Found models/llama-7b-4bit.pt
Loading model ...
Done.
Loaded the model in 2.42 seconds.
Loading the extension "gallery"... Ok.
/home/username/.local/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
warnings.warn(value)
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Output generated in 5.06 seconds (39.52 tokens/s, 200 tokens)
Output generated in 4.73 seconds (42.28 tokens/s, 200 tokens)
Output generated in 5.79 seconds (34.55 tokens/s, 200 tokens)
System Info
operating system: Linux
GPU brand: Nvidia
GPU model: GeForce RTX 3060 12GB
Thank you, I missed that. This should have solved it: https://github.com/oobabooga/text-generation-webui/commit/8c8e8b44508972a37fd15d760f9e4214e5105306
I noticed the same behavior with today's release (commit 49c10c5), which seems to be model-dependent: I get a huge speed increase and correct token sizes only when using the ozcur/alpaca-native-4bit model from Hugging Face. With llama-7b-4bit (without group size) and llama-7b-4bit-128g (with group size 128) from the Torrents, it still tends to reach the token limit even with short answers and takes longer to generate the output.
I have this error too, but I'm noticing that it's present only when I use the --no-stream arg. Have you tried without it?
This is an upstream issue in the transformers library
https://github.com/huggingface/transformers/issues/22436
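For anyone who wants to check on their side whether the model really emits ~200 new tokens or the count is just being misreported, one quick check is to re-tokenize the generated reply and count the tokens. A minimal sketch (the tokenizer path and the reply string are placeholders, assuming the llama-7b-hf tokenizer files are available locally):

```python
# Minimal sketch: count the tokens in a generated reply to see whether the
# model really produced ~200 new tokens or the number is only misreported.
# Assumes the llama-7b-hf tokenizer files live in models/llama-7b-hf.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/llama-7b-hf")

reply = "Hello!"  # paste the actual chat reply here
token_ids = tokenizer(reply, add_special_tokens=False)["input_ids"]
print(f"{len(token_ids)} tokens in the reply")
```

If a one-word reply comes back as only a handful of tokens here while the console still reports 200, the problem is in how the count is reported rather than in the generation itself.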
Can confirm that it only happens with the --no-stream arg for me as well.
When using the openai extension, look in completions.py and fix the hardcoded value if needed:
generate_params['max_new_tokens'] = 200
It would probably be better to create a separate issue about reading the value from settings.yaml instead.
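For illustration, one possible shape for that change, purely as a sketch (I'm assuming the key name in settings.yaml and that PyYAML is available; the extension's real wiring may differ):

```python
# Sketch: read the default from settings.yaml instead of hardcoding 200.
# Assumes settings.yaml sits in the webui root and may contain a
# max_new_tokens key; falls back to 200 if the file or key is missing.
from pathlib import Path
import yaml

def default_max_new_tokens(settings_path="settings.yaml", fallback=200):
    path = Path(settings_path)
    if not path.exists():
        return fallback
    settings = yaml.safe_load(path.read_text()) or {}
    return int(settings.get("max_new_tokens", fallback))

# In completions.py, instead of the hardcoded assignment:
generate_params = {}
generate_params['max_new_tokens'] = default_max_new_tokens()
```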
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.