text-generation-webui
Message token length issue with llama-4bit (7B)
Describe the bug
Almost every message is marked as 200 tokens, regardless of whether it is one word or multiple words/sentences.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
conda activate textgen
python3.10 server.py --gptq-bits 4 --model llama-7b-hf --cai-chat --no-stream
Screenshot
Logs
Loading llama-7b-hf...
Found models/llama-7b-4bit.pt
Loading model ...
Done.
Loaded the model in 2.42 seconds.
Loading the extension "gallery"... Ok.
/home/username/.local/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
warnings.warn(value)
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Output generated in 5.06 seconds (39.52 tokens/s, 200 tokens)
Output generated in 4.73 seconds (42.28 tokens/s, 200 tokens)
Output generated in 5.79 seconds (34.55 tokens/s, 200 tokens)
System Info
operating system: Linux
GPU brand: Nvidia
GPU model: GeForce RTX 3060 12GB
Thank you, I missed that. This should have solved it: https://github.com/oobabooga/text-generation-webui/commit/8c8e8b44508972a37fd15d760f9e4214e5105306
I noticed the same behavior with today's release (commit 49c10c5), which seems to be model-dependent: I get a huge speed increase and correct token sizes only when using the ozcur/alpaca-native-4bit model from Hugging Face. With llama-7b-4bit (without group size) and llama-7b-4bit-128g (with group size 128) from the Torrents, it still tends to reach the token limit even with short answers and takes longer to generate the output.
I have this error too, but I'm noticing that it's present only when I use the --no-stream arg. Have you tried without it?
This is an upstream issue in the transformers library
https://github.com/huggingface/transformers/issues/22436
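For anyone who wants to check on their side whether the model really emits ~200 new tokens or the count is just being misreported, one quick check is to re-tokenize the generated reply and count the tokens. A minimal sketch (the tokenizer path and the reply string are placeholders, assuming the llama-7b-hf tokenizer files are available locally):

```python
# Minimal sketch: count the tokens in a generated reply to see whether the
# model really produced ~200 new tokens or the number is only misreported.
# Assumes the llama-7b-hf tokenizer files live in models/llama-7b-hf.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/llama-7b-hf")

reply = "Hello!"  # paste the actual chat reply here
token_ids = tokenizer(reply, add_special_tokens=False)["input_ids"]
print(f"{len(token_ids)} tokens in the reply")
```

If a one-word reply comes back as only a handful of tokens here while the console still reports 200, the problem is in how the count is reported rather than in the generation itself.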
Can confirm that it only happens with the --no-stream arg for me as well.
When using the openai extension, look in completions.py and fix the hardcoded value if needed:
generate_params['max_new_tokens'] = 200
It would probably be better to create a separate issue about reading the value from settings.yaml instead.
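For illustration, one possible shape for that change, purely as a sketch (I'm assuming the key name in settings.yaml and that PyYAML is available; the extension's real wiring may differ):

```python
# Sketch: read the default from settings.yaml instead of hardcoding 200.
# Assumes settings.yaml sits in the webui root and may contain a
# max_new_tokens key; falls back to 200 if the file or key is missing.
from pathlib import Path
import yaml

def default_max_new_tokens(settings_path="settings.yaml", fallback=200):
    path = Path(settings_path)
    if not path.exists():
        return fallback
    settings = yaml.safe_load(path.read_text()) or {}
    return int(settings.get("max_new_tokens", fallback))

# In completions.py, instead of the hardcoded assignment:
generate_params = {}
generate_params['max_new_tokens'] = default_max_new_tokens()
```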
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.