text-generation-webui
Generation API responses incorrectly cut the reply on applying extensions
### Describe the bug
The API running in notebook/default mode cuts off, or adds one letter to, the reply after the extensions are applied to the output.
The bug is more noticeable on longer prompts. Slicing the reply with `reply[len(question)+1:]` before applying the extensions seems to resolve the issue for shorter prompts, but for longer prompts it starts cutting even more.
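For illustration, here is a minimal toy sketch (plain strings, not the actual webui code) of how a one-character mismatch between the original prompt and the re-decoded output throws that slice off by one:

```python
# Toy illustration (not the actual webui code): the decoded output reproduces
# the prompt with one extra leading space, so slicing by the length of the
# original question lands one character short and a stray ':' from the prompt
# leaks into the reply.
question = "Mikoto:"                      # prompt as it was sent
decoded = " Mikoto: It doesn't matter."   # full decoded output, note the extra leading space

print(decoded[len(question):])      # ": It doesn't matter."  <- stray ':' from the prompt
print(decoded[len(question) + 1:])  # " It doesn't matter."   <- the +1 only compensates sometimes
```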
### Is there an existing issue for this?
- [X] I have searched the existing issues
### Reproduction
- Add print statements before and after applying the extensions to the reply in the `generate_reply` function in `text_generation.py`.
- Send a generation request to /api/textgen and/or the Kobold API wrapper extension (an example request is shown below).
- Observe the console logs. In my particular case, an extra `:` from the prompt was added.
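For reference, a request along these lines can be sent with a short script like the one below; the endpoint URL and port are assumptions based on the KoboldAI-compatible API and may differ from your setup:

```python
# Hypothetical reproduction script: POSTs the same kind of payload as shown in
# the logs below. The URL assumes a KoboldAI-compatible endpoint on its default
# host/port; adjust it to match your own setup.
import requests

payload = {
    "prompt": "Cohee: *Smiles* ...\nMikoto: ",  # shortened version of the prompt in the logs
    "temperature": 0.65,
    "rep_pen": 1.10,
}

resp = requests.post("http://127.0.0.1:5000/api/v1/generate", json=payload)
resp.raise_for_status()
print(resp.json())
```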
### Screenshot
No response
### Logs
API request (to KAI extension):
```json
{
"prompt": "Cohee: *Smiles* Yeah, maybe I am naive, but I'm trying to overcome myself. Anyway, what are you displeased? Definitely not from me being around. Tell me if I could be a help.\nMikoto: ",
"temperature": 0.65,
"rep_pen": 1.10
}
```
text_generation.py before extensions:
```
Cohee: *Smiles* Yeah, maybe I am naive, but I'm trying to overcome myself. Anyway, what are you displeased? Definitely not from me being around. Tell me if I could be a help.
Mikoto: It doesn't matter, really. I just... I don't like when people talk about things they don't know about.
Cohee: I understand your concern and I'll do my best not to make you uncomfortable with the things I say. But I want you to understand that I won't judge you by just one little thing. And I hope you can forgive me for my ignorance.
Cohee: You're pretty quiet today. What's wrong?
Mikoto: Nothing. Just thinking.
Cohee: Do we have any plans this evening?
Mikoto: Nope.
Cohee: Well, how about dinner at home?
Mikoto: That sounds good.
Cohee: Perfect, I will go and get some groceries now. Would you prefer anything in particular?
Mikoto: I'd love some sushi!
```
text_generation.py after extensions:
```
Cohee: *Smiles* Yeah, maybe I am naive, but I'm trying to overcome myself. Anyway, what are you displeased? Definitely not from me being around. Tell me if I could be a help.
Mikoto:: It doesn't matter, really. I just... I don't like when people talk about things they don't know about.
Cohee: I understand your concern and I'll do my best not to make you uncomfortable with the things I say. But I want you to understand that I won't judge you by just one little thing. And I hope you can forgive me for my ignorance.
Cohee: You're pretty quiet today. What's wrong?
Mikoto: Nothing. Just thinking.
Cohee: Do we have any plans this evening?
Mikoto: Nope.
Cohee: Well, how about dinner at home?
Mikoto: That sounds good.
Cohee: Perfect, I will go and get some groceries now. Would you prefer anything in particular?
Mikoto: I'd love some sushi!
```
### System Info
```shell
OS: Windows 11
GPU: Nvidia RTX 3090
Model: LLaMa 13b in 4-bit mode
```
Actually, never mind the extensions part. I took a closer look at the console logs and noticed that it may be related to the LLaMA tokenizer producing an extra leading space when decoding the reply.
Screenshots below:
https://huggingface.co/docs/transformers/main/model_doc/llama#:~:text=The%20LLaMA%20tokenizer,the%20tokenizer%20configuration.
Looks like that's the case.
`decode_with_prefix_space` exists in the code but supposedly does nothing right now. A dumb quick fix would be to pad the `original_question` string with a leading space (when one is not already the first character and the model is LLaMA) until LlamaTokenizer is fixed.
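A rough sketch of that workaround (hypothetical helper, not the actual webui code; `is_llama` and `original_question` are placeholders for whatever the real code uses):

```python
# Hypothetical sketch of the "pad with a leading space" workaround described
# above. Names here (is_llama, original_question) are placeholders, not the
# actual variables used in text_generation.py.
def pad_question_for_llama(original_question: str, is_llama: bool) -> str:
    # The LLaMA tokenizer re-adds a leading space when decoding, so make the
    # original question match it before computing reply[len(question):].
    if is_llama and not original_question.startswith(" "):
        return " " + original_question
    return original_question
```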
There is another issue that resembles this: if you set a large value for `max_new_tokens` such that `max_new_tokens` + (the length of your prompt) is greater than 2048, the beginning of the `original_question` will appear truncated in the reply, and `reply[len(question):]` will generate garbage output. There might be a way to solve both problems at once.
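A toy illustration of that failure mode (plain strings, no model involved): once the decoded output no longer starts with the full original prompt, slicing by the prompt's length removes the wrong characters.

```python
# Toy example (no model involved): when the context window forces the prompt to
# be truncated, the decoded output no longer starts with the full original
# prompt, so slicing by len(question) removes the wrong characters.
question = "A very long prompt that exceeds the context window. Mikoto:"
decoded = "the context window. Mikoto: It doesn't matter, really."  # front of the prompt truncated

reply = decoded[len(question):]
print(repr(reply))  # a garbled tail of the text (or an empty string), not the actual reply
```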
@oobabooga Maybe it's worth trying decoding only the newly generated tokens instead of the whole batch. Consider the following code:
```python
original_tokens_count = len(input_ids[0])
generated_tokens_count = len(output)
new_tokens = generated_tokens_count - original_tokens_count
truncated_reply = decode(output[-new_tokens:])
print('truncated reply', truncated_reply, sep='\n')
reply = truncated_reply
```
I'm giving it a try, only positive results so far.
Thanks a lot @SillyLossy, that indeed seems to fix it. I have incorporated the changes here: https://github.com/oobabooga/text-generation-webui/commit/de6a09dc7f7d5a5d8496cfa1598abb4ff5ee1338
Another detail that I noticed in my tests is that prompts ending in a space tend to have another space as the next generated token. For instance, `1 2 3 4 5 6` will usually autocomplete to `1 2 3 4 5 6 7 8`, while `1 2 3 4 5 6 ` (with a space after the 6) will autocomplete to `1 2 3 4 5 6 7 8 9`. I'm not sure if that is a bug or not.
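A quick way to see what the tokenizer does with that trailing space is a sketch like the one below; the tokenizer path is a placeholder and assumes a locally available LLaMA tokenizer:

```python
# Sketch: compare how the tokenizer encodes a prompt with and without a
# trailing space. The path is a placeholder; point it at whatever LLaMA
# tokenizer you have locally.
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("models/llama-13b")  # placeholder path

for prompt in ["1 2 3 4 5 6", "1 2 3 4 5 6 "]:
    ids = tokenizer(prompt).input_ids
    print(repr(prompt), ids)
# If the trailing space ends up encoded as (part of) its own token, the model
# is being conditioned on a slightly different sequence, which could explain
# the different completions.
```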
I'd been dealing with this same bug (thought it was my own off-by-one, was driving me up the wall) but it seems to be better now. I can repro the number situation, not sure if that's a bug though.
> Thanks a lot @SillyLossy, that indeed seems to fix it. I have incorporated the changes here: https://github.com/oobabooga/text-generation-webui/commit/de6a09dc7f7d5a5d8496cfa1598abb4ff5ee1338
That indeed worked perfectly. I did some testing with LLaMA again and haven't had any cut replies since the update, on both longer and shorter contexts. You can mark this as resolved.
I'll push some additional changes to how messages are extracted from the prompt in chat mode in https://github.com/oobabooga/text-generation-webui/pull/515. If you notice something weird after the merge, please let me know.