text-generation-webui
Generation API responses incorrectly cut the reply on applying extensions
### Describe the bug
The API running in notebook/default mode cuts off, or adds one letter to, the reply after the extensions are applied to the output.
The bug is more noticeable on longer prompts. Slicing the reply with `reply[len(question)+1:]` before applying the extensions seems to resolve the issue for shorter prompts, but for longer prompts it starts cutting even more.
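For illustration, here is a minimal toy sketch (plain strings, not the actual webui code) of how a one-character mismatch between the original prompt and the re-decoded output throws that slice off by one:

```python
# Toy illustration (not the actual webui code): the decoded output reproduces
# the prompt with one extra leading space, so slicing by the length of the
# original question lands one character short and a stray ':' from the prompt
# leaks into the reply.
question = "Mikoto:"                      # prompt as it was sent
decoded = " Mikoto: It doesn't matter."   # full decoded output, note the extra leading space

print(decoded[len(question):])      # ": It doesn't matter."  <- stray ':' from the prompt
print(decoded[len(question) + 1:])  # " It doesn't matter."   <- the +1 only compensates sometimes
```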
### Is there an existing issue for this?
- [X] I have searched the existing issues
### Reproduction
- Add print statements before and after applying the extensions to the reply in the `generate_reply` function in `text_generation.py`.
- Send a generation request to /api/textgen and/or the Kobold API wrapper extension (an example request is shown below).
- Observe the console logs. In my particular case, an extra `:` from the prompt was added.
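For reference, a request along these lines can be sent with a short script like the one below; the endpoint URL and port are assumptions based on the KoboldAI-compatible API and may differ from your setup:

```python
# Hypothetical reproduction script: POSTs the same kind of payload as shown in
# the logs below. The URL assumes a KoboldAI-compatible endpoint on its default
# host/port; adjust it to match your own setup.
import requests

payload = {
    "prompt": "Cohee: *Smiles* ...\nMikoto: ",  # shortened version of the prompt in the logs
    "temperature": 0.65,
    "rep_pen": 1.10,
}

resp = requests.post("http://127.0.0.1:5000/api/v1/generate", json=payload)
resp.raise_for_status()
print(resp.json())
```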
### Screenshot
No response
### Logs
API request (to KAI extension):
```json
{
"prompt": "Cohee: *Smiles* Yeah, maybe I am naive, but I'm trying to overcome myself. Anyway, what are you displeased? Definitely not from me being around. Tell me if I could be a help.\nMikoto: ",
"temperature": 0.65,
"rep_pen": 1.10
}
```
text_generation.py before extensions:
```
Cohee: *Smiles* Yeah, maybe I am naive, but I'm trying to overcome myself. Anyway, what are you displeased? Definitely not from me being around. Tell me if I could be a help.
Mikoto: It doesn't matter, really. I just... I don't like when people talk about things they don't know about.
Cohee: I understand your concern and I'll do my best not to make you uncomfortable with the things I say. But I want you to understand that I won't judge you by just one little thing. And I hope you can forgive me for my ignorance.
Cohee: You're pretty quiet today. What's wrong?
Mikoto: Nothing. Just thinking.
Cohee: Do we have any plans this evening?
Mikoto: Nope.
Cohee: Well, how about dinner at home?
Mikoto: That sounds good.
Cohee: Perfect, I will go and get some groceries now. Would you prefer anything in particular?
Mikoto: I'd love some sushi!
```
text_generation.py after extensions:
```
Cohee: *Smiles* Yeah, maybe I am naive, but I'm trying to overcome myself. Anyway, what are you displeased? Definitely not from me being around. Tell me if I could be a help.
Mikoto:: It doesn't matter, really. I just... I don't like when people talk about things they don't know about.
Cohee: I understand your concern and I'll do my best not to make you uncomfortable with the things I say. But I want you to understand that I won't judge you by just one little thing. And I hope you can forgive me for my ignorance.
Cohee: You're pretty quiet today. What's wrong?
Mikoto: Nothing. Just thinking.
Cohee: Do we have any plans this evening?
Mikoto: Nope.
Cohee: Well, how about dinner at home?
Mikoto: That sounds good.
Cohee: Perfect, I will go and get some groceries now. Would you prefer anything in particular?
Mikoto: I'd love some sushi!
```
### System Info
```shell
OS: Windows 11
GPU: Nvidia RTX 3090
Model: LLaMa 13b in 4-bit mode
```
Actually, never mind the extensions part. I took a closer look at the console logs and noticed that it may be related to the LLaMA tokenizer producing an extra leading space when decoding the reply.
Screenshots below:
https://huggingface.co/docs/transformers/main/model_doc/llama#:~:text=The%20LLaMA%20tokenizer,the%20tokenizer%20configuration.
Looks like that's the case.
`decode_with_prefix_space` exists in the code but supposedly does nothing right now. A dumb quick fix would be to pad the `original_question` string with a leading space (when one is not already the first character and the model is LLaMA) until LlamaTokenizer is fixed.
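A rough sketch of that workaround (hypothetical helper, not the actual webui code; `is_llama` and `original_question` are placeholders for whatever the real code uses):

```python
# Hypothetical sketch of the "pad with a leading space" workaround described
# above. Names here (is_llama, original_question) are placeholders, not the
# actual variables used in text_generation.py.
def pad_question_for_llama(original_question: str, is_llama: bool) -> str:
    # The LLaMA tokenizer re-adds a leading space when decoding, so make the
    # original question match it before computing reply[len(question):].
    if is_llama and not original_question.startswith(" "):
        return " " + original_question
    return original_question
```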
There is another issue that resembles this: if you set a large value for `max_new_tokens` such that `max_new_tokens` + (the length of your prompt) is greater than 2048, the beginning of the `original_question` will appear truncated in the reply, and `reply[len(question):]` will generate garbage output. There might be a way to solve both problems at once.
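A toy illustration of that failure mode (plain strings, no model involved): once the decoded output no longer starts with the full original prompt, slicing by the prompt's length removes the wrong characters.

```python
# Toy example (no model involved): when the context window forces the prompt to
# be truncated, the decoded output no longer starts with the full original
# prompt, so slicing by len(question) removes the wrong characters.
question = "A very long prompt that exceeds the context window. Mikoto:"
decoded = "the context window. Mikoto: It doesn't matter, really."  # front of the prompt truncated

reply = decoded[len(question):]
print(repr(reply))  # a garbled tail of the text (or an empty string), not the actual reply
```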
@oobabooga Maybe it's worth trying decoding only the newly generated tokens instead of the whole batch. Consider the following code:
```python
original_tokens_count = len(input_ids[0])
generated_tokens_count = len(output)
new_tokens = generated_tokens_count - original_tokens_count
truncated_reply = decode(output[-new_tokens:])
print('truncated reply', truncated_reply, sep='\n')
reply = truncated_reply
```
I'm giving it a try, only positive results so far.
Thanks a lot @SillyLossy, that indeed seems to fix it. I have incorporated the changes here: https://github.com/oobabooga/text-generation-webui/commit/de6a09dc7f7d5a5d8496cfa1598abb4ff5ee1338
Another detail that I noticed in my tests is that prompts ending in a space tend to have another space as the next generated token. For instance, `1 2 3 4 5 6` will usually autocomplete to `1 2 3 4 5 6 7 8`, while `1 2 3 4 5 6 ` (with a space after the 6) will autocomplete to `1 2 3 4 5 6 7 8 9`. I'm not sure if that is a bug or not.
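A quick way to see what the tokenizer does with that trailing space is a sketch like the one below; the tokenizer path is a placeholder and assumes a locally available LLaMA tokenizer:

```python
# Sketch: compare how the tokenizer encodes a prompt with and without a
# trailing space. The path is a placeholder; point it at whatever LLaMA
# tokenizer you have locally.
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("models/llama-13b")  # placeholder path

for prompt in ["1 2 3 4 5 6", "1 2 3 4 5 6 "]:
    ids = tokenizer(prompt).input_ids
    print(repr(prompt), ids)
# If the trailing space ends up encoded as (part of) its own token, the model
# is being conditioned on a slightly different sequence, which could explain
# the different completions.
```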
I'd been dealing with this same bug (thought it was my own off-by-one, was driving me up the wall) but it seems to be better now. I can repro the number situation, not sure if that's a bug though.
> Thanks a lot @SillyLossy, that indeed seems to fix it. I have incorporated the changes here: https://github.com/oobabooga/text-generation-webui/commit/de6a09dc7f7d5a5d8496cfa1598abb4ff5ee1338
That indeed worked perfectly. I did some testing with LLaMA again and haven't had any cut replies since the update, on both longer and shorter contexts. You can mark this as resolved.
I'll push some additional changes to how messages are extracted from the prompt in chat mode in https://github.com/oobabooga/text-generation-webui/pull/515. If you notice something weird after the merge, please let me know.