text-generation-webui
LLaMA encoder (decoder?) is confused by " 's" (space, then single quote, then the character "s")
tldr:

```
In [1]: decode(encode("A 's")[0]) == "A 's"
Out[1]: False

In [2]: decode(encode("A 'u")[0]) == "A 'u"
Out[2]: True
```
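Here is a minimal sketch to check whether the mismatch reproduces outside the webui, directly against the `transformers` tokenizer. It assumes a recent `transformers` with `LlamaTokenizer`; the tokenizer path below is a placeholder, not a real one from this repo:

```python
from transformers import LlamaTokenizer

# Placeholder path; point this at whichever LLaMA checkpoint/tokenizer you use.
tokenizer = LlamaTokenizer.from_pretrained("models/llama-30b")

for text in ("A 's", "A 'u"):
    ids = tokenizer.encode(text, add_special_tokens=False)
    roundtrip = tokenizer.decode(ids)
    print(f"{text!r} -> {ids} -> {roundtrip!r} (round-trips: {roundtrip == text})")
```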
I discovered this issue while trying to figure out why my output always has one character cut off at the beginning.
In my prompt I have a sentence like `Agent will never use words like 'super bad word'`, which ends up being decoded as `Agent will never use words like'super bad word'`.
It looks like any sequence of space + single quote + the character `s` ends up having the space stripped:

`A 'u` => `A 'u`

but

`A 's` => `A's`

To be honest, I am pretty confused, since both seem to produce 3-token sequences, with `A 'u` not generating a space token (or a space-then-quote token) and yet displaying a space:
```
319   => |A|
525   => |'|
29879 => |s|

319   => |A|
525   => |'|
29884 => |u|
```
Tested this with 8-bit on with 65B, and with 8-bit both on and off with 30B.
P.S. I suspect the issue is actually somewhere in the transformers implementation, but this is the most likely place people will find it, so we can start from here.
I am not sure whether there is a strong contract that the encode-decode round trip returns the same output, but the code definitely relies on such behavior. I made a temporary fix like this:
```python
if not (shared.args.chat or shared.args.cai_chat):
    # Strip the prompt by the length of its *decoded* form, not of the original string
    q_len = len(decode(input_ids[0]))
    reply = original_question + apply_extensions(reply[q_len:], "output")
```
in https://github.com/oobabooga/text-generation-webui/blob/main/modules/text_generation.py#L176
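A follow-up thought on the workaround: instead of stripping the prompt by the length of its decoded text, the newly generated tokens could be sliced off at the token level before decoding, so the prompt's encode/decode round trip never has to match the original string. A rough sketch only; `output_ids` is an assumed name, not necessarily what the webui uses (`decode` and `apply_extensions` are the helpers from the snippet above):

```python
# Slice generated tokens at the token level rather than the character level.
new_token_ids = output_ids[0][len(input_ids[0]):]  # tokens produced after the prompt
reply = original_question + apply_extensions(decode(new_token_ids), "output")
```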
Maybe this is the space problem from here: https://github.com/huggingface/transformers/pull/21955
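One thing that might be worth checking alongside that PR: `decode()` in `transformers` runs a cleanup pass by default (`clean_up_tokenization_spaces=True`) that collapses patterns such as `" 's"`, which would fit `'s` being affected while `'u` is not. A quick diagnostic, reusing the `tokenizer` object from the sketches above:

```python
ids = tokenizer.encode("A 's", add_special_tokens=False)
print(tokenizer.decode(ids))                                      # default: cleanup enabled
print(tokenizer.decode(ids, clean_up_tokenization_spaces=False))  # cleanup disabled
```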