text-generation-webui
LLaMA encoder (decoder?) is confused by " 's" (space, then single quote, then the character "s")
tldr:

```
In [1]: decode(encode("A 's")[0]) == "A 's"
Out[1]: False

In [2]: decode(encode("A 'u")[0]) == "A 'u"
Out[2]: True
```
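Here is a minimal sketch to check whether the mismatch reproduces outside the webui, directly against the `transformers` tokenizer. It assumes a recent `transformers` with `LlamaTokenizer`; the tokenizer path below is a placeholder, not a real one from this repo:

```python
from transformers import LlamaTokenizer

# Placeholder path; point this at whichever LLaMA checkpoint/tokenizer you use.
tokenizer = LlamaTokenizer.from_pretrained("models/llama-30b")

for text in ("A 's", "A 'u"):
    ids = tokenizer.encode(text, add_special_tokens=False)
    roundtrip = tokenizer.decode(ids)
    print(f"{text!r} -> {ids} -> {roundtrip!r} (round-trips: {roundtrip == text})")
```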
I discovered this issue while trying to figure out why my output always has one character cut off at the beginning.
In my prompt I have a sentence like `Agent will never use words like 'super bad word'`, which ends up being decoded as `Agent will never use words like'super bad word'`.
It looks like any sequence of space + single quote + the character `s` ends up having the space stripped:

`A 'u` => `A 'u`

but

`A 's` => `A's`

To be honest, I am pretty confused, since both seem to produce 3-token sequences, with `A 'u` not generating a space token (or a space-then-quote token) and yet displaying a space:
```
319   => |A|
525   => |'|
29879 => |s|

319   => |A|
525   => |'|
29884 => |u|
```
Tested this with 8-bit on with 65B, and with 8-bit both on and off with 30B.
P.S. I suspect the issue is actually somewhere in the transformers implementation, but this is the most likely place people will find it, so we can start from here.
I am not sure whether there is a strong contract that the encode-decode round trip returns the same output, but the code definitely relies on such behavior. I made a temporary fix like this:
```python
if not (shared.args.chat or shared.args.cai_chat):
    # Strip the prompt by the length of its *decoded* form, not of the original string
    q_len = len(decode(input_ids[0]))
    reply = original_question + apply_extensions(reply[q_len:], "output")
```
in https://github.com/oobabooga/text-generation-webui/blob/main/modules/text_generation.py#L176
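A follow-up thought on the workaround: instead of stripping the prompt by the length of its decoded text, the newly generated tokens could be sliced off at the token level before decoding, so the prompt's encode/decode round trip never has to match the original string. A rough sketch only; `output_ids` is an assumed name, not necessarily what the webui uses (`decode` and `apply_extensions` are the helpers from the snippet above):

```python
# Slice generated tokens at the token level rather than the character level.
new_token_ids = output_ids[0][len(input_ids[0]):]  # tokens produced after the prompt
reply = original_question + apply_extensions(decode(new_token_ids), "output")
```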
Maybe this is the space problem from here: https://github.com/huggingface/transformers/pull/21955
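One thing that might be worth checking alongside that PR: `decode()` in `transformers` runs a cleanup pass by default (`clean_up_tokenization_spaces=True`) that collapses patterns such as `" 's"`, which would fit `'s` being affected while `'u` is not. A quick diagnostic, reusing the `tokenizer` object from the sketches above:

```python
ids = tokenizer.encode("A 's", add_special_tokens=False)
print(tokenizer.decode(ids))                                      # default: cleanup enabled
print(tokenizer.decode(ids, clean_up_tokenization_spaces=False))  # cleanup disabled
```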