Inconsistent Tokenizer Padding Behavior in Unsloth

Open · JhonDan1999 opened this issue on Mar 27, 2024 · 4 comments

I've encountered an issue with inconsistent Gemma tokenizer padding in Unsloth.

As you can see in this code snippet:

[Screenshot: code comparing padding_side across the three tokenizers]

I have three tokenizers for the Gemma model. tokenizer and tokenizer_4bit (loaded with Hugging Face Transformers) use left padding, while tokenizer_FastLanguageModel (loaded with Unsloth) uses right padding.
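Since the screenshot is not visible here, a minimal sketch of what the comparison likely looked like (the exact model IDs and FastLanguageModel arguments are illustrative; the commented outputs reflect the behavior described above):

```python
from transformers import AutoTokenizer
from unsloth import FastLanguageModel

# Plain Hugging Face tokenizers for Gemma (model IDs are illustrative)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
tokenizer_4bit = AutoTokenizer.from_pretrained("unsloth/gemma-7b-bnb-4bit")

# Unsloth returns the model and its tokenizer together
model, tokenizer_FastLanguageModel = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-7b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

print(tokenizer.padding_side)                    # "left"
print(tokenizer_4bit.padding_side)               # "left"
print(tokenizer_FastLanguageModel.padding_side)  # "right" (set by Unsloth)
```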

I believe this inconsistency indicates a potential issue within the Unsloth library.

JhonDan1999 (Mar 27, 2024)

@JhonDan1999 The "right" padding side is used for training only; this is not a bug. For inference you must use "left" padding, or you'll get wrong results.
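A minimal sketch of that switch, reusing the tokenizer variable from the snippet above (padding_side is an ordinary attribute on Hugging Face tokenizers, so it can be flipped before generation):

```python
# Training: keep right padding, so the real tokens come first in each row and
# the trailing pad tokens can be masked out of the loss.

# Inference / generation: switch to left padding first, so every prompt ends at
# the last position and generate() continues from real tokens, not from padding.
tokenizer_FastLanguageModel.padding_side = "left"
```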

danielhanchen (Mar 28, 2024)

Thank you for the response, Daniel.

But I'm still confused. Could you please explain why the padding side differs between training ("right") and inference ("left")? The documentation doesn't seem to cover this, so additional context would be really helpful.

Another point I've noticed is that the LLM's behavior changes significantly depending on whether padding is included. When I don't include padding, the model generates nonsensical text, but when I add padding (even when it may not be necessary), the model generates the desired output. Do you have any insights into why the presence or absence of padding has such a drastic effect on the model's behavior?

JhonDan1999 (Mar 28, 2024)

@JhonDan1999 Yes, "left" padding is only used for generation. If you generate with "right" padding, you'll get gibberish.

This only matters for batched decoding, not single-sequence decoding.
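A sketch of why, reusing the illustrative variables from above: in a batch, the shorter prompts get padded, and with right padding the pad tokens come last, so the model is asked to continue from padding rather than from the prompt.

```python
prompts = ["Hello", "The quick brown fox jumps over the lazy"]

# Right padding: the short prompt becomes e.g. [<bos>, Hello, <pad>, <pad>, ...],
# so generation continues from <pad> tokens and tends to produce gibberish.
tokenizer.padding_side = "right"
bad_batch = tokenizer(prompts, padding=True, return_tensors="pt")

# Left padding: the short prompt becomes e.g. [<pad>, ..., <pad>, <bos>, Hello],
# so new tokens are appended right after the real text.
tokenizer.padding_side = "left"
batch = tokenizer(prompts, padding=True, return_tensors="pt").to(model.device)
outputs = model.generate(**batch, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

# A single prompt is never padded, so padding_side makes no difference there.
```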

danielhanchen (Mar 28, 2024)

I now auto-change the padding side to fix this issue.

danielhanchen (May 17, 2024)