Inconsistent Tokenizer Padding Behavior in Unsloth

Open · JhonDan1999 opened this issue on Mar 27, 2024 · 4 comments

I've encountered an issue with inconsistent Gemma tokenizer padding in Unsloth.

As you can see in this code snippet:

[Screenshot: code comparing padding_side across the three tokenizers]

I have three tokenizers for the Gemma model. tokenizer and tokenizer_4bit (loaded with Hugging Face Transformers) use left padding, while tokenizer_FastLanguageModel (loaded with Unsloth) uses right padding.
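Since the screenshot is not visible here, a minimal sketch of what the comparison likely looked like (the exact model IDs and FastLanguageModel arguments are illustrative; the commented outputs reflect the behavior described above):

```python
from transformers import AutoTokenizer
from unsloth import FastLanguageModel

# Plain Hugging Face tokenizers for Gemma (model IDs are illustrative)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
tokenizer_4bit = AutoTokenizer.from_pretrained("unsloth/gemma-7b-bnb-4bit")

# Unsloth returns the model and its tokenizer together
model, tokenizer_FastLanguageModel = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-7b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

print(tokenizer.padding_side)                    # "left"
print(tokenizer_4bit.padding_side)               # "left"
print(tokenizer_FastLanguageModel.padding_side)  # "right" (set by Unsloth)
```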

I believe this inconsistency indicates a potential issue within the Unsloth library.

JhonDan1999 (Mar 27, 2024)

@JhonDan1999 The "right" padding side is used for training only; this is not a bug. For inference you must use "left" padding, or you'll get wrong results.
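A minimal sketch of that switch, reusing the tokenizer variable from the snippet above (padding_side is an ordinary attribute on Hugging Face tokenizers, so it can be flipped before generation):

```python
# Training: keep right padding, so the real tokens come first in each row and
# the trailing pad tokens can be masked out of the loss.

# Inference / generation: switch to left padding first, so every prompt ends at
# the last position and generate() continues from real tokens, not from padding.
tokenizer_FastLanguageModel.padding_side = "left"
```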

danielhanchen (Mar 28, 2024)

Thank you for the response, Daniel.

But I'm still confused. Could you please explain why the padding side differs between training ("right") and inference ("left")? The documentation doesn't seem to cover this, so additional context would be really helpful.

Another point I've noticed is that the LLM's behavior changes significantly depending on whether padding is included. When I don't include padding, the model generates nonsensical text, but when I add padding (even when it may not be necessary), the model generates the desired output. Do you have any insights into why the presence or absence of padding has such a drastic effect on the model's behavior?

JhonDan1999 (Mar 28, 2024)

@JhonDan1999 Yes, "left" padding is only used for generation. If you generate with "right" padding, you'll get gibberish.

This only matters for batched decoding, not single-sequence decoding.
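A sketch of why, reusing the illustrative variables from above: in a batch, the shorter prompts get padded, and with right padding the pad tokens come last, so the model is asked to continue from padding rather than from the prompt.

```python
prompts = ["Hello", "The quick brown fox jumps over the lazy"]

# Right padding: the short prompt becomes e.g. [<bos>, Hello, <pad>, <pad>, ...],
# so generation continues from <pad> tokens and tends to produce gibberish.
tokenizer.padding_side = "right"
bad_batch = tokenizer(prompts, padding=True, return_tensors="pt")

# Left padding: the short prompt becomes e.g. [<pad>, ..., <pad>, <bos>, Hello],
# so new tokens are appended right after the real text.
tokenizer.padding_side = "left"
batch = tokenizer(prompts, padding=True, return_tensors="pt").to(model.device)
outputs = model.generate(**batch, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

# A single prompt is never padded, so padding_side makes no difference there.
```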

danielhanchen (Mar 28, 2024)

I now auto-change the padding side to fix this issue.

danielhanchen (May 17, 2024)