alignment-handbook
Minor question about PAD token and EOS token.
Hello,
Thank you for sharing this awesome resource!
I have a question regarding models that already ship with a chat template, such as "mistralai/Mistral-7B-Instruct-v0.1". I'm planning to use the non-packed dataset. As suggested, I applied the chat template that comes with the tokenizer as a preprocessing step. If I decode the samples inside the SFTTrainer after tokenization, they start with two BOS tokens. This is because the tokenizer adds a special token on top of the one already inserted by the chat template (a BOS token in this case, since it is enabled in the tokenizer config). To fix this, I need to pass dataset_kwargs={"add_special_tokens": False} to the SFTTrainer.
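
For reference, a minimal sketch of that setup (assuming trl's SFTTrainer; the dataset name and column names below are placeholders, and the exact place where dataset_kwargs is accepted may differ between trl versions):

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import SFTTrainer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def apply_template(example):
    # Preprocessing step: the chat template already inserts the BOS token.
    example["text"] = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return example

train_dataset = load_dataset("my_org/my_sft_dataset", split="train")  # placeholder dataset
train_dataset = train_dataset.map(apply_template)

trainer = SFTTrainer(
    model=model_id,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    packing=False,                                 # non-packed dataset
    max_seq_length=2048,
    dataset_kwargs={"add_special_tokens": False},  # avoid a second BOS token
)
```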
Another issue I'm having is that when the pad token is the same as the EOS token, the EOS token's label is set to -100 (the collator masks every pad-token id, which also hits the real EOS). This might cause the model to keep generating and never stop, right? I'm seeing this behavior with my models fine-tuned on my own dataset using the SFT code provided. One workaround would be to write my own data collator that takes this into account instead of using DataCollatorForLanguageModeling. I also found a related issue on the matter here.
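
A minimal sketch of such a collator (a hypothetical subclass of my own, not something from the library), assuming the batch carries an attention_mask so real tokens can be told apart from padding:

```python
from transformers import DataCollatorForLanguageModeling

class KeepEOSCollator(DataCollatorForLanguageModeling):
    """Causal-LM collator that only masks true padding positions to -100, so EOS
    tokens that belong to the sequence keep their labels even when the pad token
    is the same as the EOS token."""

    def torch_call(self, examples):
        batch = super().torch_call(examples)
        if "attention_mask" in batch:
            real = batch["attention_mask"].bool()
            # The parent class set -100 for every pad-token id, which also hits the
            # real EOS tokens; restore labels for all non-padding positions.
            batch["labels"][real] = batch["input_ids"][real]
            batch["labels"][~real] = -100
        return batch

# Usage sketch: pass it to the trainer in place of the default collator.
# collator = KeepEOSCollator(tokenizer=tokenizer, mlm=False)
# trainer = SFTTrainer(..., data_collator=collator)
```

Passing an instance via the SFTTrainer's data_collator argument would replace the default DataCollatorForLanguageModeling.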
Any comments and guidance are very much appreciated!