Open-Assistant
Tokenizer's padding_side was not validated to be "right" in trainer_sft.py
from transformers import AutoTokenizer
AutoTokenizer.from_pretrained("OpenAssistant/llama2-13b-orca-8k-3319").padding_side
>> 'left'
AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-fp16")
>> 'left'
AutoTokenizer.from_pretrained("mosaicml/mpt-7b").padding_side
>> 'right'
AutoTokenizer.from_pretrained("huggyllama/llama-7b").padding_side
>> 'left'
AutoTokenizer.from_pretrained("OpenAssistant/llama-30b-sft-v8.2-2.4k-steps-system").padding_side
>> 'left'
Since llama tokenizers default to left padding, the supervised training DialogueDataCollator pads label_mask in the opposite direction from tokenizer.pad, which left-pads input_ids and attention_mask: the masks are right-padded before torch.stack(label_mask), so they no longer line up with the inputs.
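To make the mismatch concrete, here is a minimal sketch (my own illustration, not code from trainer_sft.py; the right-padding of the masks mimics what the collator does before torch.stack):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # padding_side == 'left'
tokenizer.pad_token = tokenizer.eos_token  # llama tokenizers ship without a pad token
batch = tokenizer(["short", "a much longer example sentence"])  # unpadded lists

# tokenizer.pad honors padding_side='left': pad tokens go on the LEFT of each row.
padded = tokenizer.pad(batch, padding=True, return_tensors="pt")

# The collator builds one label mask per example and right-pads before stacking,
# so the masks align from the left while input_ids align from the right.
masks = [torch.ones(len(ids), dtype=torch.bool) for ids in batch["input_ids"]]
max_len = max(m.numel() for m in masks)
label_masks = torch.stack(
    [torch.cat([m, torch.zeros(max_len - m.numel(), dtype=torch.bool)]) for m in masks]
)

print(padded["input_ids"][0])  # pad ids at the START of the shorter row
print(label_masks[0])          # True at the start, False padding at the END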
Printing out the dataloader batches in trainer_sft.py also confirms the issue:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train, collate_fn=train_collate_fn, batch_size=9, shuffle=True)
for batch in train_dataloader:
    for idx, question in enumerate(batch['input_ids']):
        print('-------')
        # decode only the tokens selected by the label mask; with the padding
        # mismatch this prints misaligned, garbled text
        print(tokenizer.decode(question[batch['label_masks'][idx]]).replace('</s>', '') + '\n')
I think padding_side is never set to 'right' anywhere in the trainer_sft.py pipeline, so the llama models we have trained with the default left padding are likely a bit faulty.
An easy fix would be setting padding_side = 'right' in DialogueDataCollator's __post_init__ function:
@dataclass
class DialogueDataCollator:
    ...
    def __post_init__(self):
        assert self.tokenizer.eos_token
        self.tokenizer.padding_side = 'right'
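With that in place, a quick sanity check (again just an illustrative sketch, not part of the pipeline) shows the inputs padding in the same direction as the stacked masks:

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.pad_token = tokenizer.eos_token  # llama tokenizers ship without a pad token
tokenizer.padding_side = 'right'
padded = tokenizer.pad(tokenizer(["short", "a much longer example"]), padding=True, return_tensors="pt")
print(padded["input_ids"][0])  # pad ids now at the END of the row, matching the right-padded label_masks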