stanford_alpaca
[PAD] never used
Hi, thank you for the nice repo!
While looking at train.py, I noticed that [PAD] is added as a special token to the tokenizer and to the model embeddings. When padding the data in the collator, tokenizer.pad_token_id is used; however, tokenizer.pad_token_id yields id 0, which, when decoded and encoded again, comes back as <unk>. Is this the intended behavior or a bug?
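To make the reported roundtrip concrete, here is a toy sketch with a made-up vocabulary (not LLaMA's actual vocab): if the pad id in use is 0, i.e. the <unk> slot, rather than the id of the newly added [PAD] token, then decoding the pad and re-encoding it produces <unk>:

```python
# Hypothetical toy vocab; id 0 is the <unk> slot, [PAD] was appended last.
vocab = {"<unk>": 0, "hello": 1, "world": 2, "[PAD]": 3}
inv = {i: t for t, i in vocab.items()}

pad_token_id = 0                        # pad id falling back to <unk>'s slot
decoded = inv[pad_token_id]             # decodes to "<unk>", not "[PAD]"
reencoded = vocab.get(decoded, vocab["<unk>"])  # encodes back to id 0

print(decoded, reencoded)               # the roundtrip never reaches [PAD]
```

If tokenizer.pad_token_id instead pointed at the appended [PAD] id (3 in this sketch), the roundtrip would be lossless.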
@timongurcke I found that batch decoding leads to an error; I think it is caused by the pad token.
outputs = model.generate(**batch, max_new_tokens=training_args.gen_length, do_sample=False, num_return_sequences=1)
@Syno8 maybe you should set tokenizer.padding_side='left'
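A minimal illustration of why left padding matters for batched generation with a decoder-only model (made-up token ids, plain lists instead of tensors): generate continues from the last position of each row, so that position must hold real prompt text, not padding.

```python
PAD = 0  # hypothetical pad id for illustration
batch = [[5, 6, 7], [8, 9]]  # two prompts of unequal length

def pad(seqs, side):
    """Pad all sequences to the batch's max length on the given side."""
    width = max(len(s) for s in seqs)
    if side == "left":
        return [[PAD] * (width - len(s)) + s for s in seqs]
    return [s + [PAD] * (width - len(s)) for s in seqs]

right = pad(batch, "right")  # [[5, 6, 7], [8, 9, 0]] -- last token is PAD
left = pad(batch, "left")    # [[5, 6, 7], [0, 8, 9]] -- last token is real
```

With right padding, the shorter prompt ends in PAD, so the model would generate a continuation of the pad token; with left padding, every row ends in actual prompt tokens, which is why setting tokenizer.padding_side='left' before generate helps here.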