stanford_alpaca

[PAD] never used

Open timonziegenbein opened this issue 2 years ago • 2 comments

Hi, thank you for the nice repo!

While looking at train.py, I noticed that [PAD] is added as a special token to the tokenizer and to the model embeddings. The collator pads the data with tokenizer.pad_token_id. However, tokenizer.pad_token_id yields 0, and converting that id back to a token returns <unk> rather than [PAD]. Is this the intended behavior or a bug?
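A minimal sketch of the round-trip check described above. It relies only on the attributes `pad_token` and `pad_token_id` and the methods `convert_ids_to_tokens` and `convert_tokens_to_ids`, which Hugging Face tokenizers expose. `ToyTokenizer`, its vocabulary, and the id 32000 are hypothetical stand-ins that only mimic the reported symptom, so the snippet runs without downloading a model. They say nothing about the actual cause in train.py.

```python
class ToyTokenizer:
    """Hypothetical stand-in, NOT the real LLaMA tokenizer."""

    def __init__(self):
        self.pad_token = "[PAD]"
        # Assumed toy vocabulary; the ids are illustrative only.
        self._token_to_id = {"<unk>": 0, "<s>": 1, "</s>": 2, "[PAD]": 32000}
        self._id_to_token = {i: t for t, i in self._token_to_id.items()}
        # Assumption: reproduce the reported symptom (pad id resolves to 0).
        self.pad_token_id = 0

    def convert_ids_to_tokens(self, idx):
        return self._id_to_token[idx]

    def convert_tokens_to_ids(self, token):
        # Unknown tokens fall back to the <unk> id.
        return self._token_to_id.get(token, 0)


def check_pad_roundtrip(tokenizer):
    """Report whether pad_token and pad_token_id map onto each other."""
    pad_id = tokenizer.pad_token_id
    token_from_id = tokenizer.convert_ids_to_tokens(pad_id)
    id_from_token = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
    return {
        "pad_token": tokenizer.pad_token,
        "pad_token_id": pad_id,
        "token_from_id": token_from_id,
        "id_from_token": id_from_token,
        "consistent": token_from_id == tokenizer.pad_token
        and id_from_token == pad_id,
    }


report = check_pad_roundtrip(ToyTokenizer())
print(report)
```

Passing the tokenizer built in train.py to `check_pad_roundtrip` instead of the toy object should show whether the same mismatch appears there.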

timonziegenbein, Mar 20 '23 15:03

@timongurcke I found that batch decoding leads to an error; I think it is caused by 'pad'.

outputs = model.generate(**batch, max_new_tokens=training_args.gen_length, do_sample=False, num_return_sequences=1)

Syno8, Mar 24 '23 11:03

@Syno8 maybe you should add tokenizer.padding_side='left'
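To illustrate why the padding side can matter for batched generation, here is a rough pure-Python sketch. A decoder-only model continues from the last position of each row, so with right padding the shorter prompts end in pad tokens, while left padding keeps every prompt flush against the position where generation starts. `pad_batch` and the token ids are hypothetical and are not part of train.py or transformers.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    sequences: Sequence[Sequence[int]], pad_id: int, side: str = "left"
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad token-id sequences to equal length; return (input_ids, attention_mask)."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads, zeros, ones = [pad_id] * n_pad, [0] * n_pad, [1] * len(seq)
        if side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append(zeros + ones)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(ones + zeros)
    return input_ids, attention_mask


# Made-up token ids for two prompts of different length.
prompts = [[5, 6, 7], [8, 9]]
left_ids, left_mask = pad_batch(prompts, pad_id=0, side="left")
right_ids, right_mask = pad_batch(prompts, pad_id=0, side="right")

print(left_ids)   # every row ends with a real prompt token
print(right_ids)  # the shorter row ends with the pad id
```

With a real tokenizer, setting tokenizer.padding_side = 'left' before tokenizing the batch should produce the left-padded layout.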

renke999, Apr 15 '23 15:04