
Are special tokens wrong?

Open gauss-clb opened this issue 1 year ago • 0 comments

In the LLaMA vocab, the eos_token is "</s>", the bos_token is "<s>", and the unk_token is "<unk>", with token ids 2, 1, and 0 respectively. So I think lines 214-221 in train.py should be removed, since those special tokens are already defined by the tokenizer.
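
To double-check this, one can probe a locally converted LLaMA checkpoint along these lines (the path is a placeholder, and the expected output in the comments is my understanding of the stock tokenizer, not output from this repo):

```python
# Minimal sketch: inspect which special tokens the LLaMA tokenizer already defines,
# to judge whether the additions in train.py are redundant.
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama")  # placeholder path

for name in ("bos_token", "eos_token", "unk_token", "pad_token"):
    token = getattr(tokenizer, name)
    token_id = tokenizer.convert_tokens_to_ids(token) if token is not None else None
    print(name, token, token_id)

# Expected (my assumption) for the stock LLaMA tokenizer:
#   bos_token <s> 1
#   eos_token </s> 2
#   unk_token <unk> 0
#   pad_token None None   (LLaMA ships without a pad token)
```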

And are the definitions of DEFAULT_BOS_TOKEN and DEFAULT_UNK_TOKEN in train.py wrong?
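
For reference, this is what I would expect those constants to be if they matched the LLaMA tokenizer; I'm not quoting the repo's current code, just stating my assumption of the correct values:

```python
# Expected values (my assumption, not a quote of train.py):
DEFAULT_EOS_TOKEN = "</s>"   # id 2 in the LLaMA vocab
DEFAULT_BOS_TOKEN = "<s>"    # id 1 in the LLaMA vocab
DEFAULT_UNK_TOKEN = "<unk>"  # id 0 in the LLaMA vocab
DEFAULT_PAD_TOKEN = "[PAD]"  # LLaMA has no pad token, so one has to be added
```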

And at line 151, should we add a space between example['output'] and tokenizer.eos_token when building the target string?
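
One way to settle this is to tokenize the target string both ways and compare the ids; a minimal sketch (placeholder path, toy example string of my own) could look like:

```python
# Compare tokenization with and without a space before the EOS token.
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama")  # placeholder path

without_space = "some output" + tokenizer.eos_token
with_space = "some output " + tokenizer.eos_token

print(tokenizer(without_space)["input_ids"])
print(tokenizer(with_space)["input_ids"])
# If the extra space produces an additional "▁" piece before </s>,
# it changes the training targets, so the two are not equivalent.
```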

gauss-clb · Apr 08 '23 15:04