
special tokens are not correctly set in tokenizer


I found that the special tokens are not correctly set in the tokenizer when I'm using decapoda-research/llama-7b-hf. Here is the code from train.py, lines 198-206:

special_tokens_dict = dict()
if tokenizer.pad_token is None:
    special_tokens_dict["pad_token"] = DEFAULT_PAD_TOKEN
if tokenizer.eos_token is None:
    special_tokens_dict["eos_token"] = DEFAULT_EOS_TOKEN
if tokenizer.bos_token is None:
    special_tokens_dict["bos_token"] = DEFAULT_BOS_TOKEN
if tokenizer.unk_token is None:
    special_tokens_dict["unk_token"] = DEFAULT_UNK_TOKEN

And the pretrained tokenizer config is as follows:

{"bos_token": "", "eos_token": "", "model_max_length": 1000000000000000019884624838656, "tokenizer_class": "LLaMATokenizer", "unk_token": ""}

Since bos_token/eos_token/unk_token are empty strings rather than None, the checks above skip them and the default tokens are never set.
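
One possible workaround is to treat "" the same as None when building special_tokens_dict. Below is a minimal sketch: fill_missing_special_tokens is a hypothetical helper (not part of the repo), and the DEFAULT_* values are assumed to match the constants defined near the top of train.py.

DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_UNK_TOKEN = "<unk>"


def fill_missing_special_tokens(tokenizer):
    """Return a dict of special tokens that are either None or empty strings."""

    def missing(token):
        # Treat both None and "" (as in the decapoda-research config) as "not set".
        return token is None or token == ""

    special_tokens_dict = {}
    if missing(tokenizer.pad_token):
        special_tokens_dict["pad_token"] = DEFAULT_PAD_TOKEN
    if missing(tokenizer.eos_token):
        special_tokens_dict["eos_token"] = DEFAULT_EOS_TOKEN
    if missing(tokenizer.bos_token):
        special_tokens_dict["bos_token"] = DEFAULT_BOS_TOKEN
    if missing(tokenizer.unk_token):
        special_tokens_dict["unk_token"] = DEFAULT_UNK_TOKEN
    return special_tokens_dict

The resulting dict can then be passed to smart_tokenizer_and_embedding_resize exactly as train.py already does, so the model's embedding matrix is resized along with the tokenizer.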

gongliym · Apr 27 '23

Same problem here, did you solve it?

minglii1998 · May 31 '23

@gongliym @minglii1998 have you solved this problem?

yxchng · Jun 25 '23