stanford_alpaca
special tokens are not correctly set in tokenizer
I found that the special tokens are not correctly set in the tokenizer when I'm using decapoda-research/llama-7b-hf.
Here is the code from train.py, lines 198-206:
special_tokens_dict = dict()
if tokenizer.pad_token is None:
    special_tokens_dict["pad_token"] = DEFAULT_PAD_TOKEN
if tokenizer.eos_token is None:
    special_tokens_dict["eos_token"] = DEFAULT_EOS_TOKEN
if tokenizer.bos_token is None:
    special_tokens_dict["bos_token"] = DEFAULT_BOS_TOKEN
if tokenizer.unk_token is None:
    special_tokens_dict["unk_token"] = DEFAULT_UNK_TOKEN
And the pretrained tokenizer config is as follows:
{"bos_token": "", "eos_token": "", "model_max_length": 1000000000000000019884624838656, "tokenizer_class": "LLaMATokenizer", "unk_token": ""}
Since bos_token/eos_token/unk_token are empty strings rather than None, the `is None` checks above never fire and these tokens are never set.
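One possible workaround (my own sketch, not an official fix from the repo) is to treat an empty string the same as a missing token, since `not tok` is true for both None and "":

special_tokens_dict = dict()
# Treat empty-string special tokens as unset too, because this checkpoint's
# tokenizer_config.json sets bos/eos/unk to "" instead of omitting them.
if not tokenizer.pad_token:
    special_tokens_dict["pad_token"] = DEFAULT_PAD_TOKEN
if not tokenizer.eos_token:
    special_tokens_dict["eos_token"] = DEFAULT_EOS_TOKEN
if not tokenizer.bos_token:
    special_tokens_dict["bos_token"] = DEFAULT_BOS_TOKEN
if not tokenizer.unk_token:
    special_tokens_dict["unk_token"] = DEFAULT_UNK_TOKEN

train.py then passes special_tokens_dict to smart_tokenizer_and_embedding_resize, which calls tokenizer.add_special_tokens and resizes the model embeddings, so with this change the DEFAULT_* tokens actually get installed.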
Same problem here. Have you solved it?
@gongliym @minglii1998 have you solved this problem?