no special tokens
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
The eos token is None.
Expected Behavior
No response
Steps To Reproduce
SFT model
Environment
- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :
Anything else?
> cat tokenizer_config.json
{
  "name_or_path": "THUDM/chatglm2-6b",
  "bos_token": "",
  "eop_token": " ",
  "eos_token": "",
  "gmask_token": "[gMASK]",
  "mask_token": "[MASK]",
  "pad_token": " ",
  "unk_token": " ",
  "remove_space": false,
  "do_lower_case": false,
  "tokenizer_class": "ChatGLMTokenizer",
  "num_image_tokens": 0,
  "auto_map": {
    "AutoTokenizer": [
      "tokenization_chatglm.ChatGLMTokenizer",
      null
    ]
  }
}
Suggest adding the special tokens:
> cat tokenizer_config.json
{
"name_or_path": "THUDM/chatglm2-6b",
"bos_token": "<sop>",
"eop_token": "<eop>",
"eos_token": "</s>",
"gmask_token": "[gMASK]",
"mask_token": "[MASK]",
"pad_token": "<pad>",
"unk_token": "<unk>",
"remove_space": false,
"do_lower_case": false,
"tokenizer_class": "ChatGLMTokenizer",
"num_image_tokens": 0,
"auto_map": {
"AutoTokenizer": [
"tokenization_chatglm.ChatGLMTokenizer",
null
]
}
}
The main reason they weren't added is that, if they were, then whenever a user's input happens to contain these strings, the tokenizer would encode them into ids with special meaning, causing unexpected behavior.
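The concern above can be illustrated with a toy word-level tokenizer (a pure-Python sketch, not the real ChatGLMTokenizer; the `encode` helper and its id scheme are made up for illustration): once "</s>" is registered as a special token, a user message that literally contains "</s>" gets mapped to the reserved EOS id instead of being tokenized as ordinary text.

```python
import re

# Reserved ids, matching the vocab entries quoted later in this thread.
SPECIAL_IDS = {"<unk>": 0, "<s>": 1, "</s>": 2}

def encode(text, use_special_tokens):
    """Toy encoder: registered special tokens map to reserved ids,
    ordinary words map to hash-based ids >= 100."""
    if use_special_tokens:
        # Split the text so that literal special-token strings become
        # their own pieces, exactly as special-token handling would.
        pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_IDS) + ")"
        pieces = [p for p in re.split(pattern, text) if p]
    else:
        pieces = [text]
    ids = []
    for p in pieces:
        if p in SPECIAL_IDS:
            ids.append(SPECIAL_IDS[p])
        else:
            ids.extend(100 + (hash(w) % 1000) for w in p.split())
    return ids

user_input = "the html tag </s> closes a sentence"
plain = encode(user_input, use_special_tokens=False)
special = encode(user_input, use_special_tokens=True)
# Without special tokens, no reserved id appears; with them,
# the EOS id (2) shows up in the middle of the user's text.
print(2 in plain, 2 in special)  # → False True
```

This is the trade-off the maintainers describe: registering the tokens makes EOS detection work, but also makes those literal strings "magic" inside arbitrary user input.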
Then the only option is to add them manually at instruction fine-tuning time:
tokenizer.add_special_tokens({
"eos_token": "</s>",
"bos_token": "<sop>",
"unk_token": "<unk>",
})
I see that ChatGLM2's vocabulary contains:
"<unk>": 0,
"<s>": 1,
"</s>": 2,
So these tokens are actually unused, right?
Also, are the bos and eos tokens used during training <sop> and <eop>, the same as in ChatGLM?
When I print tokenizer.unk_token, eos_token, bos_token and their corresponding ids, they are None.