ChatGLM2-6B
Confusion about tokenization
Is your feature request related to a problem? Please describe.
No response
Solutions
# ChatGLMTokenizer is defined in tokenization_chatglm.py, shipped alongside the model files
from tokenization_chatglm import ChatGLMTokenizer

path = "THUDM/chatglm2-6b"  # or a local checkout of the repo
tokenizer = ChatGLMTokenizer.from_pretrained(path)
print(tokenizer.tokenizer.special_tokens)
# {'[MASK]': 64789, '[gMASK]': 64790, '[sMASK]': 64791, 'sop': 64792, 'eop': 64793}
print(tokenizer.pad_token_id, tokenizer.bos_token_id, tokenizer.eos_token_id)
# 0 None 2
print(tokenizer.pad_token, tokenizer.bos_token, tokenizer.eos_token)
# <unk> None </s>
print(tokenizer.tokenizer.eos_id, tokenizer.tokenizer.pad_id, tokenizer.tokenizer.bos_id)
# 2 0 1
print(tokenizer.get_prefix_tokens(), tokenizer.get_command('<eos>'))
# [64790, 64792] 2
Additional context
A question about SFT training: which tokens are actually used for start and end during encoding? The logic of ChatGLMTokenizer.build_inputs_with_special_tokens, taken together with bos_token_id and eos_token_id, is rather confusing. Does [gMASK] also need to be added?
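For reference, the prefix tokens printed above ([64790, 64792]) correspond to [gMASK] and sop in the special_tokens dict. Below is a minimal sketch of what the encoding appears to do, inferred purely from those printed outputs; the actual logic lives in the repo's tokenization_chatglm.py, so treat this as an assumption, not the authoritative implementation:

# Sketch of ChatGLM2's special-token assembly, inferred from the outputs above.
from typing import List, Optional

GMASK_ID = 64790  # '[gMASK]' in the printed special_tokens dict
SOP_ID = 64792    # 'sop'
EOS_ID = 2        # tokenizer.tokenizer.eos_id

def get_prefix_tokens() -> List[int]:
    # Matches the printed output: [64790, 64792]
    return [GMASK_ID, SOP_ID]

def build_inputs_with_special_tokens(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    # Every sequence gets [gMASK] + sop prepended; bos (id 1) is never used,
    # which is consistent with tokenizer.bos_token being None above.
    ids = get_prefix_tokens() + token_ids_0
    if token_ids_1 is not None:
        # For a (prompt, response) pair, eos is appended only at the very end.
        ids = ids + token_ids_1 + [EOS_ID]
    return ids

Under this reading, SFT inputs look like [gMASK], sop, prompt tokens, response tokens, eos; there is no separate bos at the start.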
Same question here. Is the difference between ChatGLM and v2 really a deliberate design? In particular:
tokenizer_v2.bos_token and tokenizer_v2.bos_token_id do not exist in v2;
in v2, tokenizer_v2.pad_token == '<unk>' and tokenizer_v2.pad_token_id == 0, while
tokenizer_v2.special_tokens is
{'
The test results above are based on commit 74d61a69043828fd740df7d2e75b1b55b988f06e.
It looks like pad_id was disabled (set to -1) when the SentencePiece model was trained, so unk_id (0) ended up being used as the tokenizer's pad_id, which is an odd choice.
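This is easy to reproduce with SentencePiece itself: training with pad_id=-1 yields a model with no pad piece at all, so any wrapper that needs a pad token has to reuse some existing id. A self-contained demo (toy.txt is a placeholder corpus; any small text file works):

import sentencepiece as spm

# Train a tiny model with pad disabled, mirroring the suspected setup.
spm.SentencePieceTrainer.train(
    input="toy.txt",      # placeholder corpus
    model_prefix="toy",
    vocab_size=100,
    pad_id=-1,            # pad piece disabled at training time
    unk_id=0,
    bos_id=1,
    eos_id=2,
)

sp = spm.SentencePieceProcessor(model_file="toy.model")
print(sp.pad_id(), sp.unk_id(), sp.bos_id(), sp.eos_id())
# -> -1 0 1 2: no pad piece exists, so ChatGLM2's wrapper apparently
#    falls back to unk (id 0) as its pad_token_id.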
Could someone explain what prefix_token is?