ChatGLM2-6B

Confusion about tokenization

Open diuzi opened this issue 1 year ago • 2 comments

Is your feature request related to a problem? Please describe.

No response

Solutions

```python
# Assumes the ChatGLMTokenizer shipped with the THUDM/chatglm2-6b repo,
# with `path` pointing at the local model directory
from tokenization_chatglm import ChatGLMTokenizer

tokenizer = ChatGLMTokenizer.from_pretrained(path)
print(tokenizer.tokenizer.special_tokens)
# {'[MASK]': 64789, '[gMASK]': 64790, '[sMASK]': 64791, 'sop': 64792, 'eop': 64793}

print(tokenizer.pad_token_id, tokenizer.bos_token_id, tokenizer.eos_token_id)
# 0 None 2

print(tokenizer.pad_token, tokenizer.bos_token, tokenizer.eos_token)
# <unk> None </s>

print(tokenizer.tokenizer.eos_id, tokenizer.tokenizer.pad_id, tokenizer.tokenizer.bos_id)
# 2 0 1

print(tokenizer.get_prefix_tokens(), tokenizer.get_command('<eos>'))
# [64790, 64792] 2
```
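From the ids printed above, the special-token insertion can be sketched as follows. This is a reconstruction from the observed output, not the authoritative implementation; in particular, appending `<eos>` only for the second segment of a pair is an assumption:

```python
# Special-token ids as printed by tokenizer.tokenizer.special_tokens above
SPECIAL = {'[MASK]': 64789, '[gMASK]': 64790, '[sMASK]': 64791, 'sop': 64792, 'eop': 64793}
EOS_ID = 2  # tokenizer.get_command('<eos>')

def build_inputs(token_ids_0, token_ids_1=None):
    """Sketch of build_inputs_with_special_tokens based on the printed values."""
    prefix = [SPECIAL['[gMASK]'], SPECIAL['sop']]  # matches get_prefix_tokens() -> [64790, 64792]
    ids = prefix + token_ids_0
    if token_ids_1 is not None:
        # Assumption: the response segment is terminated with <eos>
        ids = ids + token_ids_1 + [EOS_ID]
    return ids

print(build_inputs([100, 101]))        # [64790, 64792, 100, 101]
print(build_inputs([100], [200]))      # [64790, 64792, 100, 200, 2]
```

Under this reading, `[gMASK]` and `sop` act as the "start" markers for generation and `<eos>` (id 2) as the "end", while `bos_token_id` is simply unused.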

Additional context

I'd like to ask: for SFT training, which tokens are used as the start and end markers during encoding? The logic of ChatGLMTokenizer.build_inputs_with_special_tokens, taken together with bos_token_id and eos_token_id, is confusing. Does [gMASK] also need to be added?

diuzi avatar Jul 04 '23 06:07 diuzi

Same question here. Are the differences between ChatGLM and v2 a deliberate design choice? In particular:

In v2, `tokenizer_v2.bos_token` / `tokenizer_v2.bos_token_id` do not exist; `tokenizer_v2.pad_token == '<unk>'` and `tokenizer_v2.pad_token_id == 0`, yet `tokenizer_v2.special_tokens` is `{'<s>': 1, '</s>': 2, '<unk>': 0}`.

The test results above are based on commit 74d61a69043828fd740df7d2e75b1b55b988f06e.

nlp4whp avatar Jul 04 '23 07:07 nlp4whp

It looks like pad_id was disabled (set to -1) when the sentencepiece model was trained, so unk_id (0) is used as the tokenizer's pad_id, which is an odd choice.
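A minimal sketch of that fallback, assuming the wrapper substitutes unk_id whenever sentencepiece reports pad_id as disabled (`SPStub` is a hypothetical stand-in for the real sentencepiece processor, not the actual class):

```python
# Hypothetical stand-in for a sentencepiece processor whose model was
# trained with pad_id disabled (sentencepiece then reports pad_id() == -1)
class SPStub:
    def pad_id(self):
        return -1  # disabled at training time
    def unk_id(self):
        return 0

sp = SPStub()
# Assumed fallback: use unk_id as pad when pad_id is disabled,
# which would explain the observed pad_id == 0 / pad_token == '<unk>'
pad_id = sp.pad_id() if sp.pad_id() >= 0 else sp.unk_id()
print(pad_id)  # 0
```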

LotuSrc avatar Jul 06 '23 08:07 LotuSrc

Could someone explain what prefix_token is?

Aran-Guo avatar Oct 26 '23 06:10 Aran-Guo