ChatGLM2-6B

Tokenizer cannot add special tokens

Open · icemoon-creative opened this issue 1 year ago · 1 comment

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("/data/chatglm/chatglm2-6b", trust_remote_code=True)
encoded_input1 = tokenizer.encode('你好', add_special_tokens=False)
tokenizer.unk_token_id = 0
tokenizer.add_special_tokens({
    "eos_token": "",
    "bos_token": "",
    "unk_token": "",
})
encoded_input2 = tokenizer.encode('你好', add_special_tokens=True)
```

`encoded_input1` is `[36474, 54591]`; `encoded_input2` is `[64790, 64792, 36474, 54591]`.

Expected Behavior

Why are the eos_token and bos_token not added when `add_special_tokens=True`? How can this be fixed?

Steps To Reproduce

See the code above.

Environment

- OS: Ubuntu 30.04
- Python: 3.10
- Transformers: 4.30.0
- PyTorch: 2.0.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

icemoon-creative avatar Jun 29 '23 08:06 icemoon-creative

The vocabulary size is only 64790, so why do the ids 64790 and 64792 appear?

liu-nlper avatar Jun 30 '23 09:06 liu-nlper

Changing the automatically added special tokens is not supported, because they have to stay consistent with how the model was trained.
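Since the tokenizer's automatic special tokens cannot be overridden, one workaround is to encode without them and append the desired ids yourself. The sketch below is a minimal, model-free illustration of that pattern (the `toy_encode` stand-in and the `eos_id=2` value are placeholders, not the real ChatGLM2 tokenizer or its ids):

```python
# Sketch: instead of relying on add_special_tokens=True, encode the text
# without special tokens and append the ids you need manually.

def encode_with_eos(encode_fn, text, eos_id):
    """Encode `text` via `encode_fn` (no automatic special tokens),
    then append an EOS id at the end."""
    return encode_fn(text) + [eos_id]

# Toy stand-in for tokenizer.encode(text, add_special_tokens=False);
# [36474, 54591] are the ids reported for '你好' in this issue.
toy_encode = lambda text: [36474, 54591]

print(encode_with_eos(toy_encode, "你好", eos_id=2))  # -> [36474, 54591, 2]
```

With a real tokenizer, `encode_fn` would be `lambda t: tokenizer.encode(t, add_special_tokens=False)` and `eos_id` would come from `tokenizer.eos_token_id`, assuming that attribute is populated by the loaded tokenizer config.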

duzx16 avatar Jul 05 '23 13:07 duzx16