ChatGLM2-6B

Tokenizer cannot add special tokens

Open · icemoon-creative opened this issue 1 year ago · 1 comment

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("/data/chatglm/chatglm2-6b", trust_remote_code=True)
encoded_input1 = tokenizer.encode('你好', add_special_tokens=False)
tokenizer.unk_token_id = 0
tokenizer.add_special_tokens({
    "eos_token": "",
    "bos_token": "",
    "unk_token": "",
})
encoded_input2 = tokenizer.encode('你好', add_special_tokens=True)
```

`encoded_input1` is `[36474, 54591]`; `encoded_input2` is `[64790, 64792, 36474, 54591]`.

Expected Behavior

Why are the eos_token and bos_token not added when `add_special_tokens=True`? How can this be fixed?

Steps To Reproduce

See the code above.

Environment

- OS: Ubuntu 30.04
- Python: 3.10
- Transformers: 4.30.0
- PyTorch: 2.0.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

icemoon-creative avatar Jun 29 '23 08:06 icemoon-creative

The vocabulary size is only 64790, so why do the ids 64790 and 64792 appear?

liu-nlper avatar Jun 30 '23 09:06 liu-nlper

Changing the automatically added special tokens is not supported, because they have to stay consistent with how the model was trained.
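Since the tokenizer's automatic special tokens cannot be overridden, one workaround is to encode without them and append the desired ids yourself. The sketch below is a minimal, model-free illustration of that pattern (the `toy_encode` stand-in and the `eos_id=2` value are placeholders, not the real ChatGLM2 tokenizer or its ids):

```python
# Sketch: instead of relying on add_special_tokens=True, encode the text
# without special tokens and append the ids you need manually.

def encode_with_eos(encode_fn, text, eos_id):
    """Encode `text` via `encode_fn` (no automatic special tokens),
    then append an EOS id at the end."""
    return encode_fn(text) + [eos_id]

# Toy stand-in for tokenizer.encode(text, add_special_tokens=False);
# [36474, 54591] are the ids reported for '你好' in this issue.
toy_encode = lambda text: [36474, 54591]

print(encode_with_eos(toy_encode, "你好", eos_id=2))  # -> [36474, 54591, 2]
```

With a real tokenizer, `encode_fn` would be `lambda t: tokenizer.encode(t, add_special_tokens=False)` and `eos_id` would come from `tokenizer.eos_token_id`, assuming that attribute is populated by the loaded tokenizer config.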

duzx16 avatar Jul 05 '23 13:07 duzx16