ChatGLM2-6B

no special tokens

shibing624 opened this issue 2 years ago · 5 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

The eos token is `None`.

Expected Behavior

No response

Steps To Reproduce

SFT model

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

cat tokenizer_config.json

```json
{
  "name_or_path": "THUDM/chatglm2-6b",
  "bos_token": "",
  "eop_token": "",
  "eos_token": "",
  "gmask_token": "[gMASK]",
  "mask_token": "[MASK]",
  "pad_token": "",
  "unk_token": "",
  "remove_space": false,
  "do_lower_case": false,
  "tokenizer_class": "ChatGLMTokenizer",
  "num_image_tokens": 0,
  "auto_map": {
    "AutoTokenizer": [
      "tokenization_chatglm.ChatGLMTokenizer",
      null
    ]
  }
}
```

shibing624 · Jun 26 '23 12:06

I suggest adding the special tokens:

cat tokenizer_config.json

```json
{
  "name_or_path": "THUDM/chatglm2-6b",
  "bos_token": "<sop>",
  "eop_token": "<eop>",
  "eos_token": "</s>",
  "gmask_token": "[gMASK]",
  "mask_token": "[MASK]",
  "pad_token": "<pad>",
  "unk_token": "<unk>",
  "remove_space": false,
  "do_lower_case": false,
  "tokenizer_class": "ChatGLMTokenizer",
  "num_image_tokens": 0,
  "auto_map": {
    "AutoTokenizer": [
      "tokenization_chatglm.ChatGLMTokenizer",
      null
    ]
  }
}
```

shibing624 · Jun 26 '23 12:06

They weren't added mainly because, if they were, the tokenizer would encode these strings into ids with special meaning whenever they appear in user input, causing unexpected behavior.

duzx16 · Jun 26 '23 14:06
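The concern above can be illustrated with a toy sketch. This is a hypothetical whitespace tokenizer, not the real ChatGLMTokenizer: once `</s>` is registered as a special token, a literal `</s>` typed by a user collapses to the special id instead of being encoded as ordinary text.

```python
# Toy sketch (hypothetical tokenizer, not the real ChatGLMTokenizer):
# registering "</s>" as a special token changes how a literal "</s>"
# occurring in user text is encoded.
SPECIAL = {"</s>": 2}

def encode(text, special=None):
    special = special or {}
    ids = []
    for piece in text.split():
        if piece in special:
            ids.append(special[piece])          # collapsed to the special id
        else:
            ids.extend(ord(c) for c in piece)   # stand-in for ordinary ids
    return ids

plain = encode("quote </s> here")               # "</s>" encoded char by char
hijacked = encode("quote </s> here", SPECIAL)   # "</s>" becomes id 2
```

With no special tokens registered, the literal string survives as ordinary ids; with them registered, it is silently mapped to the end-of-sequence id, which is exactly the unexpected effect described above.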

Then the only option is to add them manually during instruction finetuning:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
tokenizer.add_special_tokens({
    "eos_token": "</s>",
    "bos_token": "<sop>",
    "unk_token": "<unk>",
})
```

shibing624 · Jun 27 '23 02:06

I see that ChatGLM2's vocabulary contains:

```
"<unk>": 0,
"<s>": 1,
"</s>": 2,
```

So these tokens are actually unused, right?

Also, during training, were the bos/eos tokens `<sop>`/`<eop>`, the same as in ChatGLM?

Randool · Jun 27 '23 03:06

When I print `tokenizer.unk_token`, `eos_token`, `bos_token` and their corresponding ids, they are all `None`.

shibing624 · Jun 27 '23 04:06
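The `None` values line up with the config pasted at the top of the thread: the special-token fields are empty, so the tokenizer has nothing to expose. A minimal sketch of that lookup (toy code, assuming a config shaped like the one above, not the actual ChatGLMTokenizer internals):

```python
import json

# A trimmed stand-in for the tokenizer_config.json from the top of the
# thread, with the special-token fields left empty.
config = json.loads("""
{
  "bos_token": "",
  "eos_token": "",
  "unk_token": ""
}
""")

def special_token(cfg, key):
    # Mirror the observed behavior: an empty or missing entry yields None.
    return cfg.get(key) or None

for key in ("bos_token", "eos_token", "unk_token"):
    print(key, "->", special_token(config, key))
```

An entry only resolves once it holds a non-empty string, which is why registering the tokens (either in the config or via `add_special_tokens`) makes the `None`s go away.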