
[Help] bos_token_id in ChatGLM, but not in ChatGLM2

mathCrazyy opened this issue 1 year ago · 10 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

After swapping in ChatGLM2 in the liucongg/ChatGLM-Finetuning code, I found that bos_token_id no longer exists. How should I fix this?

Expected Behavior

no

Steps To Reproduce

The following error occurs:

    context_length = input_ids.index(tokenizer.bos_token_id)
    ValueError: None is not in list
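For reference, a minimal reproduction sketch, assuming the THUDM/chatglm2-6b checkpoint and a transformers version that supports trust_remote_code:

    from transformers import AutoTokenizer

    # ChatGLM2's custom tokenizer leaves the transformers-level bos token unset,
    # so the property returns None and list.index(None) raises the error above.
    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
    print(tokenizer.bos_token_id)  # None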

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

no

mathCrazyy · Jun 26 '23

nice

xxm1668 · Jun 27 '23

I changed the code to this: `context_length = len(a_ids)`

Pyjacc · Jun 27 '23
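For context, a rough sketch of where this line sits in a ChatGLM-style preprocessing routine; the a_ids/b_ids names follow the thread, and the details may differ from the actual liucongg/ChatGLM-Finetuning code:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
    prompt, answer = "你是谁?", "我是ChatGLM2-6B。"  # toy example

    # Tokenize prompt and answer separately so the prompt length is known.
    a_ids = tokenizer.encode(prompt, add_special_tokens=True)
    b_ids = tokenizer.encode(answer, add_special_tokens=False)
    input_ids = a_ids + b_ids + [tokenizer.eos_token_id]

    # ChatGLM marked the prompt/answer boundary with a bos token, so the old code did
    #   context_length = input_ids.index(tokenizer.bos_token_id)
    # With ChatGLM2 that id is None, so take the prompt length directly:
    context_length = len(a_ids)
    labels = [-100] * context_length + input_ids[context_length:]  # mask the prompt in the loss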

I changed the code to this: `context_length = len(a_ids)`

Yes, I made the same change. No idea why bos_token_id is missing.

5663015 · Jun 28 '23

Same question: why is there no bos_token_id? Also, why are eos_token_id and pad_token_id equal, both 2?

johnnywuj81 · Jun 29 '23

I changed the code to this: `context_length = len(a_ids)`

I later changed it to `context_length = input_ids.index(tokenizer._convert_token_to_id("sop"))`, which gets past that error, but then at `model_engine, optimizer, _, _ = deepspeed.initialize(config=conf, model=model, model_parameters=model.parameters())` I ran into a torch.cat on an empty tensor. Has anyone seen something similar?

mathCrazyy · Jun 29 '23

Same question: why is there no bos_token_id? Also, why are eos_token_id and pad_token_id equal, both 2?

There is a bos_id, but I didn't find a corresponding special token, so I changed the code to the following:

    tokens = prompt_tokens + src_tokens + ["[gMASK]", "sop"] + tgt_tokens + ["eop"]
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    context_length = input_ids.index(tokenizer._convert_token_to_id("sop"))

mathCrazyy · Jun 29 '23

Same question: why is there no bos_token_id? Also, why are eos_token_id and pad_token_id equal, both 2?

There is a bos_id, but I didn't find a corresponding special token, so I changed the code to the following:

    tokens = prompt_tokens + src_tokens + ["[gMASK]", "sop"] + tgt_tokens + ["eop"]
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    context_length = input_ids.index(tokenizer._convert_token_to_id("sop"))

@mathCrazyy When I print it, bos_id is None?

5663015 · Jun 30 '23

In ChatGLM-6B/ptuning/main.py, change `context_length = input_ids.index(tokenizer.bos_token_id)` to: `context_length = input_ids.index(tokenizer.get_command("sop"))`

Also, in the model file chatglm2-6b/tokenization_chatglm.py, change `token_ids_0 = prefix_tokens + token_ids_0` to: `token_ids_0 = token_ids_0 + prefix_tokens`

hanwei2008 · Jul 04 '23
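For reference, ChatGLM2's tokenization_chatglm.py builds model inputs roughly as below; this is a simplified reconstruction from the snippets quoted in this thread, with hanwei2008's proposed reordering shown as a comment:

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        prefix_tokens = self.get_prefix_tokens()   # the [gMASK] and sop command ids
        token_ids_0 = prefix_tokens + token_ids_0  # default: prefix goes in front
        # proposed change: token_ids_0 = token_ids_0 + prefix_tokens
        if token_ids_1 is not None:
            token_ids_0 = token_ids_0 + token_ids_1 + [self.get_command("<eos>")]
        return token_ids_0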

SPTokenizer does define bos_id and eos_id, but the names don't match the bos_token_id / eos_token_id that transformers expects by default, lol.

yongzhuo · Jul 06 '23
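A quick way to see the naming mismatch; the `.tokenizer` attribute path below assumes the inner SentencePiece wrapper is exposed the way the ChatGLM2 repo does it:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
    print(tok.bos_token_id)      # None: the transformers-level name is never populated
    print(tok.tokenizer.bos_id)  # the inner SPTokenizer attribute uses a different name
    print(tok.eos_token_id, tok.pad_token_id)  # both 2, as noted above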

In ChatGLM-6B/ptuning/main.py, change `context_length = input_ids.index(tokenizer.bos_token_id)` to: `context_length = input_ids.index(tokenizer.get_command("sop"))`

Also, in the model file chatglm2-6b/tokenization_chatglm.py, change `token_ids_0 = prefix_tokens + token_ids_0` to: `token_ids_0 = token_ids_0 + prefix_tokens`

But suppose the official instruction tuning also ran with this code: it was wrong to begin with, and fixing it may actually make results worse, so the only option is to go along with the bug [doge]?

yongzhuo · Jul 06 '23

Judging from the build_inputs_with_special_tokens code, this is simply a bug...

    if token_ids_1 is not None:
        token_ids_0 = token_ids_0 + token_ids_1 + [self.get_command("<eos>")]

Changing it to `len(a_ids)` should work.

adhb22 · Jul 12 '23