
[Help] bos_token_id in ChatGLM, but not in ChatGLM2

mathCrazyy opened this issue 1 year ago · 10 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

After swapping in ChatGLM2 in the liucongg/ChatGLM-Finetuning code, I found that bos_token_id no longer exists. How should I fix this?

Expected Behavior

no

Steps To Reproduce

The following error occurs:

    context_length = input_ids.index(tokenizer.bos_token_id)
    ValueError: None is not in list
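For reference, a minimal reproduction sketch, assuming the THUDM/chatglm2-6b checkpoint and a transformers version that supports trust_remote_code:

    from transformers import AutoTokenizer

    # ChatGLM2's custom tokenizer leaves the transformers-level bos token unset,
    # so the property returns None and list.index(None) raises the error above.
    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
    print(tokenizer.bos_token_id)  # None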

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

no

mathCrazyy · Jun 26 '23

nice

xxm1668 · Jun 27 '23

I changed the code to this: `context_length = len(a_ids)`

Pyjacc · Jun 27 '23
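For context, a rough sketch of where this line sits in a ChatGLM-style preprocessing routine; the a_ids/b_ids names follow the thread, and the details may differ from the actual liucongg/ChatGLM-Finetuning code:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
    prompt, answer = "你是谁?", "我是ChatGLM2-6B。"  # toy example

    # Tokenize prompt and answer separately so the prompt length is known.
    a_ids = tokenizer.encode(prompt, add_special_tokens=True)
    b_ids = tokenizer.encode(answer, add_special_tokens=False)
    input_ids = a_ids + b_ids + [tokenizer.eos_token_id]

    # ChatGLM marked the prompt/answer boundary with a bos token, so the old code did
    #   context_length = input_ids.index(tokenizer.bos_token_id)
    # With ChatGLM2 that id is None, so take the prompt length directly:
    context_length = len(a_ids)
    labels = [-100] * context_length + input_ids[context_length:]  # mask the prompt in the loss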

I changed the code to this: `context_length = len(a_ids)`

Yes, I made the same change. No idea why bos_token_id is missing.

5663015 · Jun 28 '23

Same question: why is there no bos_token_id? Also, why are eos_token_id and pad_token_id equal, both 2?

johnnywuj81 · Jun 29 '23

I changed the code to this: `context_length = len(a_ids)`

I later changed it to `context_length = input_ids.index(tokenizer._convert_token_to_id("sop"))`, which gets past that error, but then at `model_engine, optimizer, _, _ = deepspeed.initialize(config=conf, model=model, model_parameters=model.parameters())` I ran into a torch.cat on an empty tensor. Has anyone seen something similar?

mathCrazyy · Jun 29 '23

Same question: why is there no bos_token_id? Also, why are eos_token_id and pad_token_id equal, both 2?

There is a bos_id, but I didn't find a corresponding special token, so I changed the code to the following:

    tokens = prompt_tokens + src_tokens + ["[gMASK]", "sop"] + tgt_tokens + ["eop"]
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    context_length = input_ids.index(tokenizer._convert_token_to_id("sop"))

mathCrazyy · Jun 29 '23

Same question: why is there no bos_token_id? Also, why are eos_token_id and pad_token_id equal, both 2?

There is a bos_id, but I didn't find a corresponding special token, so I changed the code to the following:

    tokens = prompt_tokens + src_tokens + ["[gMASK]", "sop"] + tgt_tokens + ["eop"]
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    context_length = input_ids.index(tokenizer._convert_token_to_id("sop"))

@mathCrazyy When I print it, bos_id is None?

5663015 · Jun 30 '23

In ChatGLM-6B/ptuning/main.py, change `context_length = input_ids.index(tokenizer.bos_token_id)` to: `context_length = input_ids.index(tokenizer.get_command("sop"))`

Also, in the model file chatglm2-6b/tokenization_chatglm.py, change `token_ids_0 = prefix_tokens + token_ids_0` to: `token_ids_0 = token_ids_0 + prefix_tokens`

hanwei2008 · Jul 04 '23
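For reference, ChatGLM2's tokenization_chatglm.py builds model inputs roughly as below; this is a simplified reconstruction from the snippets quoted in this thread, with hanwei2008's proposed reordering shown as a comment:

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        prefix_tokens = self.get_prefix_tokens()   # the [gMASK] and sop command ids
        token_ids_0 = prefix_tokens + token_ids_0  # default: prefix goes in front
        # proposed change: token_ids_0 = token_ids_0 + prefix_tokens
        if token_ids_1 is not None:
            token_ids_0 = token_ids_0 + token_ids_1 + [self.get_command("<eos>")]
        return token_ids_0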

SPTokenizer does define bos_id and eos_id, but the names don't match the bos_token_id / eos_token_id that transformers expects by default, lol.

yongzhuo · Jul 06 '23
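A quick way to see the naming mismatch; the `.tokenizer` attribute path below assumes the inner SentencePiece wrapper is exposed the way the ChatGLM2 repo does it:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
    print(tok.bos_token_id)      # None: the transformers-level name is never populated
    print(tok.tokenizer.bos_id)  # the inner SPTokenizer attribute uses a different name
    print(tok.eos_token_id, tok.pad_token_id)  # both 2, as noted above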

In ChatGLM-6B/ptuning/main.py, change `context_length = input_ids.index(tokenizer.bos_token_id)` to: `context_length = input_ids.index(tokenizer.get_command("sop"))`

Also, in the model file chatglm2-6b/tokenization_chatglm.py, change `token_ids_0 = prefix_tokens + token_ids_0` to: `token_ids_0 = token_ids_0 + prefix_tokens`

But suppose the official instruction tuning also ran with this code: it was wrong to begin with, and fixing it may actually make results worse, so the only option is to go along with the bug [doge]?

yongzhuo · Jul 06 '23

Judging from the build_inputs_with_special_tokens code, this is simply a bug...

    if token_ids_1 is not None:
        token_ids_0 = token_ids_0 + token_ids_1 + [self.get_command("<eos>")]

Changing it to `len(a_ids)` should work.

adhb22 · Jul 12 '23