GLM-4 output becomes garbled after several rounds of conversation
System Info / 系統信息
Ubuntu 20.04, Huawei MindIE environment
Who can help? / 谁可以帮助到您?
Information / 问题信息
- [ ] The official example scripts / 官方的示例脚本
- [X] My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
Background: to keep compatibility with the Huawei 300I Duo card, I hacked the GLM-4 code into ChatGLM3's shape (the 300I Duo does not support GLM-4 yet, so I copied parts of chatglm3-6b's tokenization_chatglm.py into GLM-4's). At first the model answers normally, but after roughly 10 rounds of conversation the generated output turns entirely into garbled characters, and it stays garbled from then on. The modified tokenization_chatglm.py from the GLM-4 weights directory is pasted in the reply below.
Running GLM-4 with this file reproduces the problem described above.
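To isolate whether the tokenizer itself corrupts text, it is worth running a plain encode/decode round-trip first. A minimal sketch, assuming the modified tokenizer loads through AutoTokenizer with trust_remote_code=True; the checkpoint path is hypothetical:

from transformers import AutoTokenizer

# Hypothetical local path to the modified GLM-4 checkpoint -- adjust to your setup.
tokenizer = AutoTokenizer.from_pretrained("/path/to/glm-4-9b-chat-modified", trust_remote_code=True)

for text in ["你好,请介绍一下你自己。", "Hello, how are you?", "混合 mixed 内容 123"]:
    ids = tokenizer.encode(text, add_special_tokens=False)
    out = tokenizer.decode(ids)
    # A healthy tokenizer round-trips exactly; '\ufffd' in the output means a byte
    # sequence was split badly or an id fell outside the decoder table.
    print(ids, repr(out), out == text)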
Expected behavior / 期待表现
Identify the root cause and fix the garbled output.
import regex as re
import base64
import os
import json
import tiktoken
from torch import TensorType
from sentencepiece import SentencePieceProcessor
from typing import List, Optional, Union, Dict, Any
from transformers import PreTrainedTokenizer
from transformers.utils import logging, PaddingStrategy
from transformers.tokenization_utils_base import EncodedInput, BatchEncoding


class ChatGLM4Tokenizer(PreTrainedTokenizer):
    vocab_files_names = {"vocab_file": "tokenizer.model"}
    model_input_names = ["input_ids", "attention_mask", "position_ids"]

def __init__(
self,
vocab_file,
padding_side="left",
clean_up_tokenization_spaces=False,
encode_special_tokens=False,
**kwargs
):
self.name = "GLM4Tokenizer"
self.vocab_file = vocab_file
pat_str = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
self.pat_str = re.compile(pat_str)
self.encode_special_tokens = encode_special_tokens
special_tokens = ["[MASK]", "[gMASK]", "[sMASK]", "sop", "eop", "<|system|>", "<|user|>", "<|assistant|>",
"<|observation|>"]
self.special_tokens = {}
self.index_special_tokens = {}
self.n_words = 151552
for token in special_tokens:
self.special_tokens[token] = self.n_words
self.index_special_tokens[self.n_words] = token
self.n_words += 1
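        # NOTE (assumption, worth verifying): in the upstream glm-4-9b-chat tokenizer the
        # special-token ids start right after the BPE ranks (around 151329, e.g.
        # <|user|> = 151336), not at 151552. If the model weights expect those ids,
        # numbering from 151552 feeds the model ids it was never trained on, and ids at
        # or beyond the padded vocab size can even index out of the embedding range.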
mergeable_ranks = {}
with open(vocab_file) as f:
for line in f:
token, rank = line.strip().split()
rank = int(rank)
token = base64.b64decode(token)
mergeable_ranks[token] = rank
self.mergeable_ranks = mergeable_ranks
self.tokenizer = tiktoken.Encoding(
name="my_tokenizer",
pat_str=pat_str,
mergeable_ranks=mergeable_ranks,
special_tokens={}
)
# self.tokenizer = SPTokenizer(vocab_file)
self.decoder = {rank: token for token, rank in mergeable_ranks.items()}
self.n_words = len(self.decoder)
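        # NOTE: this overwrites the n_words initialized to 151552 above. The special-token
        # ids assigned from 151552 upward are absent from self.decoder, so
        # _convert_id_to_token() below silently returns "" for them -- one plausible
        # source of corrupted round-trips.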
super().__init__(
padding_side=padding_side,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
**kwargs
)
@property
def vocab_size(self):
return self.n_words
def get_vocab(self):
""" Returns vocab as a dict """
vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
vocab.update(self.added_tokens_encoder)
return vocab
def get_command(self, token):
if token in self.special_tokens:
return self.special_tokens[token]
assert token in self.tokenizer.special_tokens, f"{token} is not a special token for {self.name}"
return self.tokenizer.special_tokens[token]
def build_single_message(self, role, metadata, message):
assert role in ["system", "user", "assistant", "observation"], role
role_tokens = [self.get_command(f"<|{role}|>")] + self.tokenizer.encode(f"{metadata}\n")
message_tokens = self.tokenizer.encode(message)
tokens = role_tokens + message_tokens
return tokens
def build_chat_input(self, query, history=None, role="user"):
if history is None:
history = []
input_ids = []
for item in history:
content = item["content"]
if item["role"] == "system" and "tools" in item:
content = content + "\n" + json.dumps(item["tools"], indent=4, ensure_ascii=False)
input_ids.extend(self.build_single_message(item["role"], item.get("metadata", ""), content))
input_ids.extend(self.build_single_message(role, "", query))
input_ids.extend([self.get_command("<|assistant|>")])
return self.batch_encode_plus([input_ids], return_tensors="pt", is_split_into_words=True)
def convert_tokens_to_string(self, tokens: List[Union[bytes, str]]) -> str:
"""
        Converts a sequence of tokens into a single string.
"""
text = ""
temp = b""
for t in tokens:
if isinstance(t, str):
if temp:
text += temp.decode("utf-8", errors="replace")
temp = b""
text += t
elif isinstance(t, bytes):
temp += t
else:
                raise TypeError("token should only be of type bytes or str")
if temp:
text += temp.decode("utf-8", errors="replace")
return text
def _tokenize(self, text, **kwargs):
tokens = []
ids = self.tokenizer.encode(text)
for t in ids:
tokens.append(self.decoder[t])
return tokens
def _convert_token_to_id(self, token):
""" Converts a token (str) in an id using the vocab. """
return self.mergeable_ranks[token]
def _convert_id_to_token(self, index):
"""Converts an index (integer) in a token (str) using the vocab."""
return self.decoder.get(index, "")
def save_vocabulary(self, save_directory, filename_prefix=None):
"""
Save the vocabulary and special tokens file to a directory.
Args:
save_directory (`str`):
The directory in which to save the vocabulary.
filename_prefix (`str`, *optional*):
                An optional prefix to add to the names of the saved files.
Returns:
`Tuple(str)`: Paths to the files saved.
"""
if os.path.isdir(save_directory):
vocab_file = os.path.join(
save_directory, self.vocab_files_names["vocab_file"]
)
else:
vocab_file = save_directory
with open(self.vocab_file, 'rb') as fin:
proto_str = fin.read()
with open(vocab_file, "wb") as writer:
writer.write(proto_str)
return (vocab_file,)
def get_prefix_tokens(self):
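        # NOTE (worth checking): __init__ registers "sop"/"eop" (ChatGLM3 style) in
        # special_tokens, while this lookup asks for "<sop>" (GLM-4 style). If "<sop>"
        # is not registered as an added token elsewhere, the id returned here may not
        # be the one the model was trained with.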
prefix_tokens = [self.convert_tokens_to_ids("[gMASK]"), self.convert_tokens_to_ids("<sop>")]
return prefix_tokens
def build_single_message(self, role, metadata, message, tokenize=True):
assert role in ["system", "user", "assistant", "observation"], role
if tokenize:
role_tokens = [self.convert_tokens_to_ids(f"<|{role}|>")] + self.tokenizer.encode(f"{metadata}\n",
disallowed_special=())
message_tokens = self.tokenizer.encode(message, disallowed_special=())
tokens = role_tokens + message_tokens
return tokens
else:
return str(f"<|{role}|>{metadata}\n{message}")
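    # NOTE: the definition above shadows the build_single_message defined earlier in
    # this class; in a Python class body the later definition wins, so only this
    # variant (with the extra `tokenize` parameter defaulting to True) is the one
    # build_chat_input() actually calls.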
def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. A BERT sequence has the following format:
- single sequence: `[CLS] X [SEP]`
- pair of sequences: `[CLS] A [SEP] B [SEP]`
Args:
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
"""
prefix_tokens = self.get_prefix_tokens()
token_ids_0 = prefix_tokens + token_ids_0
if token_ids_1 is not None:
token_ids_0 = token_ids_0 + token_ids_1 + [self.convert_tokens_to_ids("<eos>")]
return token_ids_0
def _pad(
self,
encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
max_length: Optional[int] = None,
padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
pad_to_multiple_of: Optional[int] = None,
return_attention_mask: Optional[bool] = None,
) -> dict:
"""
Pad encoded inputs (on left/right and up to predefined length or max length in the batch)
Args:
encoded_inputs:
Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
max_length: maximum length of the returned list and optionally padding length (see below).
Will truncate by taking into account the special tokens.
padding_strategy: PaddingStrategy to use for padding.
- PaddingStrategy.LONGEST Pad to the longest sequence in the batch
- PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
- PaddingStrategy.DO_NOT_PAD: Do not pad
The tokenizer padding sides are defined in self.padding_side:
- 'left': pads on the left of the sequences
- 'right': pads on the right of the sequences
pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
`>= 7.5` (Volta).
return_attention_mask:
(optional) Set to False to avoid returning attention mask (default: set to model specifics)
"""
# Load from model defaults
assert self.padding_side == "left"
required_input = encoded_inputs[self.model_input_names[0]]
seq_length = len(required_input)
if padding_strategy == PaddingStrategy.LONGEST:
max_length = len(required_input)
if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
needs_to_be_padded = padding_strategy != PaddingStrategy.DO_NOT_PAD and len(required_input) != max_length
# Initialize attention mask if not present.
if "attention_mask" not in encoded_inputs:
encoded_inputs["attention_mask"] = [1] * seq_length
if "position_ids" not in encoded_inputs:
encoded_inputs["position_ids"] = list(range(seq_length))
if needs_to_be_padded:
difference = max_length - len(required_input)
if "attention_mask" in encoded_inputs:
encoded_inputs["attention_mask"] = [0] * difference + encoded_inputs["attention_mask"]
if "position_ids" in encoded_inputs:
encoded_inputs["position_ids"] = [0] * difference + encoded_inputs["position_ids"]
encoded_inputs[self.model_input_names[0]] = [self.pad_token_id] * difference + required_input
return encoded_inputs
@property
def default_chat_template(self):
"""
GLM-4 uses [gMASK] and <sop> to indicate user messages. The system message is included as part of the first user
message. The assistant messages do not have special tokens, as they can be identified by their order.
"""
template = (
"{% if messages[0]['role'] == 'system' %}"
"{% set loop_messages = messages[1:] %}" # Extract system message if it's present
"{% set system_message = messages[0]['content'] %}"
"{% elif USE_DEFAULT_PROMPT == true and not '[gMASK]' in messages[0]['content'] %}"
"{% set loop_messages = messages %}" # Or use the default system message if the flag is set
"{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
"{% else %}"
"{% set loop_messages = messages %}"
"{% set system_message = false %}"
"{% endif %}"
"{% for message in loop_messages %}" # Loop over all non-system messages
"{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
"{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
"{% endif %}"
"{% if loop.index0 == 0 and system_message != false %}" # Embed system message in first message
"{% set content = '[gMASK]<sop>' + system_message + '\\n' + message['content'] %}"
"{% else %}"
"{% set content = message['content'] %}"
"{% endif %}"
"{% if message['role'] == 'user' %}" # Handle user messages
"{{ content.strip() }}"
"{% elif message['role'] == 'assistant' %}" # Handle assistant messages
"{{ ' ' + content.strip() + ' ' }}"
"{% endif %}"
"{% endfor %}"
"{% if add_generation_prompt %}{% endif %}"
)
template = template.replace("USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false")
default_message = "你是一个名为 GLM-4 的人工智能助手。你是基于智谱AI训练的语言模型 GLM-4 模型开发的,你的任务是针对用户的问题和要求提供适当的答复和支持。\n\n# 可用工具\n"
default_message += "\n## python\n\n当你向 `python` 发送包含 Python 代码的消息时,该代码将会在一个有状态的 Jupyter notebook 环境中执行。\n"
default_message += "`python` 返回代码执行的输出,或在执行 60 秒后返回超时。\n`/mnt/data` 将会持久化存储你的文件。在此会话中,`python` 无法访问互联网。"
default_message += "不要使用 `python` 进行任何网络请求或者在线 API 调用,这些在线内容的访问将不会成功。"
default_message += "\n## simple_browser\n\n你可以使用 `simple_browser` 工具。该工具支持以下函数:\n"
default_message += "`search(query: str, recency_days: int)`:使用搜索引擎进行查询并显示结果,可以使用 `recency_days` 参数控制搜索内容的时效性。\n"
default_message += "`mclick(ids: list[int])`:获取一系列指定 id 的页面内容。每次调用时,须选择3-10个页面。选择多个角度的页面,同时尽可能选择可信任的信息来源。"
default_message += "考虑到部分页面是无法加载的,你也可以多打开一些可能有用的页面而不用担心内容过多。\n"
default_message += "`open_url(url: str)`:打开指定的 URL。\n"
default_message += "使用 `【{引用 id}†{引用文本}】` 来引用内容。\n\n"
default_message += "操作步骤:1. 使用 `search` 来获得信息列表; 2. 使用 `mclick` 来获取指定 ID 页面的内容; 3. 根据获得的内容进行回复。"
default_message += "在回复中应当引用信息来源。\n 如果用户提供了 URL,也可以用 `open_url` 直接打开页面。"
default_message += "\n如果初次搜索结果没有找到合适的信息,也可以再次使用 `search` 进行搜索。\n"
default_message += "## cogview\n\n如果用户的请求中包含了对图像的描述,你可以使用 `cogview` 来生成图像并展示给用户。"
default_message += "你需要向 `cogview` 发送图像描述,规则:\n- 发送给 `cogview` 的消息必须使用英语。用户的中文描述必须完全翻译为英语。\n"
default_message += "- 应当尽可能详细地描述图像生成的需求,需求描述约 100 英文单词。\n- 保持用户原始描述的意图。不要虚构内容或者没见过的人物。\n"
default_message += "- 如无特殊说明,所在地为中国,持有中国立场并遵循中国社会主义价值观。"
default_message = default_message.replace("\n", "\\n").replace("'", "\\'")
template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)
return template
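Since the corruption only shows up after roughly ten rounds, it may also help to log the raw generated ids each round before decoding them. Below is a minimal sketch; the check_generated_ids helper is an illustrative assumption, not part of the original file, and it relies only on the decoder and index_special_tokens attributes defined above.

def check_generated_ids(tokenizer, ids):
    # Ids that neither the BPE decoder table nor the special-token table can map are
    # decoded to "" by _convert_id_to_token() above, silently corrupting the text.
    unknown = [i for i in ids
               if i not in tokenizer.decoder and i not in tokenizer.index_special_tokens]
    if unknown:
        print("ids outside both tables:", unknown)
    text = tokenizer.decode(ids)
    if "\ufffd" in text:
        # U+FFFD means some UTF-8 byte sequence could not be decoded.
        print("replacement characters in decoded text:", repr(text))
    return text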
GLM-4 and GLM-3 differ in both tokenization scheme and vocabulary size, though.
@zhipuch Let me explain: the tokenization scheme and the vocabulary are both the original GLM-4 ones. I only changed tokenization_chatglm.py and parts of config.json, so that MindIE can run GLM-4 through the interface it already provides for ChatGLM3. What puzzles me is that the first dozen or so replies are normal, then the output suddenly turns garbled, and every answer after that stays garbled. I don't understand the cause.
So you are running chatglm3-6b, with the tokenizer part switched over to the GLM-4 scheme, is that right?
No. The model and almost all the files are GLM-4's. Only two files were changed: config.json (the model name in it was changed) and tokenization_chatglm.py (with some ChatGLM3 pieces appended, such as build_chat_input). Everything else is GLM-4's.
Why can't the glm-4 config.json be used as-is? What name did you change in it? Also, please paste the garbled output here.
@zRzRzRzRzRzRzR
Because Huawei's inference framework does not yet support the GLM-4 model, only ChatGLM3. The garbled output is as follows:
We are working on adapting it.