Langchain-Chatchat
data did not match any variant of untagged enum PyPreTokenizerTypeWrapper
Problem Description

The image build succeeds, but the container fails to start. Could you help me see where the problem is? I haven't been able to locate it, or at least tell me which module is reporting this error.
Steps to Reproduce

==============================Langchain-Chatchat Configuration==============================
OS: Linux-5.15.0-76-generic-x86_64-with-glibc2.29
Python version: 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0]
Project version: v0.2.10
langchain version: 0.0.344
fastchat version: 0.2.36

Text splitter in use: ChineseRecursiveTextSplitter
LLM models started: ['CodeQwen1.5-7B-Chat', 'openai-api'] @ cuda
{'device': 'cuda', 'host': '0.0.0.0', 'infer_turbo': False, 'model_path': '/opt/models/CodeQwen1.5-7B-Chat', 'model_path_exists': True, 'port': 20002}
{'api_base_url': 'https://api.openai.com/v1', 'api_key': '', 'device': 'auto', 'host': '0.0.0.0', 'infer_turbo': False, 'model_name': 'gpt-3.5-turbo', 'online_api': True, 'openai_proxy': '', 'port': 20002}
Embedding model in use: bge-large-en-v1.5 @ cuda
==============================Langchain-Chatchat Configuration==============================

2024-04-28 06:29:43,432 - startup.py[line:650] - INFO: Starting services:
2024-04-28 06:29:43,433 - startup.py[line:651] - INFO: To view the llm_api logs, go to /opt/Langchain-ChatChat/logs
2024-04-28 06:29:48 | ERROR | stderr | INFO:     Started server process [475]
2024-04-28 06:29:48 | ERROR | stderr | INFO:     Waiting for application startup.
2024-04-28 06:29:48 | ERROR | stderr | INFO:     Application startup complete.
2024-04-28 06:29:48 | ERROR | stderr | INFO:     Uvicorn running on http://0.0.0.0:20000 (Press CTRL+C to quit)
2024-04-28 06:29:48 | INFO | model_worker | Loading the model ['CodeQwen1.5-7B-Chat'] on worker 131939df ...
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.16it/s]
2024-04-28 06:29:54 | ERROR | stderr | Process model_worker - CodeQwen1.5-7B-Chat:
2024-04-28 06:29:54 | ERROR | stderr | Traceback (most recent call last):
2024-04-28 06:29:54 | ERROR | stderr |   File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
2024-04-28 06:29:54 | ERROR | stderr |     self.run()
2024-04-28 06:29:54 | ERROR | stderr |   File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
2024-04-28 06:29:54 | ERROR | stderr |     self._target(*self._args, **self._kwargs)
2024-04-28 06:29:54 | ERROR | stderr |   File "/opt/Langchain-ChatChat/startup.py", line 386, in run_model_worker
2024-04-28 06:29:54 | ERROR | stderr |     app = create_model_worker_app(log_level=log_level, **kwargs)
2024-04-28 06:29:54 | ERROR | stderr |   File "/opt/Langchain-ChatChat/startup.py", line 214, in create_model_worker_app
2024-04-28 06:29:54 | ERROR | stderr |     worker = ModelWorker(
2024-04-28 06:29:54 | ERROR | stderr |   File "/usr/local/lib/python3.8/dist-packages/fastchat/serve/model_worker.py", line 77, in __init__
2024-04-28 06:29:54 | ERROR | stderr |     self.model, self.tokenizer = load_model(
2024-04-28 06:29:54 | ERROR | stderr |   File "/usr/local/lib/python3.8/dist-packages/fastchat/model/model_adapter.py", line 353, in load_model
2024-04-28 06:29:54 | ERROR | stderr |     model, tokenizer = adapter.load_model(model_path, kwargs)
2024-04-28 06:29:54 | ERROR | stderr |   File "/usr/local/lib/python3.8/dist-packages/fastchat/model/model_adapter.py", line 1706, in load_model
2024-04-28 06:29:54 | ERROR | stderr |     tokenizer = AutoTokenizer.from_pretrained(
2024-04-28 06:29:54 | ERROR | stderr |   File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/tokenization_auto.py", line 862, in from_pretrained
2024-04-28 06:29:54 | ERROR | stderr |     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
2024-04-28 06:29:54 | ERROR | stderr |   File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
2024-04-28 06:29:54 | ERROR | stderr |     return cls._from_pretrained(
2024-04-28 06:29:54 | ERROR | stderr |   File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
2024-04-28 06:29:54 | ERROR | stderr |     tokenizer = cls(*init_inputs, **init_kwargs)
2024-04-28 06:29:54 | ERROR | stderr |   File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
2024-04-28 06:29:54 | ERROR | stderr |     fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
2024-04-28 06:29:54 | ERROR | stderr | Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 12564 column 3
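To check whether the failure involves fastchat or Langchain-Chatchat at all, the tokenizer can be loaded directly; a minimal sketch (the model path comes from the worker config above; `trust_remote_code=True` is an assumption that is harmless if the model doesn't need it):

```python
# Minimal isolation test: if this raises the same "untagged enum
# PyPreTokenizerTypeWrapper" exception, the problem is the tokenizer.json /
# tokenizers-version pairing, not Langchain-Chatchat or fastchat.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/opt/models/CodeQwen1.5-7B-Chat",  # same path as in the model worker config
    trust_remote_code=True,
)
print(type(tokenizer).__name__)
```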
Environment Information
- langchain-ChatGLM version / commit: v0.2.10 (langchain 0.0.344, fastchat 0.2.36; see the configuration dump above)
- Docker deployment used (yes/no): yes
- Model used (ChatGLM2-6B / Qwen-7B, etc.): CodeQwen1.5-7B-Chat
- Embedding model used (moka-ai/m3e-base, etc.): bge-large-en-v1.5
- Vector store type (faiss / milvus / pg_vector, etc.): milvus
- Operating system and version: Linux-5.15.0-76-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Other relevant environment information: NVIDIA A30 GPU
Additional Information
The error "data did not match any variant of untagged enum PyPreTokenizerTypeWrapper" you're encountering is likely due to a mismatch or misconfiguration in the tokenizer setup within your Docker-deployed project. This issue can arise from several factors, including Python version discrepancies, compatibility issues between Langchain, Fastchat, and the tokenizer, or incorrect tokenizer configuration.
- Check Python Version Compatibility: Your environment reports Python 3.8.10; make sure that version is actually compatible with the Langchain and Fastchat versions you're using. If your Dockerfile specifies a different Python version (e.g., 3.11), adjust it to match the version your project dependencies require.
- Review Dependency Versions: Ensure that Langchain (0.0.344) and Fastchat (0.2.36) are compatible with each other, with the CodeQwen1.5-7B-Chat model, and with the specific tokenizer you're using (see the version-check sketch after this list).
- Tokenizer Configuration: The traceback points at the tokenizer's initialization. Double-check the tokenizer configuration, in particular how `PyPreTokenizerTypeWrapper` is expected to be set up for your versions of Langchain and Fastchat, and make sure the configuration matches the expected format.
- Update and Consult Documentation: If the issue persists, consider updating your dependencies to the latest compatible versions, and consult the documentation for Langchain, Fastchat, and the tokenizer for any specific setup or compatibility notes. Community forums and GitHub issues for these projects may also have reports from users who hit the same problem.
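As a concrete way to do the version review above, the installed versions can be printed from inside the container; a minimal sketch using only the standard library (note that fastchat's PyPI distribution name is `fschat`):

```python
# Print the versions of the packages most likely involved in this error.
# importlib.metadata is in the standard library on Python 3.8+.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("langchain", "fschat", "transformers", "tokenizers", "accelerate"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```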
For the ChineseRecursiveTextSplitter tokenizer, ensure it is configured with "huggingface" as its source and, if necessary, the correct `tokenizer_name_or_path`. This setup is crucial for the splitter to function correctly within your project.
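For reference, in Langchain-Chatchat v0.2.x that splitter is configured via text_splitter_dict in configs/kb_config.py.example; a rough sketch of the relevant entry (treat the exact keys and defaults as illustrative of that file's layout, not authoritative):

```python
# Sketch of the splitter entry in configs/kb_config.py.example (v0.2.x layout assumed).
text_splitter_dict = {
    "ChineseRecursiveTextSplitter": {
        "source": "huggingface",       # measure chunk length with a HuggingFace tokenizer
        "tokenizer_name_or_path": "",  # empty: fall back to the running LLM's tokenizer
    },
}

# The splitter selected at runtime.
TEXT_SPLITTER_NAME = "ChineseRecursiveTextSplitter"
```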
Rebuild your Docker image after making any changes to ensure they are applied. If you need further assistance or have more questions, feel free to ask.
Sources
- Dockerfile
- configs/kb_config.py.example
- knowledge_base/samples/content/test_files/langchain-ChatGLM_open.jsonl
- requirements.txt
- text_splitter/chinese_recursive_text_splitter.py
- knowledge_base/samples/content/test_files/langchain-ChatGLM_closed.jsonl
This may be CodeQwen being incompatible with newer versions of `tokenizers`. Which versions of `transformers` and `tokenizers` are you using?

See here: Qwen/CodeQwen1.5-7B-Chat · Having trouble loading this with transformers
> This may be CodeQwen being incompatible with newer versions of `tokenizers`. Which versions of `transformers` and `tokenizers` are you using? See here: Qwen/CodeQwen1.5-7B-Chat · Having trouble loading this with transformers

Same as in that issue; I'll try other versions. Currently: tokenizers 0.19.1, transformers 4.40.1.
Do you think this could be a compatibility issue between the LLM model and `transformers`/`tokenizers`? I also can't tell whether the embedding model bge-large-en-v1.5 is involved; in any case, all of BAAI's bge-large models give me the same error.
This issue was resolved. I downgraded the `transformers` and `tokenizers` versions. The working combination is: tokenizers 0.15.2, transformers 4.38.2, accelerate 0.25.0.
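For anyone reproducing this fix, one way is to pin that combination in requirements.txt before rebuilding the image (the version numbers come from the comment above; whether all three pins are strictly required is untested):

```
tokenizers==0.15.2
transformers==4.38.2
accelerate==0.25.0
```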
https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat/commit/91ffe86a74d00f76a75371d58a70ae5fe1bc0f29
This commit should have fixed the issue.
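Assuming that commit's fix is contained in the model's tokenizer.json, an alternative to downgrading is to refresh the local copy of that file; a minimal sketch with huggingface_hub (the revision hash is the commit linked above, and the local path is taken from the log earlier in this thread):

```python
# Download the repaired tokenizer.json at the commit referenced above, then
# copy it over the stale one in the local model directory.
import shutil
from huggingface_hub import hf_hub_download

fixed = hf_hub_download(
    repo_id="Qwen/CodeQwen1.5-7B-Chat",
    filename="tokenizer.json",
    revision="91ffe86a74d00f76a75371d58a70ae5fe1bc0f29",  # commit from the link above
)
shutil.copy(fixed, "/opt/models/CodeQwen1.5-7B-Chat/tokenizer.json")  # path from the log
```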