Qwen-VL [BUG] Tokenizer fails to load in an offline environment: the SimSun.ttf file in the model directory cannot be read

Is there an existing issue / discussion for this?

[X] I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

[X] I have searched FAQ

Current Behavior

In an offline environment, loading the tokenizer with AutoTokenizer as described in Tutorial.md fails with: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='qianwen-res.oss-cn-beijing.aliyuncs.com', port=443): Max retries exceeded with url: /Qwen-VL/assets/SimSun.ttf (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2568cbd610>: Failed to establish a new connection: [Errno -2] Name or service not known')) This prevents the tokenizer from loading, which in turn makes the model unusable.

Expected Behavior

The model should run normally even without network access; AutoTokenizer should first check whether SimSun.ttf exists in the model directory.

Steps To Reproduce

In an offline environment, open a Python 3.8 CLI and run the code from Tutorial.md. At the line tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True), replace "Qwen/Qwen-VL-Chat" with the local model directory; the error is raised.

Environment

- OS: Ubuntu 20.04
- Python: 3.8.10
- Transformers: 4.37.2
- PyTorch: 2.2.0
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1

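Anything else?

The error shows that SimSun.ttf cannot be loaded from the network. The file exists in the model directory of the ModelScope version of Qwen-VL, but tokenization_qwen.py does not pick it up. The relevant source (lines 29-35) is: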

VOCAB_FILES_NAMES = {"vocab_file": "qwen.tiktoken", "ttf": "SimSun.ttf"}
FONT_PATH = try_to_load_from_cache("Qwen/Qwen-VL-Chat", "SimSun.ttf")   # this should be changed
if FONT_PATH is None:
    if not os.path.exists("SimSun.ttf"):
        ttf = requests.get("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/SimSun.ttf")
        open("SimSun.ttf", "wb").write(ttf.content)
    FONT_PATH = "SimSun.ttf"

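The source appears to require that the ttf file be read from the HF cache folder and, if it is not there (and no SimSun.ttf is in the current working directory), fetched from the network. With no network and a model loaded from a directory outside the cache, the file obviously cannot be found.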
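
For illustration, the local-first lookup requested under Expected Behavior could be sketched roughly as follows. This is a minimal sketch, not the repository's code: `find_local_font` and the directory "path/to/Qwen-VL-Chat" are hypothetical names introduced here, and the sketch only covers the lookup step, not the download fallback.

```python
import os
from typing import Iterable, Optional


def find_local_font(candidate_dirs: Iterable[str], filename: str = "SimSun.ttf") -> Optional[str]:
    """Return the absolute path of the first `filename` found in `candidate_dirs`, else None."""
    for directory in candidate_dirs:
        path = os.path.join(directory, filename)
        if os.path.isfile(path):
            return os.path.abspath(path)
    return None


# Hypothetical usage: check the local model directory first, then the working directory.
# "path/to/Qwen-VL-Chat" is a placeholder, not a real location.
font_path = find_local_font(["path/to/Qwen-VL-Chat", os.getcwd()])
if font_path is None:
    print("SimSun.ttf not found locally; a download would be needed at this point")
else:
    print(f"Using local font: {font_path}")
```

Only when every local candidate fails would a network request be attempted, which keeps offline runs working as long as the font ships with the model files.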

Mar 04 '24 07:03 Lanture1064

Same here. You could copy the SimSun.ttf file to your root path.

Mar 14 '24 08:03 LeslieWongCV

I don't know how the from_pretrained method moves the required files from local_model_path to the HF cache folder; it appears that SimSun.ttf wasn't moved there even though it already exists in the local model repo.

So I did the following to make things work:

Add the following code before any AutoModelForCausalLM.from_pretrained() call:

import os
import shutil

from transformers.utils.hub import (
    HF_MODULES_CACHE,
    TRANSFORMERS_DYNAMIC_MODULE_NAME,
)

# model_args.model_name_or_path is the local model directory (defined elsewhere in the script).
# Directory where transformers caches this model's remote-code modules.
qwen_vl_submodule = os.path.join(HF_MODULES_CACHE, TRANSFORMERS_DYNAMIC_MODULE_NAME,
                                 os.path.basename(model_args.model_name_or_path))

# Create qwen_vl_submodule if it does not exist yet.
os.makedirs(qwen_vl_submodule, exist_ok=True)
hack_simsun = os.path.join(qwen_vl_submodule, "SimSun.ttf")

# Copy SimSun.ttf from the local model directory if the cached copy is missing.
print("Check if SimSun.ttf exists")
print(os.listdir(qwen_vl_submodule))
if not os.path.exists(hack_simsun):
    print("SimSun.ttf not found, copying from pretrained model")
    pretrained_hack_simsun = os.path.join(model_args.model_name_or_path, "SimSun.ttf")
    shutil.copy(pretrained_hack_simsun, hack_simsun)

Modify tokenization_qwen.py in the local Qwen-VL model repo (this file should also be in the remote model repo). Change the logic of `FONT_PATH`:

FONT_PATH = try_to_load_from_cache("Qwen/Qwen-VL-Chat", "SimSun.ttf")
# Debug output: current working directory and location of this file.
print(f"my_test {os.getcwd()}")
print(f"my file test {os.path.abspath(__file__)}")
if FONT_PATH is None:
    # Look for SimSun.ttf next to this file (the cached module folder).
    cache_simsun = os.path.join(os.path.dirname(os.path.abspath(__file__)), "SimSun.ttf")
    if not os.path.exists(cache_simsun):
        # Fall back to downloading, saving to the same path that FONT_PATH points to.
        ttf = requests.get("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/SimSun.ttf")
        with open(cache_simsun, "wb") as f:
            f.write(ttf.content)
    FONT_PATH = cache_simsun

Apr 09 '24 10:04 Brickea

Just comment out lines 29-35 and don't use try_to_load_from_cache; set FONT_PATH by hand:

FONT_PATH = "xxx/Qwen-VL-Chat/SimSun.ttf"
# if FONT_PATH is None:
#    if not os.path.exists("SimSun.ttf"):
#        ttf = requests.get("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/SimSun.ttf")
#        open("SimSun.ttf", "wb").write(ttf.content)
#    FONT_PATH = "SimSun.ttf"

If it still fails, adjust FONT_PATH, using an absolute or a relative path as needed.

Apr 09 '24 12:04 digitalbottle

Hi, it seems this tokenization_qwen.py file is regenerated automatically every time I run, so my changes have no effect.

Jul 25 '24 03:07 humphreyde

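tokenization_qwen.py gets reset automatically? That may be caused by trust_remote_code=True in tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True). Save a copy of your modified file as tokenization_qwen.py.bak (so you don't have to redo the edit if it gets overwritten), then try with False?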

Jul 25 '24 06:07 digitalbottle

You can compare with the tokenization_qwen.py in qwen-vl-chat-int4: the int4 version does not contain the lines 29-35 you mention, so my suggestion is to simply comment them all out?

Jul 25 '24 07:07 thu-yn

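Yeah, in short, on a server that can't reach the external network, just avoid the requests call. This font is presumably used to write label text on the boxed image regions, and since the model directory already contains the font file there is no need to request it from the network.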

Jul 25 '24 07:07 digitalbottle

After changing trust_remote_code to False, it raised an error: Tokenizer class QWenTokenizer does not exist or is not currently import. It seems this statement only succeeds with True.

Jul 25 '24 10:07 humphreyde

I'm using the int4 version on Windows, and the code is:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# If you want reproducible results, you can set a random seed.
# torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("D:\\xxx\\Qwen-VL-Chat-int4", trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained("D:\\xxx\\Qwen-VL-Chat-int4", device_map="cuda", trust_remote_code=True).eval()

query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Rebecca_(1939_poster).jpeg'},
    {'text': 'What is the name of the movie in the poster?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

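It seems that on every run it generates a tokenization_qwen.py file under my C\\User\\xxx\\.cache\\huggingface\\modules\\transformer_modules\\Qwen-vl-chat-int4 directory, and lines 29-35 of that file contain the code above. It is regenerated on every run, so my changes have no effect. Where can I change how it is generated, or how else can I solve this? Thanks.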

Jul 25 '24 10:07 humphreyde

I don't know where it generates this script and then goes on to request the external network.

Jul 25 '24 10:07 humphreyde

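Hmm, then loading still failed, because the QWenTokenizer class is defined in the very tokenization_qwen.py you modified. Qwen should certainly support loading a local tokenizer. Is everything present in the local model directory: qwen.tiktoken, tokenization_qwen.py, tokenizer_config.json and so on? This is getting a bit mysterious Orz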
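
A quick way to answer that question is to list which of the expected files are missing from the local model directory. The following is a minimal sketch: `missing_files` is a hypothetical helper, the file list only contains the names mentioned in this thread (other checkpoints may need more), and "path/to/Qwen-VL-Chat" is a placeholder.

```python
import os
from typing import List, Sequence

# File names mentioned in this thread; this is not an exhaustive list.
REQUIRED_FILES = ("qwen.tiktoken", "tokenization_qwen.py", "tokenizer_config.json", "SimSun.ttf")


def missing_files(model_dir: str, required: Sequence[str] = REQUIRED_FILES) -> List[str]:
    """Return the names from `required` that are not regular files in `model_dir`."""
    return [name for name in required if not os.path.isfile(os.path.join(model_dir, name))]


# Hypothetical usage with a placeholder directory.
missing = missing_files("path/to/Qwen-VL-Chat")
if missing:
    print("Missing from the model directory:", ", ".join(missing))
else:
    print("All expected files are present")
```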

Jul 25 '24 10:07 digitalbottle

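It works! You mentioned the tokenization_qwen.py in the model directory; I hadn't realized it was the one in that folder and had been editing the one on the C: drive all along. Sorry about that, and thanks a lot!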

Jul 26 '24 00:07 humphreyde

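Hi, I ran into the same problem. I tried to modify transformers/models/qwen2/tokenization_qwen.py, but it doesn't contain the code found in the .cache copy; nothing in it seems to involve the ttf. Which file did you modify?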

Jul 26 '24 02:07 another1s

It's the tokenization_qwen.py inside the model files you downloaded, lines 29-35; see above.
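
As reported earlier in this thread, the copy under the .cache modules directory was regenerated on each run, while edits to the copy in the downloaded model directory took effect. To see whether the two copies have drifted apart (for example because the edit went into the wrong one), they can be compared directly. This is a minimal sketch: `copies_differ` is a hypothetical helper and both directories below are placeholders.

```python
import filecmp
import os


def copies_differ(model_dir: str, cached_module_dir: str,
                  filename: str = "tokenization_qwen.py") -> bool:
    """Return True if the two copies of `filename` differ or either one is missing."""
    source = os.path.join(model_dir, filename)
    cached = os.path.join(cached_module_dir, filename)
    if not (os.path.isfile(source) and os.path.isfile(cached)):
        return True
    # shallow=False compares file contents rather than only size and mtime.
    return not filecmp.cmp(source, cached, shallow=False)


# Hypothetical usage with placeholder directories.
if copies_differ("path/to/Qwen-VL-Chat", "path/to/cached/module/dir"):
    print("The copies differ (or one is missing); edit the one in the model directory")
else:
    print("Both copies are identical")
```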

Jul 26 '24 03:07 humphreyde

Oh, I see, thanks!

Jul 26 '24 05:07 another1s
