
[BUG] Without network access the model tokenizer fails to load: SimSun.ttf in the model directory is not read

Open Lanture1064 opened this issue 11 months ago • 15 comments

Is there an existing issue / discussion for this?

  • [X] I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

  • [X] I have searched FAQ

Current Behavior

Without network access, loading the tokenizer with AutoTokenizer as described in Tutorial.md raises: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='qianwen-res.oss-cn-beijing.aliyuncs.com', port=443): Max retries exceeded with url: /Qwen-VL/assets/SimSun.ttf (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2568cbd610>: Failed to establish a new connection: [Errno -2] Name or service not known')) As a result the tokenizer cannot be loaded, which makes the model unusable.

Expected Behavior

The model should run normally even without network access. AutoTokenizer should first check whether SimSun.ttf exists in the model directory.

Steps To Reproduce

Without network access, start the Python 3.8 CLI and run the code from Tutorial.md up to tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True), replacing "Qwen/Qwen-VL-Chat" with the local model directory. The error is raised at that call.

Environment

- OS: Ubuntu 20.04
- Python: 3.8.10
- Transformers: 4.37.2
- PyTorch: 2.2.0
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1

Anything else?

The error shows that SimSun.ttf cannot be loaded from the network. The file exists in the model directory of the ModelScope version of Qwen-VL, but tokenization_qwen.py does not pick it up. The relevant source (lines 29~35) is:

VOCAB_FILES_NAMES = {"vocab_file": "qwen.tiktoken", "ttf": "SimSun.ttf"}
FONT_PATH = try_to_load_from_cache("Qwen/Qwen-VL-Chat", "SimSun.ttf")   # this should be changed
if FONT_PATH is None:
    if not os.path.exists("SimSun.ttf"):
        ttf = requests.get("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/SimSun.ttf")
        open("SimSun.ttf", "wb").write(ttf.content)
    FONT_PATH = "SimSun.ttf"

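The source appears to require the ttf file to be read from the HF cache folder, falling back to a network download if it is not there. With no network and a model loaded from a non-cache directory, the file clearly cannot be found.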
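The expected behavior described above could be sketched roughly as follows. This is not the upstream implementation: `resolve_font_path`, its parameters, and the `download` callback are hypothetical names used only for illustration. `cached_path` stands for whatever the cache lookup returned.

```python
import os
from typing import Callable


def resolve_font_path(
    model_dir: str,
    cached_path: object,
    download: Callable[[str], None],
    font_name: str = "SimSun.ttf",
) -> str:
    """Return a usable font path, touching the network only as a last resort."""
    # 1. Prefer the copy shipped alongside the model files.
    local = os.path.join(model_dir, font_name)
    if os.path.exists(local):
        return local
    # 2. Fall back to the cache lookup result, if it is a real path on disk.
    if isinstance(cached_path, str) and os.path.exists(cached_path):
        return cached_path
    # 3. Reuse a copy in the current working directory, as the quoted code does,
    #    and download only when every local lookup has failed.
    if not os.path.exists(font_name):
        download(font_name)
    return font_name
```

With the font present in the model directory, the `download` callback is never invoked, so loading would not depend on network access.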

Lanture1064 avatar Mar 04 '24 07:03 Lanture1064

Same here. As a workaround, you can copy the SimSun.ttf file into your root path (the directory you run Python from).
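The quoted source checks os.path.exists("SimSun.ttf") with a relative path, so the font is presumably looked up in the current working directory. A minimal sketch of the copy, assuming the font sits in the local model directory; `copy_font_to_cwd` is a hypothetical helper name:

```python
import os
import shutil


def copy_font_to_cwd(model_dir: str, font_name: str = "SimSun.ttf") -> str:
    """Copy the font from the model directory into the current working directory."""
    src = os.path.join(model_dir, font_name)
    dst = os.path.join(os.getcwd(), font_name)
    # Leave an existing copy untouched.
    if not os.path.exists(dst):
        shutil.copy(src, dst)
    return dst
```

This would need to run before the tokenizer is loaded, from the same directory the script is launched in.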

LeslieWongCV avatar Mar 14 '24 08:03 LeslieWongCV

I don't know how the from_pretrained method moves the required files from local_model_path to hg_cache_folder. It appears that SimSun.ttf isn't moved to hg_cache_folder even if it already exists in the local model repo.

So I took the following steps to make things work:

  1. Add the following code before any AutoModelForCausalLM.from_pretrained() call:
# Make sure SimSun.ttf exists next to the cached copy of the remote code.
import os
import shutil

from transformers.utils.hub import (
    HF_MODULES_CACHE,
    TRANSFORMERS_DYNAMIC_MODULE_NAME,
)

# model_args.model_name_or_path is the local model directory (defined by the caller).
qwen_vl_submodule = os.path.join(HF_MODULES_CACHE, TRANSFORMERS_DYNAMIC_MODULE_NAME,
                                 os.path.basename(model_args.model_name_or_path))

# Create qwen_vl_submodule if it does not exist yet.
os.makedirs(qwen_vl_submodule, exist_ok=True)
hack_simsun = os.path.join(qwen_vl_submodule, "SimSun.ttf")

print("Check if SimSun.ttf exists")
print(os.listdir(qwen_vl_submodule))
if not os.path.exists(hack_simsun):
    print("SimSun.ttf not found, copying from pretrained model")
    pretrained_hack_simsun = os.path.join(model_args.model_name_or_path, "SimSun.ttf")
    shutil.copy(pretrained_hack_simsun, hack_simsun)
  2. Modify tokenization_qwen.py in the local Qwen-VL model repo (this file should also be in the remote model repo). Change the logic of `FONT_PATH`:
FONT_PATH = try_to_load_from_cache("Qwen/Qwen-VL-Chat", "SimSun.ttf")
# Debug output: working directory and location of this file.
print(f"my_test {os.getcwd()}")
print(f"my file test {os.path.abspath(__file__)}")
if FONT_PATH is None:
    # Look for the font next to this file (the cache folder).
    cache_simsun = os.path.join(os.path.dirname(os.path.abspath(__file__)), "SimSun.ttf")
    if not os.path.exists(cache_simsun):
        ttf = requests.get("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/SimSun.ttf")
        # Write to the same path that FONT_PATH points at.
        with open(cache_simsun, "wb") as f:
            f.write(ttf.content)
    FONT_PATH = cache_simsun

Brickea avatar Apr 09 '24 10:04 Brickea

Just comment out lines 29~35, don't use try_to_load_from_cache, and set FONT_PATH by hand:

FONT_PATH = "xxx/Qwen-VL-Chat/SimSun.ttf"
# if FONT_PATH is None:
#    if not os.path.exists("SimSun.ttf"):
#        ttf = requests.get("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/SimSun.ttf")
#        open("SimSun.ttf", "wb").write(ttf.content)
#    FONT_PATH = "SimSun.ttf"

If it still fails, adjust FONT_PATH, using either an absolute or a relative path.
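Because a relative FONT_PATH depends on where Python is started, it may help to resolve it to an absolute path and fail early with a clear message. A small sketch; `checked_font_path` is a hypothetical helper, not part of tokenization_qwen.py:

```python
import os


def checked_font_path(path: str) -> str:
    """Resolve a font path to an absolute path, raising if the file is missing."""
    abs_path = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(abs_path):
        raise FileNotFoundError(f"Font file not found: {abs_path}")
    return abs_path
```

It could then be used as FONT_PATH = checked_font_path("xxx/Qwen-VL-Chat/SimSun.ttf"), with the placeholder path replaced by the real model directory.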

digitalbottle avatar Apr 09 '24 12:04 digitalbottle

Hello, it seems this tokenization_qwen.py file is regenerated automatically every time I run the code, so my edits have no effect.

humphreyde avatar Jul 25 '24 03:07 humphreyde

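tokenization_qwen.py gets reset automatically? That may be caused by the trust_remote_code=True setting in tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True). Save a copy of your modified file as tokenization_qwen.py.bak (so you don't have to redo the edits if it gets overwritten), then try again with False?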

digitalbottle avatar Jul 25 '24 06:07 digitalbottle

You can compare it with tokenization_qwen.py in qwen-vl-chat-int4. The int4 version does not contain the lines 29-35 you mention, so my suggestion is to simply comment them all out?

thu-yn avatar Jul 25 '24 07:07 thu-yn

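Right. In short, on a server that cannot reach the external network, just avoid the requests call. The font is presumably used to draw label text on the boxed image regions, and since the model directory already contains the font file there is little point in fetching it from the internet.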

digitalbottle avatar Jul 25 '24 07:07 digitalbottle

After changing it to False, it raised an error: Tokenizer class QWenTokenizer does not exist or is not currently import. It seems that only True gets past this statement.

humphreyde avatar Jul 25 '24 10:07 humphreyde

I'm using the int4 version, running on Windows, with the following code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# To make the results reproducible, set the random seed.
# torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("D:\\xxx\\Qwen-VL-Chat-int4", trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained("D:\\xxx\\Qwen-VL-Chat-int4", device_map="cuda", trust_remote_code=True).eval()

query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Rebecca_(1939_poster).jpeg'},
    {'text': 'What is the name of the movie in the poster?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

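It seems that on every run it generates a tokenization_qwen.py file under my C:\Users\xxx\.cache\huggingface\modules\transformers_modules\Qwen-vl-chat-int4 directory, and lines 29-35 of that file contain the code above. Since the file is regenerated on every run, my edits have no effect. Where can I change how it is generated, or how else can this be solved? Thanks.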

humphreyde avatar Jul 25 '24 10:07 humphreyde

I don't know where it generates this script and then makes the request to the external network.

humphreyde avatar Jul 25 '24 10:07 humphreyde

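Hmm, then loading still failed, because the QWenTokenizer class is defined in the tokenization_qwen.py you modified. Qwen should certainly support loading a local tokenizer. Is everything present in your local model directory: qwen.tiktoken, tokenization_qwen.py, tokenizer_config.json and so on? This is getting a bit mysterious Orz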
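A quick way to check this could look like the sketch below. The file list only covers the names mentioned in this thread and is not an exhaustive list of what the model needs; `missing_files` is a hypothetical helper name.

```python
import os
from typing import Iterable, List

# Files named in this thread; an assumption, not a complete manifest.
TOKENIZER_FILES = (
    "qwen.tiktoken",
    "tokenization_qwen.py",
    "tokenizer_config.json",
    "SimSun.ttf",
)


def missing_files(model_dir: str, required: Iterable[str] = TOKENIZER_FILES) -> List[str]:
    """Return the names from `required` that are absent from `model_dir`."""
    return [name for name in required
            if not os.path.isfile(os.path.join(model_dir, name))]
```

An empty list means all of the listed files are present in the directory passed to from_pretrained.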

digitalbottle avatar Jul 25 '24 10:07 digitalbottle

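It works now! You mentioned the tokenization_qwen.py under the model directory. I hadn't realized it was the one in that folder and had been editing the copy on the C: drive all along. Sorry about that, and many thanks!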
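The behavior reported in this thread suggests that, with trust_remote_code=True, the tokenization_qwen.py in the model directory is copied into a per-model folder in the dynamic-module cache on each load, which would explain why edits to the cached copy are overwritten. A sketch of where that folder sits, assuming the layout used in the workaround earlier in the thread (cache root, then transformers_modules, then the base name of the model path); `dynamic_module_dir` is a hypothetical helper:

```python
import os


def dynamic_module_dir(
    model_path: str,
    modules_cache: str,
    module_name: str = "transformers_modules",
) -> str:
    """Build the cache folder path that mirrors a local model directory."""
    # normpath drops a trailing separator so basename is not empty.
    submodule = os.path.basename(os.path.normpath(model_path))
    return os.path.join(modules_cache, module_name, submodule)
```

In practice `modules_cache` would be transformers.utils.hub.HF_MODULES_CACHE, the constant imported in the workaround above. Edits belong in the model directory, not in the folder this function returns.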

humphreyde avatar Jul 26 '24 00:07 humphreyde

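Hi, I ran into the same problem. I tried editing transformers/models/qwen2/tokenization_qwen.py, but it doesn't contain the code that appears in the .cache copy? Nothing in it seems to involve the ttf file. Which file did you modify?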

another1s avatar Jul 26 '24 02:07 another1s

It's the tokenization_qwen.py inside the model files you downloaded, lines 29-35. See above.
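To confirm you have the right file, one option is to scan it for the font-loading calls quoted at the top of this issue. A minimal sketch; `find_font_loading_lines` and the default search strings are assumptions chosen to match that snippet:

```python
from typing import List, Sequence, Tuple


def find_font_loading_lines(
    path: str,
    needles: Sequence[str] = ("try_to_load_from_cache", "requests.get"),
) -> List[Tuple[int, str]]:
    """Return (line number, text) for lines containing any of `needles`."""
    hits = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if any(needle in line for needle in needles):
                hits.append((lineno, line.rstrip()))
    return hits
```

If this returns nothing for a given tokenization_qwen.py, that file is probably not the one that downloads the font.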

humphreyde avatar Jul 26 '24 03:07 humphreyde

Oh, I see now. Thanks!

another1s avatar Jul 26 '24 05:07 another1s