CogVLM2: the cogvlm2-llama3-chinese-chat-19B model fails at runtime when loaded in 4-bit on Windows 11 x64.
System Info
Environment: Windows 11 x64, Python 3.11.9, CUDA 12.1. Torch/torchvision/xformers/transformers/chainlit and the other key dependencies were installed exactly per the official requirements.txt. Based on runtime prompts, I later added: einops-0.8.0, triton-2.1.0, accelerate-0.30.1, psutil-5.9.8. System environment variables: CUDA_HOME=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 and CUDA_VISIBLE_DEVICES=0.
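(For completeness, a quick sanity check of this environment from Python; this is a sketch I am adding, not part of the original setup steps.)

```python
import torch

# Confirm the torch build, its CUDA version, and the GPU compute capability.
# The demo picks bfloat16 only when the capability major version is >= 8.
print(torch.__version__, torch.version.cuda)
print(torch.cuda.is_available())
print(torch.cuda.get_device_capability(0))
```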
To load the model with 4-bit quantization, I changed the model-loading parameters in web_demo.py. The original script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=TORCH_TYPE, trust_remote_code=True).to(DEVICE).eval()
```
The modified script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer, BitsAndBytesConfig

fp4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.float32,
)

MODEL_PATH = "checkpoints/cogvlm2-llama3-chinese-chat-19B"
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float32
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
    quantization_config=fp4_config,
    device_map="auto",
).eval()
```
Nothing else was changed.
Command executed: `chainlit run web_demo_me.py`
```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The load_in_4bit and load_in_8bit arguments are deprecated and will be removed in the future versions. Please, pass a BitsAndBytesConfig object in quantization_config argument instead.
2024-05-22 13:44:03 - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set max_memory in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 8/8 [00:55<00:00, 6.90s/it]
2024-05-22 13:45:00 - Your app is available at http://localhost:8000
2024-05-22 13:45:02 - Translation file for zh-CN not found. Using default translation en-US.
2024-05-22 13:45:02 - Translated markdown file for zh-CN not found. Defaulting to chainlit.md.
2024-05-22 13:45:15 - Translation file for zh-CN not found. Using default translation en-US.
2024-05-22 13:45:15 - Translation file for zh-CN not found. Using default translation en-US.
2024-05-22 13:45:15 - Translated markdown file for zh-CN not found. Defaulting to chainlit.md.
main.c
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin/../include\cuda.h(20247): warning C4819: The file contains a character that cannot be represented in the current code page (936). Save the file in Unicode format to prevent data loss
C:\Users\ADMINI~1\AppData\Local\Temp\tmptho3klk7\main.c(10): fatal error C1083: Cannot open include file: 'Python.h': No such file or directory
Exception in thread Thread-2 (generate):
Traceback (most recent call last):
File "D:\AITest\CogVLM2\Python311\Lib\threading.py", line 1045, in _bootstrap_inner
self.run()
File "D:\AITest\CogVLM2\Python311\Lib\threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\transformers\generation\utils.py", line 1736, in generate
result = self._sample(
^^^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\transformers\generation\utils.py", line 2375, in _sample
outputs = self(
^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\accelerate\hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\cogvlm2-llama3-chinese-chat-19B\modeling_cogvlm.py", line 620, in forward
outputs = self.model(
^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\accelerate\hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\cogvlm2-llama3-chinese-chat-19B\modeling_cogvlm.py", line 402, in forward
return self.llm_forward(
^^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\cogvlm2-llama3-chinese-chat-19B\modeling_cogvlm.py", line 486, in llm_forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\accelerate\hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\cogvlm2-llama3-chinese-chat-19B\modeling_cogvlm.py", line 261, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\accelerate\hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\cogvlm2-llama3-chinese-chat-19B\modeling_cogvlm.py", line 204, in forward
query_states, key_states = self.rotary_emb(query_states, key_states, position_ids=position_ids, max_seqlen=position_ids.max() + 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\accelerate\hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\cogvlm2-llama3-chinese-chat-19B\util.py", line 469, in forward
q = apply_rotary_emb_func(
^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\cogvlm2-llama3-chinese-chat-19B\util.py", line 329, in apply_rotary_emb
return ApplyRotaryEmb.apply(
^^^^^^^^^^^^^^^^^^^^^
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\torch\autograd\function.py", line 598, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\cogvlm2-llama3-chinese-chat-19B\util.py", line 255, in forward
out = apply_rotary(
^^^^^^^^^^^^^
File "C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\cogvlm2-llama3-chinese-chat-19B\util.py", line 212, in apply_rotary
rotary_kernel[grid](
File "D:\AITest\CogVLM2\Python311\Lib\site-packages\triton\runtime\jit.py", line 160, in
```
It looks like 4-bit quantized loading works and the web UI starts normally, but after a prompt is submitted the terminal prints the errors above: it apparently tries to compile some CUDA code locally, and that compilation fails. GPU memory usage is about 16 GB, which matches the official figure. The process is not interrupted; a second prompt can still be submitted, but it raises the same TypeError as above.
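(As a hypothetical way to isolate the failure, the minimal Triton kernel below can be JIT-compiled outside the demo; if it fails with the same Python.h error, the problem is the local MSVC/INCLUDE setup rather than CogVLM2's code. This is a diagnostic sketch I am adding, not part of the original report.)

```python
# Minimal Triton JIT smoke test: a standard vector-add kernel.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

n = 1024
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")
out = torch.empty_like(x)
# Launching the kernel should exercise the same local C build step
# (main.c compiled by cl.exe) that fails for the rotary kernel above.
add_kernel[(triton.cdiv(n, 256),)](x, y, out, n, BLOCK_SIZE=256)
print(torch.allclose(out, x + y))  # True if compilation and launch succeeded
```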
The web UI behavior is shown in the screenshot below:
I would appreciate it if someone could find time to analyze this and offer some pointers. Many thanks!
Who can help?
No response
Information
- [X] The official example scripts
- [X] My own modified scripts and tasks
Reproduction
The modified script is:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer, BitsAndBytesConfig

fp4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.float32,
)

MODEL_PATH = "checkpoints/cogvlm2-llama3-chinese-chat-19B"
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float32
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
    quantization_config=fp4_config,
    device_map="auto",
).eval()
```
Replacing those lines should reproduce the issue.
Expected behavior
It would be great if an official script for running with 4-bit quantization could be provided. Thanks.
Thanks for the reply! It's not that I submitted an empty message: the model did receive it, but the web UI never echoed a reply, because the Triton compilation in the background had already failed. After adding environment variables and sorting out the VS compiler settings that the Triton runtime needs, it now compiles and runs correctly: the 4-bit model loads and chats normally, and CogVLM2's output quality is very good. Kudos on your work!
Quantized loading and inference are no longer a problem, but the first image uploaded in a chainlit session is still not echoed correctly; in the conversation shown below, the output itself is nonetheless normal.
After clicking "New Chat" to start a new session, everything works, as shown below:
I don't know the cause, whether it is a chainlit issue or the way the demo script handles conversation history. If anyone has time to analyze it, thanks!
I'm hitting a similar error. Which environment variables did you add?
The error in my log was: fatal error C1083: Cannot open include file: 'Python.h': No such file or directory, so I added the current Python environment's Python311\include directory to the INCLUDE environment variable. On my machine, that is:

INCLUDE=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.29.30133\include;C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\um;C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\ucrt;C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\shared;D:\AITest\CogVLM2\Python311\include

Hope that helps as a reference.
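(A sketch building on the comment above, not something from the original thread: the same INCLUDE setup can also be applied at the top of the demo script before any Triton compilation is triggered, since child processes such as cl.exe inherit the parent environment. The paths below are examples from this thread and must be adapted to your own install.)

```python
import os

# Example include paths (adapt to your MSVC / Windows SDK / Python install).
# Python311\include is the directory that provides Python.h for Triton's JIT build.
include_dirs = [
    r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include",
    r"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.29.30133\include",
    r"C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\um",
    r"C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\ucrt",
    r"C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\shared",
    r"D:\AITest\CogVLM2\Python311\include",
]
# Prepend to whatever INCLUDE already holds; the cl.exe that Triton
# launches will inherit this environment variable.
os.environ["INCLUDE"] = ";".join(include_dirs) + ";" + os.environ.get("INCLUDE", "")
```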