[BUG] Out of GPU memory on four V100 cards, also in LLM chat mode
Problem Description: running on a machine with four V100 GPUs still runs out of GPU memory; both LLM chat mode and knowledge-base QA are affected.
Steps to Reproduce: it happens every time.
Actual Result
To create a public link, set share=True in launch().
Traceback (most recent call last):
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/gradio/routes.py", line 412, in run_predict
    output = await app.get_blocks().process_api(
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/gradio/blocks.py", line 1299, in process_api
    result = await self.call_function(
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/gradio/blocks.py", line 1035, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/anyio/to_thread.py", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
    return await future
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 754, in run
    result = context.run(func, *args)
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/gradio/utils.py", line 491, in async_iteration
    return next(iterator)
  File "webui.py", line 72, in get_answer
    for resp, history in local_doc_qa.llm._call(query, history, streaming=streaming):
  File "{home}/langchain-ChatGLM/models/chatglm_llm.py", line 72, in _call
    for inum, (stream_resp, _) in enumerate(self.model.stream_chat(
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
    response = gen.send(request)
  File "{home}/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1311, in stream_chat
    for outputs in self.stream_generate(**inputs, **gen_kwargs):
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
    response = gen.send(request)
  File "{home}/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1388, in stream_generate
    outputs = self(
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "{home}/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1190, in forward
    transformer_outputs = self.transformer(
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "{home}/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 996, in forward
    layer_ret = layer(
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "{home}/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 627, in forward
    attention_outputs = self.attention(
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "{home}/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 460, in forward
    cos, sin = self.rotary_emb(q1, seq_len=position_ids.max() + 1)
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "{home}/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 202, in forward
    t = torch.arange(seq_len, device=x.device, dtype=self.inv_freq.dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory.
Environment Information
- langchain-ChatGLM version / commit: f173d8b6b1f0ca51f33bf470050aea3564c61b21
- Deployed with Docker (yes/no): no
- Model used: ChatGLM-6B
- Embedding model used: GanymedeNil/text2vec-large-chinese
- OS and version: Ubuntu 20.04.1
- Python version: 3.8.16
It tried to allocate 1 EB of GPU memory; where is the bug? This looks similar to https://github.com/THUDM/ChatGLM-6B/issues/990, with one difference: in my case the model loads successfully, and the failure only happens once a chat or QA session starts.
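For reference, the absurd allocation size comes straight from the last two frames of the traceback: the rotary-embedding cache is sized by `position_ids.max() + 1`, and that size is fed to `torch.arange`. If the sharded `position_ids` tensor is corrupted or read from the wrong device under multi-GPU placement, `.max()` can return a garbage value, and `torch.arange` then tries to allocate that many elements. A minimal sketch (the garbage value is hypothetical, chosen only to show the mechanism):

```python
import torch

def rotary_cache_size(position_ids: torch.Tensor) -> int:
    # Mirrors the pattern in modeling_chatglm.py: the rotary embedding
    # cache is sized by the largest position id seen so far.
    return int(position_ids.max().item()) + 1

# Normal case: positions 0..9 -> a cache of length 10.
ok = torch.arange(10)
print(rotary_cache_size(ok))  # 10

# If the tensor holds garbage (e.g. read across devices during model
# sharding), max() can be astronomically large; torch.arange(seq_len)
# would then attempt an allocation on the order of the reported "1EB".
garbage = torch.tensor([2**60])
print(rotary_cache_size(garbage))  # 1152921504606846977
```

This is consistent with the report that loading succeeds but the first generated token fails: the bad value is only read once generation computes position ids.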
Same here. I gave it five 3090s and it still runs out of memory, and I have not added any documents to the model. Whether in knowledge-base QA or plain LLM mode, asking it to generate a math formula almost always blows up 😓
Forcing the model onto a single card works around it, though I have not dug into the code to fix the multi-GPU case: in models/chatglm_llm.py, change num_gpus = torch.cuda.device_count() to num_gpus = 1. Forcing single-GPU mode bypasses the bug.
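For anyone applying this workaround, the idea is simply to report one GPU to the loader so accelerate never shards the model across cards (the sharded path is where the OOM appears). A hedged sketch; `effective_num_gpus` is an illustrative helper, not part of the project's code:

```python
import torch

def effective_num_gpus(force_single: bool) -> int:
    # Workaround from this thread: pretend only one GPU exists so the
    # loader keeps the whole model on a single card instead of letting
    # accelerate distribute layers across all visible devices.
    return 1 if force_single else max(torch.cuda.device_count(), 1)

# The thread's patch amounts to replacing, in models/chatglm_llm.py:
#     num_gpus = torch.cuda.device_count()
# with:
#     num_gpus = effective_num_gpus(force_single=True)
print(effective_num_gpus(True))  # 1
```

A non-invasive alternative with the same effect is to restrict visible devices before launching, e.g. `CUDA_VISIBLE_DEVICES=0 python webui.py`.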
No problem here: dual A100s, and the program loads across both cards. I have my own knowledge base, and both KB QA and plain LLM mode work fine.
PID 54497 is this inference deployment, running on two cards (a single card has more than enough memory; I don't quite understand why it gets split across two different cards, and whether that makes it slower).
Maybe it's a version difference; this feels like a bug introduced by recent code changes. The multi-GPU split is presumably meant to balance load across cards.
Hitting the same problem. Has anyone solved it?
Same problem here, but on a single V100: GPU memory is fine after vectorizing the knowledge base; it blows up as soon as I click send on a question.
My fix: switching to a clean document (no special characters) made it work. With the previous document, even asking "你好" ("hello") immediately ran out of memory.
Is it really that sensitive to document cleanliness? In my case every question adds several GB of GPU memory usage until the memory is released; with several concurrent users it would be unworkable.
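The per-question growth is consistent with the full chat history being re-fed to the model each turn, so the prompt, KV cache, and activations grow with every exchange. A hedged sketch of two common mitigations (truncating history and releasing cached allocator blocks between requests); the function names and the turn limit are illustrative, not the project's actual API:

```python
import torch

MAX_HISTORY_TURNS = 3  # illustrative limit; tune to your GPU

def trim_history(history: list) -> list:
    # Keep only the most recent (question, answer) turns; older turns
    # are what makes the prompt, and therefore memory use, grow
    # without bound across a long session.
    return history[-MAX_HISTORY_TURNS:]

def release_gpu_memory() -> None:
    # Return cached allocator blocks to the driver between requests.
    # This frees memory *between* questions but does not reduce the
    # peak usage of a single long generation.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

turns = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
print(trim_history(turns))  # keeps the last three turns
```

Neither change fixes the 1 EB bug itself, but together they keep per-session memory roughly flat instead of growing with every question.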
My fix was setting num_gpus = 1 in models/loader/loader.py.