
[BUG] OOM on 4 V100 cards, even in plain LLM chat mode

Open charryshi opened this issue 2 years ago • 10 comments

Problem Description: The machine runs out of GPU memory even with 4 V100 cards; both plain LLM chat mode and knowledge-base QA are affected.

Steps to Reproduce: it happens every time.

Actual Result:

To create a public link, set share=True in launch().
Traceback (most recent call last):
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/gradio/routes.py", line 412, in run_predict
    output = await app.get_blocks().process_api(
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/gradio/blocks.py", line 1299, in process_api
    result = await self.call_function(
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/gradio/blocks.py", line 1035, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/anyio/to_thread.py", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
    return await future
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 754, in run
    result = context.run(func, *args)
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/gradio/utils.py", line 491, in async_iteration
    return next(iterator)
  File "webui.py", line 72, in get_answer
    for resp, history in local_doc_qa.llm._call(query, history, streaming=streaming):
  File "{home}/langchain-ChatGLM/models/chatglm_llm.py", line 72, in _call
    for inum, (stream_resp, _) in enumerate(self.model.stream_chat(
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
    response = gen.send(request)
  File "{home}/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1311, in stream_chat
    for outputs in self.stream_generate(**inputs, **gen_kwargs):
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
    response = gen.send(request)
  File "{home}/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1388, in stream_generate
    outputs = self(
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "{home}/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1190, in forward
    transformer_outputs = self.transformer(
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "{home}/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 996, in forward
    layer_ret = layer(
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "{home}/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 627, in forward
    attention_outputs = self.attention(
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "{home}/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 460, in forward
    cos, sin = self.rotary_emb(q1, seq_len=position_ids.max() + 1)
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "{home}/miniconda3/envs/chatglm/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "{home}/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 202, in forward
    t = torch.arange(seq_len, device=x.device, dtype=self.inv_freq.dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory.

Environment Information

  • langchain-ChatGLM version / commit: f173d8b6b1f0ca51f33bf470050aea3564c61b21
  • Deployed with Docker (yes/no): no
  • Model used: ChatGLM-6B
  • Embedding model used: GanymedeNil/text2vec-large-chinese
  • OS and version: Ubuntu 20.04.1
  • Python version: 3.8.16

charryshi avatar May 12 '23 09:05 charryshi

It tried to allocate 1 EB of GPU memory; where is the bug coming from?

charryshi avatar May 12 '23 09:05 charryshi
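The traceback gives a hint: in modeling_chatglm.py the rotary embedding computes `seq_len = position_ids.max() + 1` and then calls `torch.arange(seq_len, ...)`. If `position_ids` holds garbage (one plausible cause under accelerate's multi-GPU dispatch is a tensor read from the wrong device), the requested arange length, and with it the allocation, explodes. As an illustration only (`safe_seq_len` and `MAX_REASONABLE_LEN` are hypothetical names, not part of the project), a sanity check at that point would turn the cryptic OOM into a clear error:

```python
import torch

# ChatGLM-6B's context is a few thousand tokens, so anything wildly above
# that indicates a corrupted position_ids tensor rather than a long prompt.
MAX_REASONABLE_LEN = 32768  # illustrative bound, not a project constant

def safe_seq_len(position_ids: torch.Tensor) -> int:
    """Validate the seq_len derived from position_ids before it is
    handed to torch.arange in the rotary embedding."""
    seq_len = int(position_ids.max().item()) + 1
    if seq_len > MAX_REASONABLE_LEN:
        raise ValueError(f"position_ids look corrupted: seq_len={seq_len}")
    return seq_len
```

With a sane tensor this returns the expected length; with a corrupted value it raises instead of asking CUDA for exabytes.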

This looks similar to https://github.com/THUDM/ChatGLM-6B/issues/990.

charryshi avatar May 12 '23 09:05 charryshi

There is still a small difference, though: in my case the model loads successfully, and it only fails once a chat or QA session starts.

charryshi avatar May 12 '23 09:05 charryshi

Same here. I get OOM with 5 RTX 3090s, and I have not added any documents to the model. Whether it is knowledge-base QA or plain LLM mode, asking it to generate a math formula almost always blows up 😓

KyonCN avatar May 12 '23 19:05 KyonCN

Forcing the model to run on a single card works around it; I have not dug into the code to fix the multi-GPU case. In models/chatglm_llm.py, change num_gpus = torch.cuda.device_count() to num_gpus = 1. Forcing a single card bypasses the bug.

charryshi avatar May 13 '23 14:05 charryshi
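The one-line workaround above can be sketched as follows (the `device_map` line is a hypothetical illustration of what a single-card, accelerate-style placement would look like, not the project's actual code):

```python
# Workaround in models/chatglm_llm.py: pin the model to one GPU instead of
# spreading it over every visible card. The original line was:
#     num_gpus = torch.cuda.device_count()
# and the fix replaces it with:
num_gpus = 1  # force single-card loading; bypasses the multi-GPU OOM

# Hypothetical illustration only: a single-card accelerate-style device map
# puts the whole model on GPU 0 instead of letting "auto" shard it.
device_map = {"": 0} if num_gpus == 1 else "auto"
```

The trade-off is that the full fp16 model must now fit on one card, which is fine for ChatGLM-6B on a V100 but loses the ability to shard larger models.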

No problem here: dual A100, and the program loads across both cards. I have my own knowledge base, and both knowledge-base QA and plain LLM mode work fine.

PID 54497 is this inference deployment, running on both cards (a single card has plenty of memory; I don't quite understand why it splits the model across two different cards, and whether that is slower).

(screenshot: nvidia-smi output, 2023-05-15 11:59)

liaoweiguo avatar May 15 '23 04:05 liaoweiguo

Could it be a version difference? It feels like a bug introduced by a recent code change. As for splitting across cards, I assume it is meant to balance load over multiple GPUs.

charryshi avatar May 15 '23 07:05 charryshi

Running into the same problem. Has anyone solved it?

winefox avatar May 22 '23 03:05 winefox

Same problem here, but on a single V100: GPU memory is fine after vectorizing the knowledge base, and it blows up as soon as I click send on a question.

WorgenZhang avatar Jun 01 '23 12:06 WorgenZhang

> Same problem here, but on a single V100: GPU memory is fine after vectorizing the knowledge base, and it blows up as soon as I click send on a question.

My solution: I switched to a clean document (no special characters) and the problem went away. With the old document, even asking "你好" ("hello") immediately blew up GPU memory.

WorgenZhang avatar Jun 01 '23 13:06 WorgenZhang

> …I don't quite understand why it splits the model across two different cards, and whether that is slower)

Is it really that sensitive to document cleanliness? In my case GPU memory grows by several GB with every question until it is released. With several concurrent users it would be unthinkable.

ufocjm avatar Jun 26 '23 08:06 ufocjm
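Separately from the multi-GPU bug, per-question memory growth like the above is sometimes just PyTorch's caching allocator holding on to freed blocks. A general-purpose mitigation, not something proposed by this project's maintainers, is to release the cache between requests; note this does not fix a genuine leak of live tensors (e.g. chat history kept on the GPU):

```python
import torch

def release_cuda_cache() -> None:
    """Return PyTorch's cached (but unused) GPU memory between requests.
    Safe no-op on machines without CUDA."""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # release unused cached blocks to the driver
        torch.cuda.ipc_collect()  # reclaim CUDA IPC memory from dead handles
```

Calling this after each chat turn keeps nvidia-smi readings closer to actual live usage, at the cost of slightly slower subsequent allocations.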

My fix was to set num_gpus = 1 in models/loader/loader.py.

Alicia320 avatar Jul 26 '23 08:07 Alicia320