ChatGLM-6B
GPU memory keeps growing during chatGLM inference until it is exhausted
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
I tried chatGLM today and noticed that GPU memory usage keeps increasing as the number of inference rounds grows, until it eventually runs out of memory.
Traceback (most recent call last):
  in <cell line: 1>
      response, history = model.chat(tokenizer, "帮我开发的智能扫地机器人起个名字,再写一篇600
      print(response)
  torch/autograd/grad_mode.py:27 in decorate_context
      return func(*args, **kwargs)
  .../transformers_modules/THUDM/chatglm-6b/.../modeling_chatglm.py:1107 in chat
      outputs = self.generate(**input_ids, **gen_kwargs)
  torch/autograd/grad_mode.py:27 in decorate_context
      return func(*args, **kwargs)
  transformers/generation/utils.py:1437 in generate
      return self.sample(input_ids, logits_processor=logits_processor, logits_warper=logits_warper, ...)
  transformers/generation/utils.py:2443 in sample
      outputs = self(**model_inputs, return_dict=True, output_attentions=output_attentions, ...)
  torch/nn/modules/module.py:1110 in _call_impl
      return forward_call(*input, **kwargs)
  .../modeling_chatglm.py:1027 in forward
      transformer_outputs = self.transformer(input_ids=input_ids, position_ids=position_ids, attention_mask=attention_mask, ...)
  torch/nn/modules/module.py:1110 in _call_impl
      return forward_call(*input, **kwargs)
  .../modeling_chatglm.py:872 in forward
      layer_ret = layer(hidden_states, position_ids=position_ids, attention_mask=attention_mask, ...)
  torch/nn/modules/module.py:1110 in _call_impl
      return forward_call(*input, **kwargs)
  .../modeling_chatglm.py:573 in forward
      attention_outputs = self.attention(attention_input, position_ids, attention_mask=attention_mask, ...)
  torch/nn/modules/module.py:1110 in _call_impl
      return forward_call(*input, **kwargs)
  .../modeling_chatglm.py:427 in forward
      context_layer, present, attention_probs = attention_fn(self=self, query_layer=query_layer, key_layer=key_layer, ...)
  .../modeling_chatglm.py:215 in attention_fn
      key_layer = torch.cat((past_key, key_layer), dim=0)

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.70 GiB total capacity;
12.78 GiB already allocated; 7.56 MiB free; 12.86 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Expected Behavior
No response
Steps To Reproduce
chatGLM
Environment
- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :
Anything else?
No response
Multi-turn conversation does progressively increase GPU memory consumption.
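For a sense of scale: the bottom frame of the traceback (`key_layer = torch.cat((past_key, key_layer), dim=0)` in `attention_fn`) is the key/value cache being extended, and because `model.chat()` re-encodes the whole accumulated history as the prompt, each round makes that cache larger. A rough, illustrative estimate, assuming the published ChatGLM-6B configuration (28 transformer layers, hidden size 4096) and an fp16 cache:

```python
# Back-of-the-envelope KV-cache estimate for ChatGLM-6B.
# Assumed config: 28 layers, hidden size 4096, fp16 (2 bytes/value); illustrative only.
num_layers = 28
hidden_size = 4096
bytes_per_value = 2  # fp16

# Each token keeps one key vector and one value vector per layer.
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value
print(f"KV cache per token: ~{kv_bytes_per_token / 1024:.0f} KiB")  # ~448 KiB

# A context that has grown to the default max_length of 2048 tokens:
tokens = 2048
print(f"KV cache for {tokens} tokens: ~{tokens * kv_bytes_per_token / 1024**3:.2f} GiB")  # ~0.88 GiB
```

That comes on top of roughly 13 GiB of fp16 weights plus activations and attention-probability tensors, which is why even a 24 GB card can tip over after many long rounds.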
You can avoid the OOM by capping how many past rounds are kept in history, or by switching to a smaller quantized model. Clearing the conversation regularly also works: if you watch closely, after a clear the reported memory usage does not drop, but it also stops growing in later multi-turn conversation. My guess is that the memory was claimed when the model first ran; after a clear, the space previously used is released inside PyTorch and later runs reuse it, but the system is unaware of this and keeps reporting the same amount as occupied by the model.
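That matches how PyTorch's caching allocator behaves: freed blocks go back to PyTorch's internal pool rather than to the driver, so tools like nvidia-smi keep showing the old peak. A small sketch, using standard `torch.cuda` calls, to see the difference between memory held by live tensors and memory merely reserved by the allocator:

```python
import torch

# memory_allocated(): bytes currently held by live tensors.
# memory_reserved():  bytes kept by PyTorch's caching allocator
#                     (roughly what nvidia-smi attributes to the process).
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.0f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.0f} MiB")
```

Running this before and after clearing the history should show `allocated` dropping while `reserved` stays roughly flat, which is the behaviour described above.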
Is there a recommended hardware configuration? Which GPU and CPU should I use?
Each time you clear the conversation, also call torch.cuda.empty_cache().
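For example, in a simple command-line loop (a sketch only; `model` and `tokenizer` are assumed to have been loaded as in the repo's README):

```python
import torch

history = []
while True:
    query = input("User: ").strip()
    if query.lower() == "clear":
        history = []                # drop the accumulated (query, response) turns
        torch.cuda.empty_cache()    # return cached GPU blocks to the driver
        continue
    if query.lower() == "stop":
        break
    response, history = model.chat(tokenizer, query, history=history)
    print("ChatGLM-6B:", response)
```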
The default code is recommended for GPUs with 24 GB of memory or more, in which case it generally won't run out. If you only have around 12 GB, use int4 quantization and it generally won't run out either. In fact, if it runs fine at the start, lowering the maximum number of conversation rounds (for example, to 10 or fewer) is enough to keep it running stably.
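For reference, a minimal sketch of loading the pre-quantized int4 checkpoint, following the pattern in the repo's README (memory figures are approximate; the README also documents quantizing the full fp16 checkpoint at load time via `.quantize(4)`):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

# Pre-quantized int4 checkpoint: the weights need roughly 6 GB of GPU memory.
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).half().cuda()
model = model.eval()

response, history = model.chat(tokenizer, "你好", history=[])
print(response)
```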
> Clearing the conversation regularly also works: after a clear the memory usage won't drop, but it also won't keep growing in later turns.

How do I do the clear?
> In fact, if it runs fine at the start, lowering the maximum number of conversation rounds (for example, to 10 or fewer) is enough to keep it running stably.

Is there a concrete way to do this?
Taking web_demo.py as an example, change lines 61–62 of the source to

top_p=top_p, temperature=temperature, history=(history if len(history) <= 10 else history[-10:])):

so that history only remembers the last 10 rounds of conversation.
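Put together, the relevant part of predict() in web_demo.py would look roughly like the sketch below (based on the Gradio version of the demo; the surrounding function, parse_text, and the exact line numbers may differ between repo versions):

```python
def predict(input, chatbot, max_length, top_p, temperature, history):
    chatbot.append((parse_text(input), ""))
    # Cap the history handed to the model at the most recent 10 rounds so the
    # prompt (and therefore the per-call KV cache) stops growing without bound.
    for response, history in model.stream_chat(
            tokenizer, input,
            history=(history if len(history) <= 10 else history[-10:]),
            max_length=max_length, top_p=top_p, temperature=temperature):
        chatbot[-1] = (parse_text(input), parse_text(response))
        yield chatbot, history
```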
After making the change described above, I still ran out of GPU memory.
Using int4 it still errors out, and I don't know why:

OutOfMemoryError: CUDA out of memory. Tried to allocate 11.87 GiB (GPU 0; 11.99 GiB total capacity; 9.28 GiB already allocated; 0 bytes free; 10.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Same here on ChatGLM2-6B: even running zero-shot jobs without any history, the CUDA device's memory keeps growing.
Update: it's caused by the implementation of langchain.llms.chatGLM, and I've submitted a fix via https://github.com/hwchase17/langchain/pull/8048