ChatGLM-6B
GPU memory keeps growing during chatGLM inference until it is exhausted
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
I tried chatGLM today and noticed that GPU memory usage keeps increasing as the number of inference rounds grows, until it eventually runs out of memory.
Traceback (most recent call last):
  in <cell line: 1>
      response, history = model.chat(tokenizer, "帮我开发的智能扫地机器人起个名字,再写一篇600
      print(response)
  torch/autograd/grad_mode.py:27 in decorate_context
      return func(*args, **kwargs)
  .../transformers_modules/THUDM/chatglm-6b/.../modeling_chatglm.py:1107 in chat
      outputs = self.generate(**input_ids, **gen_kwargs)
  torch/autograd/grad_mode.py:27 in decorate_context
      return func(*args, **kwargs)
  transformers/generation/utils.py:1437 in generate
      return self.sample(input_ids, logits_processor=logits_processor, logits_warper=logits_warper, ...)
  transformers/generation/utils.py:2443 in sample
      outputs = self(**model_inputs, return_dict=True, output_attentions=output_attentions, ...)
  torch/nn/modules/module.py:1110 in _call_impl
      return forward_call(*input, **kwargs)
  .../modeling_chatglm.py:1027 in forward
      transformer_outputs = self.transformer(input_ids=input_ids, position_ids=position_ids, attention_mask=attention_mask, ...)
  torch/nn/modules/module.py:1110 in _call_impl
      return forward_call(*input, **kwargs)
  .../modeling_chatglm.py:872 in forward
      layer_ret = layer(hidden_states, position_ids=position_ids, attention_mask=attention_mask, ...)
  torch/nn/modules/module.py:1110 in _call_impl
      return forward_call(*input, **kwargs)
  .../modeling_chatglm.py:573 in forward
      attention_outputs = self.attention(attention_input, position_ids, attention_mask=attention_mask, ...)
  torch/nn/modules/module.py:1110 in _call_impl
      return forward_call(*input, **kwargs)
  .../modeling_chatglm.py:427 in forward
      context_layer, present, attention_probs = attention_fn(self=self, query_layer=query_layer, key_layer=key_layer, ...)
  .../modeling_chatglm.py:215 in attention_fn
      key_layer = torch.cat((past_key, key_layer), dim=0)

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.70 GiB total capacity;
12.78 GiB already allocated; 7.56 MiB free; 12.86 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Expected Behavior
No response
Steps To Reproduce
chatGLM
Environment
- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :
Anything else?
No response
Multi-turn conversation does progressively increase GPU memory consumption.
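For a sense of scale: the bottom frame of the traceback (`key_layer = torch.cat((past_key, key_layer), dim=0)` in `attention_fn`) is the key/value cache being extended, and because `model.chat()` re-encodes the whole accumulated history as the prompt, each round makes that cache larger. A rough, illustrative estimate, assuming the published ChatGLM-6B configuration (28 transformer layers, hidden size 4096) and an fp16 cache:

```python
# Back-of-the-envelope KV-cache estimate for ChatGLM-6B.
# Assumed config: 28 layers, hidden size 4096, fp16 (2 bytes/value); illustrative only.
num_layers = 28
hidden_size = 4096
bytes_per_value = 2  # fp16

# Each token keeps one key vector and one value vector per layer.
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value
print(f"KV cache per token: ~{kv_bytes_per_token / 1024:.0f} KiB")  # ~448 KiB

# A context that has grown to the default max_length of 2048 tokens:
tokens = 2048
print(f"KV cache for {tokens} tokens: ~{tokens * kv_bytes_per_token / 1024**3:.2f} GiB")  # ~0.88 GiB
```

That comes on top of roughly 13 GiB of fp16 weights plus activations and attention-probability tensors, which is why even a 24 GB card can tip over after many long rounds.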
You can avoid the OOM by capping how many past rounds are kept in history, or by switching to a smaller quantized model. Clearing the conversation regularly also works: if you watch closely, after a clear the reported memory usage does not drop, but it also stops growing in later multi-turn conversation. My guess is that the memory was claimed when the model first ran; after a clear, the space previously used is released inside PyTorch and later runs reuse it, but the system is unaware of this and keeps reporting the same amount as occupied by the model.
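That matches how PyTorch's caching allocator behaves: freed blocks go back to PyTorch's internal pool rather than to the driver, so tools like nvidia-smi keep showing the old peak. A small sketch, using standard `torch.cuda` calls, to see the difference between memory held by live tensors and memory merely reserved by the allocator:

```python
import torch

# memory_allocated(): bytes currently held by live tensors.
# memory_reserved():  bytes kept by PyTorch's caching allocator
#                     (roughly what nvidia-smi attributes to the process).
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.0f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.0f} MiB")
```

Running this before and after clearing the history should show `allocated` dropping while `reserved` stays roughly flat, which is the behaviour described above.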
Is there a recommended hardware configuration? Which GPU and CPU should I use?
Each time you clear the conversation, also call torch.cuda.empty_cache().
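For example, in a simple command-line loop (a sketch only; `model` and `tokenizer` are assumed to have been loaded as in the repo's README):

```python
import torch

history = []
while True:
    query = input("User: ").strip()
    if query.lower() == "clear":
        history = []                # drop the accumulated (query, response) turns
        torch.cuda.empty_cache()    # return cached GPU blocks to the driver
        continue
    if query.lower() == "stop":
        break
    response, history = model.chat(tokenizer, query, history=history)
    print("ChatGLM-6B:", response)
```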
The default code is recommended for GPUs with 24 GB of memory or more, in which case it generally won't run out. If you only have around 12 GB, use int4 quantization and it generally won't run out either. In fact, if it runs fine at the start, lowering the maximum number of conversation rounds (for example, to 10 or fewer) is enough to keep it running stably.
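For reference, a minimal sketch of loading the pre-quantized int4 checkpoint, following the pattern in the repo's README (memory figures are approximate; the README also documents quantizing the full fp16 checkpoint at load time via `.quantize(4)`):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

# Pre-quantized int4 checkpoint: the weights need roughly 6 GB of GPU memory.
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).half().cuda()
model = model.eval()

response, history = model.chat(tokenizer, "你好", history=[])
print(response)
```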
> Clearing the conversation regularly also works: after a clear the memory usage won't drop, but it also won't keep growing in later turns.

How do I do the clear?
> In fact, if it runs fine at the start, lowering the maximum number of conversation rounds (for example, to 10 or fewer) is enough to keep it running stably.

Is there a concrete way to do this?
Taking web_demo.py as an example, change lines 61–62 of the source to

top_p=top_p, temperature=temperature, history=(history if len(history) <= 10 else history[-10:])):

so that history only remembers the last 10 rounds of conversation.
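Put together, the relevant part of predict() in web_demo.py would look roughly like the sketch below (based on the Gradio version of the demo; the surrounding function, parse_text, and the exact line numbers may differ between repo versions):

```python
def predict(input, chatbot, max_length, top_p, temperature, history):
    chatbot.append((parse_text(input), ""))
    # Cap the history handed to the model at the most recent 10 rounds so the
    # prompt (and therefore the per-call KV cache) stops growing without bound.
    for response, history in model.stream_chat(
            tokenizer, input,
            history=(history if len(history) <= 10 else history[-10:]),
            max_length=max_length, top_p=top_p, temperature=temperature):
        chatbot[-1] = (parse_text(input), parse_text(response))
        yield chatbot, history
```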
After making the change described above, I still ran out of GPU memory.
Using int4 it still errors out, and I don't know why:

OutOfMemoryError: CUDA out of memory. Tried to allocate 11.87 GiB (GPU 0; 11.99 GiB total capacity; 9.28 GiB already allocated; 0 bytes free; 10.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Same here on ChatGLM2-6B: even running zero-shot jobs without any history, the CUDA device's memory keeps growing.
Update: it's caused by the implementation of langchain.llms.chatGLM, and I've submitted a fix via https://github.com/hwchase17/langchain/pull/8048