[BUG] Multi-GPU inference demo reports an error
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
```
/home/shy/.cache/huggingface/modules/transformers_modules/ChatGLM-6B/modeling_chatglm.py:202 in forward

   199 │           seq_len = x.shape[seq_dim]
   200 │       if self.max_seq_len_cached is None or (seq_len > self.max_seq_len_cached):
   201 │           self.max_seq_len_cached = None if self.learnable else seq_len
 ❱ 202 │           t = torch.arange(seq_len, device=x.device, dtype=self.inv_freq.dtype)
   203 │           freqs = torch.einsum('i,j->ij', t, self.inv_freq)
   204 │           # Different from paper, but it uses a different permutation in order to obta
   205 │           emb = torch.cat((freqs, freqs), dim=-1).to(x.device)

OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory.
```
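For scale, a quick back-of-the-envelope check on that request (illustration only; nothing in this thread establishes the actual root cause): `torch.arange(seq_len, dtype=torch.half)` needs roughly `seq_len * 2` bytes, so asking for more than 1 EB implies `seq_len` ended up as an absurdly large value rather than any real sequence length.

```python
# Illustration only: how large seq_len would have to be to explain the
# reported 1 EB allocation at fp16 (2 bytes per element). Not a root-cause claim.
one_exabyte = 10 ** 18                 # bytes
implied_seq_len = one_exabyte // 2     # elements at 2 bytes each
print(f"implied seq_len ≈ {implied_seq_len:.3e}")   # ~5e17, far beyond any real prompt
```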
Expected Behavior
No response
Steps To Reproduce
- In cli_demo.py, change the model loading code as follows (a fuller loading sketch follows below):

```python
# model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
from utils import load_model_on_gpus
model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=4)
```

- Everything else is left unchanged.
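For completeness, the surrounding loading section of cli_demo.py then looks roughly like this. This is a sketch based on the stock demo: the tokenizer line and the `model.eval()` call are assumptions about the unmodified parts, not text from this report.

```python
# Sketch of the modified loading section in cli_demo.py. Only the model line
# is the change described above; the rest is assumed from the stock demo.
from transformers import AutoTokenizer, AutoModel
from utils import load_model_on_gpus  # helper shipped with the ChatGLM-6B repo

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

# Original single-GPU path, kept commented out for reference:
# model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()

# Multi-GPU path that triggers the error on 4x Tesla K80:
model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=4)
model = model.eval()
```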
Environment
- OS: Ubuntu 18.04
- Python: 3.10.9
- Transformers: 4.27.1
- PyTorch: 2.0.0
- CUDA Support: True
- CUDA Runtime: 11.7.1
- Device: Tesla K80 × 4
Anything else?
No response
Same question here — how did the original poster solve this?

I ran into the same problem: with 4× RTX 4090, multi-GPU inference raises the same error, but it goes away after setting num_gpus=1. The message claims it tried to allocate 1 EB of GPU memory, which is absurd. Is there a fix?

Same problem here — how can it be resolved?

The issue appears randomly; I haven't found a way around it so far.
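Since the failing line mixes `x.device` with `self.inv_freq`, one way to narrow this down is to dump where every parameter and buffer actually landed after `load_model_on_gpus` splits the model across GPUs. This is purely a diagnostic sketch under the setup described above, not a confirmed fix.

```python
# Diagnostic sketch only (not a confirmed fix): print which GPU each
# parameter/buffer ended up on after the multi-GPU split, so cross-device
# mismatches around the rotary embedding (whose inv_freq appears in the
# failing line) become visible.
from collections import defaultdict

from utils import load_model_on_gpus  # helper from the ChatGLM-6B repo

model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=4)

placement = defaultdict(list)
for name, tensor in list(model.named_parameters()) + list(model.named_buffers()):
    placement[str(tensor.device)].append(name)

for device, names in sorted(placement.items()):
    print(f"{device}: {len(names)} tensors, e.g. {names[:3]}")
```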