ChatGLM-6B: Distributed deployment inference

Great work! Does it support multi-GPU inference on a single machine?

Mar 15 '23 06:03 T-Atlas

I gave it a quick try:

import os
import platform
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True,cache_dir='/home/sss/ModelWeight')
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True,cache_dir='/home/sss/ModelWeight',device_map="auto").half().quantize(8)#.cuda()
model = model.eval()

os_name = platform.system()

history = []
print("Welcome to the ChatGLM-6B model. Type a message to chat, clear to reset the conversation history, stop to exit.")
while True:
    query = input("\nUser: ")
    if query == "stop":
        break
    if query == "clear":
        history = []
        command = 'cls' if os_name == 'Windows' else 'clear'
        os.system(command)
        print("Welcome to the ChatGLM-6B model. Type a message to chat, clear to reset the conversation history, stop to exit.")
        continue
    response, history = model.chat(tokenizer, query, history=history)
    print(f"ChatGLM-6B: {response}")

and got:

Traceback (most recent call last):
  File "/home/sss/ChatGLM-6B/cli_demo.py", line 23, in <module>
    response, history = model.chat(tokenizer, query, history=history)
  File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/sss/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/c3dece3f011574955ccba53b9224ee9f78dfd54e/modeling_chatglm.py", line 1109, in chat
    outputs = self.generate(**input_ids, **gen_kwargs)
  File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/sss/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/c3dece3f011574955ccba53b9224ee9f78dfd54e/modeling_chatglm.py", line 1133, in generate
    output_ids = super().generate(**kwargs)
  File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/transformers/generation/utils.py", line 1437, in generate
    return self.sample(
  File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/transformers/generation/utils.py", line 2443, in sample
    outputs = self(
  File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/sss/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/c3dece3f011574955ccba53b9224ee9f78dfd54e/modeling_chatglm.py", line 1029, in forward
    transformer_outputs = self.transformer(
  File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sss/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/c3dece3f011574955ccba53b9224ee9f78dfd54e/modeling_chatglm.py", line 874, in forward
    layer_ret = layer(
  File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sss/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/c3dece3f011574955ccba53b9224ee9f78dfd54e/modeling_chatglm.py", line 573, in forward
    attention_outputs = self.attention(
  File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sss/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/c3dece3f011574955ccba53b9224ee9f78dfd54e/modeling_chatglm.py", line 398, in forward
    mixed_raw_layer = self.query_key_value(hidden_states)
  File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sss/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/c3dece3f011574955ccba53b9224ee9f78dfd54e/quantization.py", line 137, in forward
    output = W8A16Linear.apply(input, self.weight, self.weight_scale, self.weight_bit_width)
  File "/home/sss/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/c3dece3f011574955ccba53b9224ee9f78dfd54e/quantization.py", line 22, in forward
    output = inp.mm(weight.t())
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat2 in method wrapper_mm)

Not that easy, sigh.

Mar 15 '23 12:03 eeyrw

You commented out cuda() in your code, so why does it still run on CUDA? The error seems to come from quantization. Maybe remove .quantize(8) and try again.

Disclaimer: I'm not familiar with torch.

Mar 16 '23 08:03 duanqn

@duanqn That line, model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True,cache_dir='/home/sss/ModelWeight',device_map="auto").half().quantize(8)#.cuda(), uses device_map="auto", which assigns the model to multiple GPUs automatically, so a manual cuda() call is not needed. I have tried with just half(), but I get the same error.
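
To see where device_map="auto" actually placed each module, one option is to inspect model.hf_device_map, which transformers sets on models loaded with a device map. The helper below is only a sketch: summarize_device_map and the example dictionary are hypothetical and not taken from the run above. It works on a plain dict, so it can be tried without a GPU.

```python
from collections import defaultdict


def summarize_device_map(device_map):
    """Group module names by the device they were assigned to."""
    by_device = defaultdict(list)
    for module_name, device in device_map.items():
        by_device[device].append(module_name)
    return dict(by_device)


# Hypothetical placement, shaped like model.hf_device_map (not real output).
example_map = {
    "transformer.word_embeddings": 0,
    "transformer.layers.0": 0,
    "transformer.layers.1": 1,
    "lm_head": 1,
}

for device, modules in summarize_device_map(example_map).items():
    print(f"device {device}: {modules}")
```

With a real model, passing model.hf_device_map instead of example_map would show which modules ended up on cuda:0 and which on cuda:1, which may help narrow down where the input and the weight diverge.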

Mar 16 '23 08:03 eeyrw

Could you try the code from the Multi-GPU Deployment section?

from utils import load_model_on_gpus
model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=2)
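
For intuition, a helper of this kind can assign whole layers to specific GPUs through an explicit device map rather than relying on "auto". The sketch below is an illustration under assumptions, not the repository's implementation: build_device_map is a hypothetical name, and the module names and the default of 28 layers are assumed for ChatGLM-6B.

```python
def build_device_map(num_layers=28, num_gpus=2):
    """Spread transformer layers evenly across GPUs (illustrative sketch)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    # Assumed module names; embedding, final norm and head stay on GPU 0.
    device_map = {
        "transformer.word_embeddings": 0,
        "transformer.final_layernorm": 0,
        "lm_head": 0,
    }

    # Count the embedding and the output head as two layer-sized slots.
    per_gpu = (num_layers + 2) / num_gpus
    used, gpu = 2, 0
    for i in range(num_layers):
        # Move to the next GPU once the current one has its share.
        if used >= per_gpu and gpu < num_gpus - 1:
            gpu += 1
            used = 0
        device_map[f"transformer.layers.{i}"] = gpu
        used += 1
    return device_map


device_map = build_device_map(num_layers=28, num_gpus=2)
print(device_map["transformer.layers.12"], device_map["transformer.layers.13"])
```

A dictionary like this could then be handed to accelerate's dispatch_model(model, device_map=device_map). Each layer's weights live entirely on one GPU, and accelerate's hooks move the activations between devices at layer boundaries.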

Aug 16 '23 01:08 zhangch9
