ChatGLM-6B
Distributed deployment inference
Great work! Does it support distributed inference on a single machine with multiple GPUs?
I made a simple attempt:
import os
import platform
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True, cache_dir='/home/sss/ModelWeight')
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True, cache_dir='/home/sss/ModelWeight', device_map="auto").half().quantize(8)  # .cuda()
model = model.eval()

os_name = platform.system()
history = []
print("Welcome to the ChatGLM-6B model. Type a message to chat, clear to reset the conversation history, stop to exit.")
while True:
    query = input("\nUser: ")
    if query == "stop":
        break
    if query == "clear":
        history = []
        command = 'cls' if os_name == 'Windows' else 'clear'
        os.system(command)
        print("Welcome to the ChatGLM-6B model. Type a message to chat, clear to reset the conversation history, stop to exit.")
        continue
    response, history = model.chat(tokenizer, query, history=history)
    print(f"ChatGLM-6B: {response}")
and get:
Traceback (most recent call last):
File "/home/sss/ChatGLM-6B/cli_demo.py", line 23, in <module>
response, history = model.chat(tokenizer, query, history=history)
File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/sss/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/c3dece3f011574955ccba53b9224ee9f78dfd54e/modeling_chatglm.py", line 1109, in chat
outputs = self.generate(**input_ids, **gen_kwargs)
File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/sss/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/c3dece3f011574955ccba53b9224ee9f78dfd54e/modeling_chatglm.py", line 1133, in generate
output_ids = super().generate(**kwargs)
File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/transformers/generation/utils.py", line 1437, in generate
return self.sample(
File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/transformers/generation/utils.py", line 2443, in sample
outputs = self(
File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/sss/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/c3dece3f011574955ccba53b9224ee9f78dfd54e/modeling_chatglm.py", line 1029, in forward
transformer_outputs = self.transformer(
File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sss/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/c3dece3f011574955ccba53b9224ee9f78dfd54e/modeling_chatglm.py", line 874, in forward
layer_ret = layer(
File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sss/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/c3dece3f011574955ccba53b9224ee9f78dfd54e/modeling_chatglm.py", line 573, in forward
attention_outputs = self.attention(
File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sss/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/c3dece3f011574955ccba53b9224ee9f78dfd54e/modeling_chatglm.py", line 398, in forward
mixed_raw_layer = self.query_key_value(hidden_states)
File "/home/sss/.conda/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sss/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/c3dece3f011574955ccba53b9224ee9f78dfd54e/quantization.py", line 137, in forward
output = W8A16Linear.apply(input, self.weight, self.weight_scale, self.weight_bit_width)
File "/home/sss/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/c3dece3f011574955ccba53b9224ee9f78dfd54e/quantization.py", line 22, in forward
output = inp.mm(weight.t())
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat2 in method wrapper_mm)
Not that easy, sigh.
You have commented out the cuda() in your code. Why does it still run on cuda?
It seems the error comes from quantization. Maybe you can remove .quantize(8) and try again.
Disclaimer: I'm not familiar with torch.
@duanqn The line
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True,cache_dir='/home/sss/ModelWeight',device_map="auto").half().quantize(8)#.cuda()
uses device_map="auto", which assigns the model to multiple GPUs automatically, so there is no need to call cuda() manually.
I have tried half(), but I get the same error.
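To see how device_map="auto" actually split the model, one option is to look at the placement that transformers records on the loaded model as model.hf_device_map. The sketch below groups such a mapping by device. The helper name group_modules_by_device and the example dictionary are hypothetical, made up for illustration and not taken from a real run.

```python
from collections import defaultdict
from typing import Dict, List, Union

Device = Union[int, str]


def group_modules_by_device(device_map: Dict[str, Device]) -> Dict[Device, List[str]]:
    """Invert a module -> device mapping into device -> [module names]."""
    grouped: Dict[Device, List[str]] = defaultdict(list)
    for module_name, device in device_map.items():
        grouped[device].append(module_name)
    return dict(grouped)


# Hypothetical placement, shaped like the dict stored in model.hf_device_map.
example_map = {
    "transformer.word_embeddings": 0,
    "transformer.layers.0": 0,
    "transformer.layers.1": 1,
    "lm_head": 1,
}

for device, modules in group_modules_by_device(example_map).items():
    print(f"device {device}: {modules}")
```

With a real model the call would be group_modules_by_device(model.hf_device_map). A module named in the traceback that sits on a different device from its input is a likely place to start looking.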
Can you please try the code from the Multi-GPU Deployment section?
from utils import load_model_on_gpus
model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=2)
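For context, a helper like this can build an explicit device map instead of relying on device_map="auto". The following is a minimal sketch of one way to spread the layers evenly across GPUs; it is not confirmed to be the repository's actual implementation. The function name build_device_map is hypothetical, and the module names and the default of 28 transformer layers are assumptions about the ChatGLM-6B checkpoint.

```python
from typing import Dict


def build_device_map(num_gpus: int, num_layers: int = 28) -> Dict[str, int]:
    """Map the embeddings, each transformer layer and the output head to a GPU index."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    # Assumed module names. In this sketch the embeddings, final layer norm
    # and output head are all kept on GPU 0 (a design choice, not a requirement).
    device_map = {
        "transformer.word_embeddings": 0,
        "transformer.final_layernorm": 0,
        "lm_head": 0,
    }

    # Treat the embeddings and the norm/head pair as two extra slots, so
    # GPU 0 receives fewer transformer layers than the other GPUs.
    slots_per_gpu = (num_layers + 2) / num_gpus
    used, gpu = 2, 0
    for i in range(num_layers):
        # Move to the next GPU once the current one is full; the last GPU
        # absorbs any remainder.
        if used >= slots_per_gpu and gpu < num_gpus - 1:
            gpu += 1
            used = 0
        device_map[f"transformer.layers.{i}"] = gpu
        used += 1
    return device_map


device_map = build_device_map(num_gpus=2)
print(device_map["transformer.layers.12"], device_map["transformer.layers.13"])
```

A map of this shape could then be handed to accelerate's dispatch_model(model, device_map=device_map), assuming the module names match the loaded model.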