CogVLM2

Hello, is only single-GPU inference supported? Two GPUs don't seem to work for me.

Open whysirier opened this issue 1 year ago • 5 comments

System Info / 系統信息

GPU: V100 32GB * 2, CUDA: 12.2

Who can help? / 谁可以帮助到您?

[Screenshot: 企业微信截图_20240523163630]

Information / 问题信息

  • [X] The official example scripts / 官方的示例脚本
  • [ ] My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

GPU: V100 32GB * 2, CUDA: 12.2

Expected behavior / 期待表现

How should I run multi-GPU inference?

whysirier · May 23 '24 08:05

You're using the wrong example; use basic_demo/cli_demo_multi_gpus.py instead.

zRzRzRzRzRzRzR · May 23 '24 10:05

> You're using the wrong example; use basic_demo/cli_demo_multi_gpus.py instead.

Hello, I tried that and it gives a different error. Here is my code:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, load_checkpoint_and_dispatch, infer_auto_device_map

MODEL_PATH = "/mnt/data/spdi-code/paddlechat/cogvlm2-llama3-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)

with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        trust_remote_code=True,
    )

num_gpus = torch.cuda.device_count()
max_memory_per_gpu = "16GiB"
if num_gpus > 2:
    max_memory_per_gpu = f"{round(42 / num_gpus)}"

device_map = infer_auto_device_map(
    model=model,
    max_memory={i: max_memory_per_gpu for i in range(num_gpus)},
    no_split_module_classes=["CogVLMDecoderLayer"]
)
model = load_checkpoint_and_dispatch(model, MODEL_PATH, device_map=device_map, dtype=TORCH_TYPE)
model = model.eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

query = text_only_template.format('您好')
history = []
input_by_model = model.build_conversation_input_ids(
    tokenizer,
    query=query,
    history=history,
    template_version='chat'
)

inputs = {
    'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
    'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
    'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
    'image': None
}
gen_kwargs = {
    "max_new_tokens": 2048,
    "pad_token_id": 128002,
}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    response = tokenizer.decode(outputs[0])
    response = response.split("")[0]
    print("\nCogVLM2:", response)
    history.append((query, response))
```
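As a quick sanity check (not part of the original script), the `device_map` computed above can be inspected before generation to confirm the layers are actually spread across both GPUs; a minimal sketch:

```python
# Count how many modules infer_auto_device_map assigned to each device.
# With two V100s you would expect entries for both GPU 0 and GPU 1
# (plus possibly "cpu"/"disk" if the weights did not fit in GPU memory).
from collections import Counter
print(Counter(device_map.values()))
```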

[Screenshot: 企业微信截图_20240527094115]

whysirier · May 27 '24 02:05

> You're using the wrong example; use basic_demo/cli_demo_multi_gpus.py instead.

After allocating the GPU memory properly, I get a new error:

[Screenshot: 企业微信截图_20240527112118]

whysirier · May 27 '24 03:05

Try printing `response` directly, without the step that strips off the last special token; just remove that line and see.
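Concretely (a minimal sketch of that suggestion), the last lines of the generation block become:

```python
# Decode and print the response as-is; the line that split off the
# trailing special token is removed.
response = tokenizer.decode(outputs[0])
print("\nCogVLM2:", response)
```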

zRzRzRzRzRzRzR · May 27 '24 03:05

> Try printing `response` directly, without the step that strips off the last special token; just remove that line and see.

Yes, it should be `response = response.split("<|end_of_text|>")[0]`; the docs I saw show it as `response = response.split("")[0]`.
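For context, a minimal illustration (not from the thread) of why the version shown in the rendered docs fails: `str.split` rejects an empty separator, while splitting on the actual end-of-text token trims it as intended:

```python
raw = "你好!<|end_of_text|>"

try:
    raw.split("")                 # what the rendered docs ended up showing
except ValueError as e:
    print(e)                      # -> "empty separator"

print(raw.split("<|end_of_text|>")[0])  # -> "你好!"
```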

whysirier · May 27 '24 06:05

Changing it to the following also works, and is more standard:

```python
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("\nCogVLM2:", response)
    history.append((query, response))
```
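(For context: `skip_special_tokens=True` drops all special tokens during decoding, so the output no longer depends on the exact end-of-text string and the split step becomes unnecessary.)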

zRzRzRzRzRzRzR · May 28 '24 03:05