CogVLM2
Hello, does CogVLM2 only support single-GPU inference? Two GPUs don't seem to work for me.
System Info / 系統信息
GPU: V100 32GB * 2, CUDA: 12.2
Who can help? / 谁可以帮助到您?
Information / 问题信息
- [X] The official example scripts / 官方的示例脚本
- [ ] My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
GPU: V100 32GB * 2, CUDA: 12.2
Expected behavior / 期待表现
How should I run inference on multiple GPUs?
You used the wrong example; use basic_demo/cli_demo_multi_gpus.py instead.
Hello, I tried it and hit a different error. Here is my code:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, load_checkpoint_and_dispatch, infer_auto_device_map

MODEL_PATH = "/mnt/data/spdi-code/paddlechat/cogvlm2-llama3-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)

# build the model skeleton without allocating real weights
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        trust_remote_code=True,
    )

num_gpus = torch.cuda.device_count()
max_memory_per_gpu = "16GiB"
if num_gpus > 2:
    max_memory_per_gpu = f"{round(42 / num_gpus)}GiB"

# split the layers across the available GPUs, keeping each decoder layer on a single device
device_map = infer_auto_device_map(
    model=model,
    max_memory={i: max_memory_per_gpu for i in range(num_gpus)},
    no_split_module_classes=["CogVLMDecoderLayer"]
)
model = load_checkpoint_and_dispatch(model, MODEL_PATH, device_map=device_map, dtype=TORCH_TYPE)
model = model.eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

# text-only query (no image)
query = text_only_template.format('您好')
history = []
input_by_model = model.build_conversation_input_ids(
    tokenizer,
    query=query,
    history=history,
    template_version='chat'
)

inputs = {
    'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
    'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
    'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
    'image': None
}
gen_kwargs = {
    "max_new_tokens": 2048,
    "pad_token_id": 128002,
}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    response = tokenizer.decode(outputs[0])
    response = response.split("")[0]  # NOTE: splitting on an empty string raises ValueError
    print("\nCogVLM2:", response)
    history.append((query, response))
```
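If it is unclear whether the weights actually end up on both cards, a quick check (not part of the demo script, just a debugging aid) is to count the entries of the computed `device_map` before dispatching:

```python
# optional sanity check: count how many modules infer_auto_device_map assigned to each device
from collections import Counter

print(Counter(device_map.values()))  # expect entries for both GPU 0 and GPU 1
```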
After the memory was allocated across the GPUs properly, I get a new error.
Try printing `response` directly instead of doing the step that strips the last special token; that is, just remove that line.
Yes, it should be `response = response.split("<|end_of_text|>")[0]`; the documentation shows `response = response.split("")[0]`.
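For context, a minimal sketch of that fix, reusing `outputs` and `inputs` from the code above:

```python
response = tokenizer.decode(outputs[0])
# split on the literal end-of-text marker instead of an empty string,
# which is what triggers "ValueError: empty separator"
response = response.split("<|end_of_text|>")[0]
print("\nCogVLM2:", response)
```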
Changing it to the following also works and is more standard, since `skip_special_tokens=True` already strips `<|end_of_text|>` (and other special tokens) during decoding, so no manual split is needed:

```python
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("\nCogVLM2:", response)
    history.append((query, response))
```