```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
```
```python
from awq import AutoAWQForCausalLM
from awq.utils.utils import get_best_device
from transformers import AutoTokenizer, TextStreamer

quant_path = "/workspace/awq_model"

if get_best_device() == "cpu":
    model = AutoAWQForCausalLM.from_quantized(quant_path, use_qbits=True, fuse_layers=False)
else:
    model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True, device_map="balanced")

tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

# Initialize the streamer for token-by-token output
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = (
    "You're standing on the surface of the Earth. "
    "You walk one mile south, one mile west and one mile north. "
    "You end up exactly where you started. Where are you?"
)

chat = [
    {"role": "system", "content": "You are a concise assistant that helps answer questions."},
    {"role": "user", "content": prompt},
]

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

tokens = tokenizer.apply_chat_template(chat, return_tensors="pt")
tokens = tokens.to("cuda:0")  # hard-coded device; this is where the mismatch occurs

generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=64,
    eos_token_id=terminators,
)
```
Here's my script for running the quantized model. However, it fails with the error above. How can I fix it?
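For context: `device_map="balanced"` shards the weights across every visible GPU, so the embedding table can land on `cuda:7` while the script pins the tokens to `cuda:0`, which is exactly the mismatch the traceback reports. A minimal sketch of one workaround (assuming the AutoAWQ wrapper exposes the underlying transformers model as `model.model`) is to ask the model where its input embeddings live instead of hard-coding the device:

```python
# Sketch: send the tokens to whichever device holds the input embeddings,
# instead of hard-coding "cuda:0". Assumes the AutoAWQ wrapper keeps the
# underlying transformers model on `model.model`.
embed_device = model.model.get_input_embeddings().weight.device
tokens = tokenizer.apply_chat_template(chat, return_tensors="pt").to(embed_device)
```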
You can try:

```python
import os

# Must be set before torch initializes CUDA, so keep it at the top of the script.
os.environ['CUDA_VISIBLE_DEVICES'] = '6'

import torch

# With only one GPU visible, it is always exposed as cuda:0.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model.to(device)
```
Works for me.
I have to load the Qwen2.5-72B model across two GPUs. How do I quantize it?
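For reference, a minimal sketch of the standard AutoAWQ quantization flow (the model name, output path, and `quant_config` values below are placeholders, not advice specific to this thread). AutoAWQ calibrates one decoder layer at a time on the GPU, so the full 72B model does not have to fit in VRAM at once:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Placeholder paths -- adjust to your checkpoint and output location.
model_path = "Qwen/Qwen2.5-72B-Instruct"
quant_path = "qwen2.5-72b-awq"

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the fp16 weights with low CPU memory usage; AutoAWQ moves layers
# to the GPU one at a time during calibration.
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```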
Try version 0.2.8 and remove the `device_map="balanced"` parameter; that works for me.
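That is, a sketch of the load call with that argument dropped, letting the library pick device placement itself:

```python
# Same load call as in the original script, minus device_map="balanced".
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
```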