```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
```
```python
from awq import AutoAWQForCausalLM
from awq.utils.utils import get_best_device
from transformers import AutoTokenizer, TextStreamer

quant_path = "/workspace/awq_model"

if get_best_device() == "cpu":
    model = AutoAWQForCausalLM.from_quantized(quant_path, use_qbits=True, fuse_layers=False)
else:
    model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True, device_map="balanced")

tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

# Initialize the streamer for token-by-token output
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = (
    "You're standing on the surface of the Earth. "
    "You walk one mile south, one mile west and one mile north. "
    "You end up exactly where you started. Where are you?"
)

chat = [
    {"role": "system", "content": "You are a concise assistant that helps answer questions."},
    {"role": "user", "content": prompt},
]

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

tokens = tokenizer.apply_chat_template(chat, return_tensors="pt")
tokens = tokens.to("cuda:0")  # hard-coded device; this is where the mismatch occurs

generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=64,
    eos_token_id=terminators,
)
```
Here's my script for running the quantized model. However, it fails with the error above. How can I fix it?
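For context: `device_map="balanced"` shards the weights across every visible GPU, so the embedding table can land on `cuda:7` while the script pins the tokens to `cuda:0`, which is exactly the mismatch the traceback reports. A minimal sketch of one workaround (assuming the AutoAWQ wrapper exposes the underlying transformers model as `model.model`) is to ask the model where its input embeddings live instead of hard-coding the device:

```python
# Sketch: send the tokens to whichever device holds the input embeddings,
# instead of hard-coding "cuda:0". Assumes the AutoAWQ wrapper keeps the
# underlying transformers model on `model.model`.
embed_device = model.model.get_input_embeddings().weight.device
tokens = tokenizer.apply_chat_template(chat, return_tensors="pt").to(embed_device)
```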
You can try:

```python
import os

# Must be set before torch initializes CUDA, so keep it at the top of the script.
os.environ['CUDA_VISIBLE_DEVICES'] = '6'

import torch

# With only one GPU visible, it is always exposed as cuda:0.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model.to(device)
```
Works for me.
I have to load the Qwen2.5-72B model across two GPUs. How do I quantize it?
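For reference, a minimal sketch of the standard AutoAWQ quantization flow (the model name, output path, and `quant_config` values below are placeholders, not advice specific to this thread). AutoAWQ calibrates one decoder layer at a time on the GPU, so the full 72B model does not have to fit in VRAM at once:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Placeholder paths -- adjust to your checkpoint and output location.
model_path = "Qwen/Qwen2.5-72B-Instruct"
quant_path = "qwen2.5-72b-awq"

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the fp16 weights with low CPU memory usage; AutoAWQ moves layers
# to the GPU one at a time during calibration.
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```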
Try version 0.2.8 and remove the `device_map="balanced"` parameter; that works for me.
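That is, a sketch of the load call with that argument dropped, letting the library pick device placement itself:

```python
# Same load call as in the original script, minus device_map="balanced".
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
```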