Quantizing a model reports an error: RuntimeError: Expected all tensors to be on the same device

ShelterWFF opened this issue 1 year ago • 23 comments

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from transformers import AwqConfig, AutoConfig
import torch

model_path = ''
quant_path = ''
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    # trust_remote_code=True,
    low_cpu_mem_usage=True,
    use_cache=False,
    # device_map='cuda:0',
    # torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')

ShelterWFF avatar Jul 28 '24 17:07 ShelterWFF

Downgrade the transformers version.

ShelterWFF avatar Jul 28 '24 18:07 ShelterWFF

Same problem here. Version 0.2.6 installs transformers 4.43.3, which raises this error during quantization. The inference code works fine; it is only the quantization code that fails. Tested on two different machines. Reinstalling transformers version 4.42.4 solves it.
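For anyone else hitting this, the downgrade is simply a version pin:

pip install transformers==4.42.4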

It shouldn't be like this @casper-hansen

r4dm avatar Jul 29 '24 07:07 r4dm

The default model loading behavior in transformers seems to have changed recently. For now, you can just pass device_map explicitly when needed.
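For example, a minimal sketch of passing device_map explicitly when loading for quantization (the single-GPU mapping below is an assumption; adjust it to your hardware, or omit it to load on CPU):

model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
    device_map="cuda:0",  # assumption: the unquantized model fits on one GPU
)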

casper-hansen avatar Jul 29 '24 08:07 casper-hansen

This did not help in my case either. I quantize a 70B model on a single A100, and with the default settings this used to work fine. With the new versions of autoawq and transformers, if I specify the device map as CPU, a different error appears:

Traceback (most recent call last):
  File "/home/jupyter/training/to_awq.py", line 17, in <module>
    model.quantize(tokenizer, quant_config=quant_config)
  File "/home/administrator/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaForCausalLM' object has no attribute 'quantize'. Did you mean: 'dequantize'?

And if I specify the device map as GPU, I get an OOM.

As noted above, the problem is solved by downgrading transformers. For me this is not an issue, but for general use it does not seem right.

r4dm avatar Jul 29 '24 09:07 r4dm

Similar issue with the following environment:

transformers 4.42.4
AutoAWQ 0.2.6+cu118
AutoAWQ_Kernels 0.0.6+cu118

Loading with device_map="auto"

model = AutoAWQForCausalLM.from_pretrained(config.model_path, device_map="auto", safetensors=True)

gives

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

The error is solved by specifying a single device. But what if the model is larger than 80GB (e.g. qwen2-72b)?

FoolMark avatar Aug 01 '24 06:08 FoolMark

To convert meta-llama/Meta-Llama-3.1-70B-Instruct, transformers must be upgraded to 4.43.x, but when I use 4.43.3 I get the same error.

billvsme avatar Aug 01 '24 07:08 billvsme

@billvsme I'm using meta-llama/Meta-Llama-3.1-70B-Instruct and I get the same error even with transformers==4.43.3 and 4.44.0. Do I need to share my entire environment?

seolhokim avatar Aug 11 '24 10:08 seolhokim

Same issue. @r4dm's solution doesn't work for me, as I'm trying to quantize a fine-tuned Llama 3.1 model.

supa-thibaud avatar Aug 19 '24 09:08 supa-thibaud

Unfortunately, simply installing transformers==4.42.4 doesn't work for Llama 3.1, as this reintroduces an issue with rope_scaling:

ValueError: rope_scaling must be a dictionary with two fields, type and factor, got {'factor': 8.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}

Setting device_map="auto" when loading the model unfortunately doesn't work with the latest transformers either.
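For context (this is just a reading of the rope_scaling error above, not anything AutoAWQ-specific): transformers up to 4.42.x validates rope_scaling against the old two-field schema, while Llama 3.1 checkpoints ship the newer llama3 schema, so the downgrade and the upgrade each break in a different way:

# rope_scaling schema expected by transformers <= 4.42.x ("linear" is just an example type)
old_style = {"type": "linear", "factor": 8.0}

# rope_scaling shipped in Llama 3.1 configs (requires transformers >= 4.43)
llama31_style = {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3",
}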

William-Wildridge avatar Aug 19 '24 10:08 William-Wildridge

The temporary solution works only with Llama 3, not 3.1, because support for 3.1 was added in transformers v4.43.0.

r4dm avatar Aug 19 '24 12:08 r4dm

For anyone watching this, consider also tracking this issue in transformers: #32420

davedgd avatar Sep 05 '24 21:09 davedgd

Same issue, but if you have enough VRAM or multiple GPUs, you can set device_map="auto" and it should work. CPU+GPU quantization for Llama 3.1 is still broken as far as I know.

bkutasi avatar Sep 12 '24 14:09 bkutasi

I have a potential fix that may remedy both the "two devices" error and the rope_scaling issue (by way of allowing for a newer transformers version). Feel free to try out the patch here:

https://github.com/davedgd/transformers/tree/patch-1

e.g.,

pip install git+https://github.com/davedgd/transformers@patch-1

davedgd avatar Sep 27 '24 01:09 davedgd

+1

ArlanCooper avatar Dec 03 '24 13:12 ArlanCooper

Same issue

steveepreston avatar Jan 10 '25 17:01 steveepreston

> Same issue

This was fully fixed in recent versions. Can you confirm what version of autoawq you are using and provide a code sample? I can probably help you resolve it.

davedgd avatar Jan 10 '25 17:01 davedgd

Hey @davedgd

My mistake: I'm not using AutoAWQ. I was calling model.generate() on an input that was still on the CPU; switching to pipeline() fixed it.
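For anyone who hits the same thing outside AutoAWQ: besides switching to pipeline(), the usual fix is to move the inputs to the model's device before calling generate(). A minimal sketch, with model_path as a placeholder:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# generate() expects the input tensors on the same device as the model
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))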

steveepreston avatar Jan 10 '25 18:01 steveepreston

> Hey @davedgd
>
> My mistake: I'm not using AutoAWQ. I was calling model.generate() on an input that was still on the CPU; switching to pipeline() fixed it.

No worries — glad to hear you figured it out!

davedgd avatar Jan 10 '25 18:01 davedgd

@davedgd Thank you!

steveepreston avatar Jan 10 '25 19:01 steveepreston

still facing this problem :(

  File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 183, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(2, 3)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
and this is my code:

model_path = 'Qwen/Qwen2-VL-7B-Instruct'
torch.cuda.empty_cache()

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, safetensors=True, torch_dtype=torch.float16, device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

yupbank avatar Jan 20 '25 21:01 yupbank

> still facing this problem :( …

Try it without device_map="auto" in AutoAWQForCausalLM.from_pretrained, e.g.,

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)

You can probably use your other args, but they shouldn't be needed. Definitely avoid device_map.

davedgd avatar Jan 21 '25 00:01 davedgd

Without device_map="auto" it worked... which is mysterious to me. How come?

yupbank avatar Jan 21 '25 14:01 yupbank

> Without device_map="auto" it worked... which is mysterious to me. How come?

The answer is technical, but long story short, the adjustment was made in the multi-GPU fix by @casper-hansen a few versions back, in the 0.2.7 releases. Not using device_map="auto" aligns with the current examples as well:

https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py
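For completeness, a minimal sketch of that recommended pattern, mirroring the snippet earlier in this thread but without device_map (model name and paths are placeholders):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'Qwen/Qwen2-VL-7B-Instruct'   # placeholder: model to quantize
quant_path = 'Qwen2-VL-7B-Instruct-awq'    # placeholder: output directory
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load without device_map, per the advice above
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize and save
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)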

davedgd avatar Jan 21 '25 15:01 davedgd