AutoAWQ
Quantizing a model fails with RuntimeError: Expected all tensors to be on the same device
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
```
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from transformers import AwqConfig, AutoConfig
import torch

model_path = ''
quant_path = ''
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    # trust_remote_code=True,
    low_cpu_mem_usage=True,
    use_cache=False,
    # device_map='cuda:0',
    # torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
```
Downgrade the transformers version.
Same problem here. Version 0.2.6 installs transformers 4.43.3, which gives an error during quantization. In this case it is the quantization code that fails; the inference code works fine. Tested on two different machines. It is solved by reinstalling transformers 4.42.4.
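For reference, the downgrade is just (assuming a standard pip environment):
```
pip install transformers==4.42.4
```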
It should not be like this @casper-hansen
The default loading of the model in transformers seems to have changed recently. For now, you can just use device_map when needed.
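For example, a minimal sketch (the device_map value here is an assumption about your hardware; adjust it to whatever fits your setup):
```python
from awq import AutoAWQForCausalLM

# Sketch: pass device_map explicitly so all tensors end up on one device.
model = AutoAWQForCausalLM.from_pretrained(
    model_path,                 # same model_path as in the script above
    low_cpu_mem_usage=True,
    use_cache=False,
    device_map="cuda:0",        # assumption: a single GPU with enough memory
)
```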
This also did not help in my case. I quantize a 70B model on a single A100, and with the default settings this used to run normally. With the new versions of autoawq and transformers, if I set the device map to the CPU, a different error appears:
```
Traceback (most recent call last):
  File "/home/jupyter/training/to_awq.py", line 17, in <module>
    model.quantize(tokenizer, quant_config=quant_config)
  File "/home/administrator/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaForCausalLM' object has no attribute 'quantize'. Did you mean: 'dequantize'?
```
And if I set the device map to the GPU, I get OOM.
As indicated above, the problem is solved by downgrading transformers. For me this is not a problem, but it does not seem normal for general use.
Similar issue with the following environment:
transformers 4.42.4
AutoAWQ 0.2.6+cu118
AutoAWQ_Kernels 0.0.6+cu118
Loading with device_map="auto":
```python
model = AutoAWQForCausalLM.from_pretrained(config.model_path, device_map="auto", safetensors=True)
```
This raises the error below, which was solved by specifying the device explicitly:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
```
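For example, a sketch of what "specifying the device" could look like here (assumes the model fits on one GPU; `config.model_path` is the same object as in the snippet above):
```python
# Pin the whole model to one GPU instead of letting "auto" shard it
# across cuda:0 and cuda:1 during quantization.
model = AutoAWQForCausalLM.from_pretrained(
    config.model_path, device_map="cuda:0", safetensors=True
)
```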
But what if the model is larger than 80 GB (e.g. Qwen2-72B)?
To convert meta-llama/Meta-Llama-3.1-70B-Instruct, transformers must be upgraded to 4.43.x. When I use 4.43.3, I get the same error.
@billvsme I'm using meta-llama/Meta-Llama-3.1-70B-Instruct and I get the same error even though I tried transformers==4.43.3 and 4.44.0. Do I need to specify my entire env?
Same issue. @r4dm's solution doesn't work for me, as I'm trying to quantize a fine-tuned Llama 3.1 model.
Unfortunately, simply installing transformers==4.42.4 doesn't work for Llama 3.1, as this reintroduces an issue with rope_scaling:
```
ValueError: rope_scaling must be a dictionary with two fields, type and factor, got {'factor': 8.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
```
Setting device_map="auto" when loading the model unfortunately doesn't work with the latest transformers either.
The temporary solution works only with Llama 3, not 3.1.
That's because support for 3.1 was added in transformers v4.43.0.
For anyone watching this, consider also tracking this issue in transformers: #32420
Same issue, but if you have enough VRAM or multiple GPUs, you can set device_map="auto" and it should work. CPU+GPU quantization for Llama 3.1 is still broken as far as I know.
I have a potential fix that may remedy both the "two devices" error and the rope_scaling issue (by way of allowing for a newer transformers version). Feel free to try out the patch here:
https://github.com/davedgd/transformers/tree/patch-1
e.g.,
```
pip install git+https://github.com/davedgd/transformers@patch-1
```
+1
Same issue
Same issue
This was fully fixed on recent versions. Can you confirm what version of autoawq you are using and provide a code sample? I can probably help you resolve it.
Hey @davedgd
My mistake, I'm not using AutoAWQ.
I was trying to call model.generate() on an input that was on the CPU; I then used pipeline() and that fixed it.
No worries — glad to hear you figured it out!
@davedgd Thank you!
Still facing this problem :(
```
File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 183, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(2, 3)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
```
And this is my code:
```python
model_path = 'Qwen/Qwen2-VL-7B-Instruct'
torch.cuda.empty_cache()

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, safetensors=True, torch_dtype=torch.float16, device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
Try it without device_map="auto" in AutoAWQForCausalLM.from_pretrained, e.g.,
```python
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
```
You can probably use your other args, but they shouldn't be needed. Definitely avoid device_map.
Without device_map="auto" it worked... which is mysterious to me. How come?
The answer is technical, but long story short: the adjustment was made in the multi-GPU fix by @casper-hansen a few versions back, in the 0.2.7 releases. Not using device_map="auto" also aligns with the current examples:
https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py
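For completeness, a minimal sketch in the spirit of that example (placeholder paths, not the exact contents of the linked file): load without any device_map, then quantize and save.
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/your/model"   # placeholder
quant_path = "path/to/output-awq"   # placeholder
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load without device_map so AutoAWQ can manage device placement itself.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize and save
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```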