Recent changes are causing "found at least two devices"
System Info
transformers 4.43.3, python 3.10, linux
Who can help?
@ArthurZucker
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I have received multiple reports that model loading behaviour recently changed, causing a device error. This can usually be fixed by specifying the device_map, but prior to the recent changes (I don't know exactly when this happened), the model loaded and ran inference on multiple GPUs without any issues.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0
Referenced issues: https://github.com/casper-hansen/AutoAWQ/issues/510 https://github.com/casper-hansen/AutoAWQ/issues/558 https://github.com/casper-hansen/AutoAWQ/issues/571
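For context, the reports above come from a standard AutoAWQ quantization run. A minimal sketch of that flow is below; the model path is a placeholder, and a machine with at least one CUDA device is assumed.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "some-org/some-fp16-model"  # hypothetical path, for illustration only
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Loaded with the default device_map=None, i.e. weights start on the CPU.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# The calibration inference performed during quantization is where the
# "found at least two devices" RuntimeError is reported.
model.quantize(tokenizer, quant_config=quant_config)
```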
Expected behavior
The expected behavior is that these errors do not occur with the default setting of device_map=None. I am generally not sure what exactly changed, so it is hard to be more precise.
Thanks for reporting, could you try with the latest release? Otherwise, sorry for the inconvenience and cc @SunMarc !
I tested this today, and it's still an issue with the latest transformers release (v4.44.2) at the time of writing.
😢 I don't know if this is accelerate or not, so pinging @muellerzr as well
Looks like AWQ is another model that can't be fast-loaded. Will put in a fix
Potentially. I'm not too familiar with the AWQ codebase. The PR that likely broke this is here: https://github.com/huggingface/transformers/pull/31771
In the model definition we need to set `_supports_param_buffer_assignment = False`, which needs to be done on the AWQ side
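For reference, a minimal sketch of where that flag lives, assuming the class in question subclasses transformers' `PreTrainedModel` (the class name below is hypothetical; whether the AWQ wrapper classes actually derive from `PreTrainedModel` is part of the open question here):

```python
from transformers import PreTrainedModel


class AwqBackedModel(PreTrainedModel):  # hypothetical name, for illustration only
    # Opt out of the fast param/buffer assignment path introduced in
    # huggingface/transformers#31771; loading then falls back to the
    # slower copy-based path.
    _supports_param_buffer_assignment = False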
@muellerzr For reference, this error occurs when we want to run inference to quantize the model. I have not received similar reports for inference of quantized models. So in other words, this is happening when running models with BF16/FP16.
EDIT: Model loading is a fairly standard use of the auto classes in transformers; the relevant code is shown below.
@classmethod
def from_pretrained(
    self,
    model_path: Annotated[str, Doc("A Huggingface path or local path to a model.")],
    model_type: Annotated[str, Doc("The model type, loaded from config.json.")],
    torch_dtype: Annotated[
        torch.dtype,
        Doc(
            "The dtype to load the model as. May not work with other values than float16."
        ),
    ] = torch.float16,
    trust_remote_code: Annotated[
        bool,
        Doc(
            "Useful for Huggingface repositories that have not been integrated into transformers yet."
        ),
    ] = True,
    safetensors: Annotated[
        bool, Doc("Whether to download/load safetensors instead of torch weights.")
    ] = True,
    device_map: Annotated[
        Union[str, Dict],
        Doc(
            "A device map that will be passed onto the model loading method from transformers."
        ),
    ] = None,
    download_kwargs: Annotated[
        Dict,
        Doc("Used to configure the model download."),
    ] = None,
    **model_init_kwargs: Annotated[
        Dict,
        Doc(
            "Additional kwargs that are passed to the model during initialization."
        ),
    ],
):
    """A method for initialization of pretrained models, usually in FP16."""
    # Get weights path and quant config
    model_weights_path, config, quant_config = self._load_config(
        self,
        model_path,
        "",
        safetensors,
        trust_remote_code=trust_remote_code,
        download_kwargs=download_kwargs,
    )

    target_cls_name = TRANSFORMERS_AUTO_MAPPING_DICT[config.model_type]
    target_cls = getattr(transformers, target_cls_name)

    processor = None
    if target_cls_name == "AutoModelForVision2Seq":
        processor = AutoProcessor.from_pretrained(model_weights_path)
        processor: CLIPImageProcessor = processor.image_processor

    # If not quantized, must load with AutoModelForCausalLM
    model = target_cls.from_pretrained(
        model_weights_path,
        trust_remote_code=trust_remote_code,
        torch_dtype=torch_dtype,
        use_safetensors=safetensors,
        device_map=device_map,
        **model_init_kwargs,
    )

    model.eval()

    return self(
        model,
        model_type,
        is_quantized=False,
        config=config,
        quant_config=quant_config,
        processor=processor,
    )
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Has there been any fix for this? This is also affecting autogptq: #729
I've just added a pull request for a patch that I believe resolves this issue. Feel free to try installing this patch to confirm the fix: https://github.com/davedgd/transformers/tree/patch-1
@muellerzr: Please note I did try to set `_supports_param_buffer_assignment = False` on the AWQ side based on your suggestion, but this appeared to be a red herring in my testing.
I am also encountering this issue when using dynamic rope scaling and here is what's happening:
- During `LlamaAttention.__init__()`, the `LlamaRotaryEmbedding` module is initialized. No `device` arg is provided: https://github.com/huggingface/transformers/blob/d6ba1ac041ac0b07bc589dd82a67cfb76f75d0f9/src/transformers/models/llama/modeling_llama.py#L304
- In `LlamaRotaryEmbedding.__init__()`, the `inv_freq` and `original_inv_freq` tensors are created, and since `device` is not provided they are placed on the CPU.
- During execution, in `_dynamic_frequency_update()`, if `inv_freq` is growing, the tensor is recomputed and placed correctly on the specified device (cuda for me): https://github.com/huggingface/transformers/blob/d6ba1ac041ac0b07bc589dd82a67cfb76f75d0f9/src/transformers/models/llama/modeling_llama.py#L135-L136
- However, if it needs to reset, the `original_inv_freq` is used, which as I mentioned was placed on the CPU. This causes the two-device error in the `forward()` call: https://github.com/huggingface/transformers/blob/d6ba1ac041ac0b07bc589dd82a67cfb76f75d0f9/src/transformers/models/llama/modeling_llama.py#L142
This can be fixed with a simple change like this: https://github.com/trevor-m/transformers/commit/1a7e62ab585a2b6ff0feab3256a0946449ee6954
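To make the idea concrete, here is a condensed method sketch of `_dynamic_frequency_update` with that kind of fix applied. This is a sketch of the approach, not the exact linked commit; attribute names follow the transformers code linked above, and some details (e.g. extra rope kwargs) are omitted.

```python
import torch

# Condensed sketch of LlamaRotaryEmbedding._dynamic_frequency_update with the
# proposed fix: move original_inv_freq onto the active device before restoring it.
def _dynamic_frequency_update(self, position_ids, device):
    seq_len = torch.max(position_ids) + 1
    if seq_len > self.max_seq_len_cached:  # growth: recompute inv_freq on `device`
        inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, seq_len=seq_len)
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.max_seq_len_cached = seq_len

    if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len:
        # reset: original_inv_freq may still live on the CPU, so move it first
        self.original_inv_freq = self.original_inv_freq.to(device)
        self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)
        self.max_seq_len_cached = self.original_max_seq_len
```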
Thanks for the nice report! It seems that there is indeed a device mismatch here. However, one point I don't get is why `original_inv_freq` still stays on the CPU if we move the whole model to CUDA. Could you share a reproducer? That would be very helpful!
Then, if the model is split across different devices thanks to accelerate hooks, we shouldn't have issues. The only problem happens when the rope module is still on the CPU while the rest of the model is on CUDA without accelerate hooks; in that case it is expected that we get a device mismatch.
Let me try to make a small reproducer. I wonder if using `register_buffer` for the `original_inv_freq` would allow it to move alongside the whole model when we change devices.
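To illustrate the difference: buffers registered with `register_buffer` are moved by `nn.Module.to()`, while plain tensor attributes are not, which is exactly how `original_inv_freq` can get left behind on the CPU. A small self-contained sketch (the `RopeLike` class below is just for demonstration, not transformers code):

```python
import torch
import torch.nn as nn

class RopeLike(nn.Module):
    def __init__(self):
        super().__init__()
        inv_freq = torch.arange(4, dtype=torch.float32)
        # Tracked by .to()/.cuda(): moves together with the module.
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        # Plain attribute: NOT tracked, stays wherever it was created (here, CPU).
        self.original_inv_freq = inv_freq.clone()

m = RopeLike()
if torch.cuda.is_available():
    m = m.to("cuda")
    print(m.inv_freq.device)           # cuda:0
    print(m.original_inv_freq.device)  # cpu -> source of the device mismatch
```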
Please ignore my earlier comment (I deleted it to avoid confusion) -- it turns out there are multiple issues with AutoAWQ that are complicating my testing of relevant fixes, and I need to more thoroughly evaluate what's going on to figure out the best solution(s).
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
+1, qwen2.5-72b-instruct
Has this been solved?
Is it the same as #35505? 🤗