Recent changes are causing "found at least two devices"
System Info
transformers 4.43.3, python 3.10, linux
Who can help?
@ArthurZucker
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I have received multiple reports that model loading behaviour recently changed, causing a device error. This can usually be fixed by specifying the device_map, but prior to the recent changes (I don't know exactly when this happened), the model loaded and ran inference on multiple GPUs without any issues.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0
Referenced issues: https://github.com/casper-hansen/AutoAWQ/issues/510 https://github.com/casper-hansen/AutoAWQ/issues/558 https://github.com/casper-hansen/AutoAWQ/issues/571
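For context, the reports above come from a standard AutoAWQ quantization run. A minimal sketch of that flow is below; the model path is a placeholder, and a machine with at least one CUDA device is assumed.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "some-org/some-fp16-model"  # hypothetical path, for illustration only
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Loaded with the default device_map=None, i.e. weights start on the CPU.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# The calibration inference performed during quantization is where the
# "found at least two devices" RuntimeError is reported.
model.quantize(tokenizer, quant_config=quant_config)
```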
Expected behavior
The expected behavior is that these errors do not occur with the default setting of device_map=None. I am generally not sure what exactly changed, so it is hard to be more precise.
Thanks for reporting, could you try with the latest release? Otherwise, sorry for the inconvenience and cc @SunMarc !
I tested this today, and it's still an issue with the latest transformers release (v4.44.2) at the time of writing.
😢 I don't know if this is accelerate or not, so pinging @muellerzr as well
Looks like AWQ is another model that can't be fast-loaded. Will put in a fix
Potentially. I'm not too familiar with the AWQ codebase. The PR that likely broke this is here: https://github.com/huggingface/transformers/pull/31771
In the model definition we need to set `_supports_param_buffer_assignment = False`, which needs to be done on the AWQ side
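For reference, a minimal sketch of where that flag lives, assuming the class in question subclasses transformers' `PreTrainedModel` (the class name below is hypothetical; whether the AWQ wrapper classes actually derive from `PreTrainedModel` is part of the open question here):

```python
from transformers import PreTrainedModel


class AwqBackedModel(PreTrainedModel):  # hypothetical name, for illustration only
    # Opt out of the fast param/buffer assignment path introduced in
    # huggingface/transformers#31771; loading then falls back to the
    # slower copy-based path.
    _supports_param_buffer_assignment = False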
@muellerzr For reference, this error occurs when we want to run inference to quantize the model. I have not received similar reports for inference of quantized models. So in other words, this is happening when running models with BF16/FP16.
EDIT: Model loading is a fairly standard use of the auto classes in transformers; the relevant code is shown below.
@classmethod
def from_pretrained(
    self,
    model_path: Annotated[str, Doc("A Huggingface path or local path to a model.")],
    model_type: Annotated[str, Doc("The model type, loaded from config.json.")],
    torch_dtype: Annotated[
        torch.dtype,
        Doc(
            "The dtype to load the model as. May not work with other values than float16."
        ),
    ] = torch.float16,
    trust_remote_code: Annotated[
        bool,
        Doc(
            "Useful for Huggingface repositories that have not been integrated into transformers yet."
        ),
    ] = True,
    safetensors: Annotated[
        bool, Doc("Whether to download/load safetensors instead of torch weights.")
    ] = True,
    device_map: Annotated[
        Union[str, Dict],
        Doc(
            "A device map that will be passed onto the model loading method from transformers."
        ),
    ] = None,
    download_kwargs: Annotated[
        Dict,
        Doc("Used to configure the model download."),
    ] = None,
    **model_init_kwargs: Annotated[
        Dict,
        Doc(
            "Additional kwargs that are passed to the model during initialization."
        ),
    ],
):
    """A method for initialization of pretrained models, usually in FP16."""
    # Get weights path and quant config
    model_weights_path, config, quant_config = self._load_config(
        self,
        model_path,
        "",
        safetensors,
        trust_remote_code=trust_remote_code,
        download_kwargs=download_kwargs,
    )

    target_cls_name = TRANSFORMERS_AUTO_MAPPING_DICT[config.model_type]
    target_cls = getattr(transformers, target_cls_name)

    processor = None
    if target_cls_name == "AutoModelForVision2Seq":
        processor = AutoProcessor.from_pretrained(model_weights_path)
        processor: CLIPImageProcessor = processor.image_processor

    # If not quantized, must load with AutoModelForCausalLM
    model = target_cls.from_pretrained(
        model_weights_path,
        trust_remote_code=trust_remote_code,
        torch_dtype=torch_dtype,
        use_safetensors=safetensors,
        device_map=device_map,
        **model_init_kwargs,
    )

    model.eval()

    return self(
        model,
        model_type,
        is_quantized=False,
        config=config,
        quant_config=quant_config,
        processor=processor,
    )
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Has there been any fix for this? This is also affecting autogptq: #729
I've just added a pull request for a patch that I believe resolves this issue. Feel free to try installing this patch to confirm the fix: https://github.com/davedgd/transformers/tree/patch-1
@muellerzr: Please note I did try to set `_supports_param_buffer_assignment = False` on the AWQ side based on your suggestion, but this appeared to be a red herring in my testing.
I am also encountering this issue when using dynamic rope scaling and here is what's happening:
- During `LlamaAttention.__init__()`, the `LlamaRotaryEmbedding` module is initialized. No `device` arg is provided: https://github.com/huggingface/transformers/blob/d6ba1ac041ac0b07bc589dd82a67cfb76f75d0f9/src/transformers/models/llama/modeling_llama.py#L304
- In `LlamaRotaryEmbedding.__init__()`, the `inv_freq` and `original_inv_freq` tensors are created, and since `device` is not provided they are placed on the CPU.
- During execution, in `_dynamic_frequency_update()`, if `inv_freq` is growing, the tensor is recomputed and placed correctly on the specified device (cuda for me): https://github.com/huggingface/transformers/blob/d6ba1ac041ac0b07bc589dd82a67cfb76f75d0f9/src/transformers/models/llama/modeling_llama.py#L135-L136
- However, if it needs to reset, the `original_inv_freq` is used, which as I mentioned was placed on the CPU. This causes the two-device error in the `forward()` call: https://github.com/huggingface/transformers/blob/d6ba1ac041ac0b07bc589dd82a67cfb76f75d0f9/src/transformers/models/llama/modeling_llama.py#L142
This can be fixed with a simple change like this: https://github.com/trevor-m/transformers/commit/1a7e62ab585a2b6ff0feab3256a0946449ee6954
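To make the idea concrete, here is a condensed method sketch of `_dynamic_frequency_update` with that kind of fix applied. This is a sketch of the approach, not the exact linked commit; attribute names follow the transformers code linked above, and some details (e.g. extra rope kwargs) are omitted.

```python
import torch

# Condensed sketch of LlamaRotaryEmbedding._dynamic_frequency_update with the
# proposed fix: move original_inv_freq onto the active device before restoring it.
def _dynamic_frequency_update(self, position_ids, device):
    seq_len = torch.max(position_ids) + 1
    if seq_len > self.max_seq_len_cached:  # growth: recompute inv_freq on `device`
        inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, seq_len=seq_len)
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.max_seq_len_cached = seq_len

    if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len:
        # reset: original_inv_freq may still live on the CPU, so move it first
        self.original_inv_freq = self.original_inv_freq.to(device)
        self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)
        self.max_seq_len_cached = self.original_max_seq_len
```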
Thanks for the nice report! It seems that there is indeed a device mismatch here. However, one point I don't get is why `original_inv_freq` still stays on the CPU if we move the whole model to CUDA. Could you share a reproducer? That would be very helpful!
Then, if the model is split across different devices thanks to accelerate hooks, we shouldn't have issues. The only problem happens when the rope module is still on the CPU while the rest of the model is on CUDA without accelerate hooks; in that case it is expected that we get a device mismatch.
Let me try to make a small reproducer. I wonder if using `register_buffer` for the `original_inv_freq` would allow it to move alongside the whole model when we change devices.
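To illustrate the difference: buffers registered with `register_buffer` are moved by `nn.Module.to()`, while plain tensor attributes are not, which is exactly how `original_inv_freq` can get left behind on the CPU. A small self-contained sketch (the `RopeLike` class below is just for demonstration, not transformers code):

```python
import torch
import torch.nn as nn

class RopeLike(nn.Module):
    def __init__(self):
        super().__init__()
        inv_freq = torch.arange(4, dtype=torch.float32)
        # Tracked by .to()/.cuda(): moves together with the module.
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        # Plain attribute: NOT tracked, stays wherever it was created (here, CPU).
        self.original_inv_freq = inv_freq.clone()

m = RopeLike()
if torch.cuda.is_available():
    m = m.to("cuda")
    print(m.inv_freq.device)           # cuda:0
    print(m.original_inv_freq.device)  # cpu -> source of the device mismatch
```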
Please ignore my earlier comment (I deleted it to avoid confusion) -- it turns out there are multiple issues with AutoAWQ that are complicating my testing of relevant fixes, and I need to more thoroughly evaluate what's going on to figure out the best solution(s).
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
+1, qwen2.5-72b-instruct
Has this been solved?
Is it the same as #35505? 🤗