
Recent changes are causing "found at least two devices"


System Info

transformers 4.43.3, python 3.10, linux

Who can help?

@ArthurZucker

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

I have received multiple reports that model-loading behaviour recently changed in a way that causes a device error. This can usually be worked around by specifying device_map explicitly (see the sketch at the end of this section), but prior to the recent changes (I don't know exactly when this happened), the model loaded and could run inference across multiple GPUs without any issues.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0

Referenced issues: https://github.com/casper-hansen/AutoAWQ/issues/510 https://github.com/casper-hansen/AutoAWQ/issues/558 https://github.com/casper-hansen/AutoAWQ/issues/571
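
For context, the workaround referred to above is to pass an explicit device_map at load time. A minimal sketch, assuming a placeholder model id:

    # Illustrative workaround: pass an explicit device_map instead of the
    # default device_map=None so that every weight ends up on one device.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "org/some-fp16-model",      # placeholder model id
        torch_dtype=torch.float16,
        device_map="auto",          # or e.g. {"": 0} to pin everything to cuda:0
    )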

Expected behavior

The expected behavior is that these errors do not occur with the default setting of device_map=None. I am not sure exactly what changed, so it is hard to be more precise.

casper-hansen avatar Aug 05 '24 05:08 casper-hansen

Thanks for reporting, could you try with the latest release? Otherwise, sorry for the inconvenience, and cc @SunMarc!

ArthurZucker avatar Aug 26 '24 15:08 ArthurZucker

> Thanks for reporting, could you try with the latest release? Otherwise, sorry for the inconvenience, and cc @SunMarc!

I tested this today, and it is still an issue with the latest transformers release (v4.44.2) at the time of writing.

davedgd avatar Aug 28 '24 02:08 davedgd

😢 I don't know whether this is an accelerate issue or not, so pinging @muellerzr as well.

ArthurZucker avatar Aug 28 '24 09:08 ArthurZucker

Looks like AWQ is another model that can't be fast-loaded. Will put in a fix.

muellerzr avatar Aug 28 '24 10:08 muellerzr

Potentially. I'm not too familiar with the AWQ codebase. The PR that likely broke this is here: https://github.com/huggingface/transformers/pull/31771

In the model definition we need to set _supports_param_buffer_assignment = False, which needs to be done on the AWQ side (see the sketch below).
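
For reference, a minimal sketch of what that opt-out might look like; the class name is hypothetical, and the actual flag would be set on the relevant model class on the AWQ side:

    # Sketch only: opting a model class out of the fast weight-assignment
    # loading path introduced in huggingface/transformers#31771.
    from transformers import PreTrainedModel

    class SomeAwqBackedModel(PreTrainedModel):
        # Fall back to the slower loading path instead of assigning checkpoint
        # tensors directly onto parameters/buffers.
        _supports_param_buffer_assignment = False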

muellerzr avatar Aug 28 '24 11:08 muellerzr

@muellerzr For reference, this error occurs when we run inference in order to quantize the model. I have not received similar reports for inference of already-quantized models. In other words, this happens when running models in BF16/FP16.

EDIT: Model loading is a fairly standard use of the transformers auto classes; the relevant method is shown below.

    @classmethod
    def from_pretrained(
        self,
        model_path: Annotated[str, Doc("A Huggingface path or local path to a model.")],
        model_type: Annotated[str, Doc("The model type, loaded from config.json.")],
        torch_dtype: Annotated[
            torch.dtype,
            Doc(
                "The dtype to load the model as. May not work with other values than float16."
            ),
        ] = torch.float16,
        trust_remote_code: Annotated[
            bool,
            Doc(
                "Useful for Huggingface repositories that have not been integrated into transformers yet."
            ),
        ] = True,
        safetensors: Annotated[
            bool, Doc("Whether to download/load safetensors instead of torch weights.")
        ] = True,
        device_map: Annotated[
            Union[str, Dict],
            Doc(
                "A device map that will be passed onto the model loading method from transformers."
            ),
        ] = None,
        download_kwargs: Annotated[
            Dict,
            Doc("Used for configure download model"),
        ] = None,
        **model_init_kwargs: Annotated[
            Dict,
            Doc(
                "Additional kwargs that are passed to the model during initialization."
            ),
        ],
    ):
        """A method for initialization of pretrained models, usually in FP16."""
        # Get weights path and quant config
        model_weights_path, config, quant_config = self._load_config(
            self,
            model_path,
            "",
            safetensors,
            trust_remote_code=trust_remote_code,
            download_kwargs=download_kwargs,
        )

        target_cls_name = TRANSFORMERS_AUTO_MAPPING_DICT[config.model_type]
        target_cls = getattr(transformers, target_cls_name)

        processor = None
        if target_cls_name == "AutoModelForVision2Seq":
            processor = AutoProcessor.from_pretrained(model_weights_path)
            processor: CLIPImageProcessor = processor.image_processor

        # If not quantized, must load with AutoModelForCausalLM
        model = target_cls.from_pretrained(
            model_weights_path,
            trust_remote_code=trust_remote_code,
            torch_dtype=torch_dtype,
            use_safetensors=safetensors,
            device_map=device_map,
            **model_init_kwargs,
        )

        model.eval()

        return self(
            model,
            model_type,
            is_quantized=False,
            config=config,
            quant_config=quant_config,
            processor=processor,
        )
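
For completeness, a typical call into the method above looks roughly like the following; the model path is a placeholder, and passing device_map is the workaround rather than the default:

    # Rough usage sketch of the method above. With the default device_map=None
    # the error in this issue shows up during inference; an explicit device_map
    # works around it.
    from awq import AutoAWQForCausalLM

    model = AutoAWQForCausalLM.from_pretrained(
        "org/some-fp16-model",   # placeholder model path
        device_map="auto",       # workaround; the default is device_map=None
    )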

casper-hansen avatar Aug 29 '24 10:08 casper-hansen

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Sep 23 '24 08:09 github-actions[bot]

Has there been any fix for this? It is also affecting AutoGPTQ: #729

JeevanBhoot avatar Sep 26 '24 13:09 JeevanBhoot

I've just added a pull request for a patch that I believe resolves this issue. Feel free to try installing this patch to confirm the fix: https://github.com/davedgd/transformers/tree/patch-1

@muellerzr: Please note I did try to set _supports_param_buffer_assignment = False on the AWQ side based on your suggestion, but this appeared to be a red herring in my testing.

davedgd avatar Sep 27 '24 01:09 davedgd

I am also encountering this issue when using dynamic rope scaling, and here is what's happening:

  1. During LlamaAttention.__init__(), the LlamaRotaryEmbedding module is initialized. No device arg is provided: https://github.com/huggingface/transformers/blob/d6ba1ac041ac0b07bc589dd82a67cfb76f75d0f9/src/transformers/models/llama/modeling_llama.py#L304
  2. In LlamaRotaryEmbedding.__init__(), the inv_freq and original_inv_freq tensors are created, and since no device is provided they are placed on the CPU.
  3. During execution, in _dynamic_frequency_update(), if inv_freq is growing, the tensor is recomputed and placed correctly on the specified device (cuda in my case): https://github.com/huggingface/transformers/blob/d6ba1ac041ac0b07bc589dd82a67cfb76f75d0f9/src/transformers/models/llama/modeling_llama.py#L135-L136
  4. However, if it needs to reset, the original_inv_freq is used, which, as mentioned, was placed on the CPU. This causes the two-device error in the forward() call: https://github.com/huggingface/transformers/blob/d6ba1ac041ac0b07bc589dd82a67cfb76f75d0f9/src/transformers/models/llama/modeling_llama.py#L142

This can be fixed with a simple change like this: https://github.com/trevor-m/transformers/commit/1a7e62ab585a2b6ff0feab3256a0946449ee6954
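
To make the failure mode concrete, here is a minimal self-contained sketch of the aliasing pattern described in the steps above; it mimics the buffer/plain-attribute split rather than reproducing the actual transformers code:

    # A registered buffer moves with .to(), but a plain attribute that merely
    # aliases the original tensor stays behind on the CPU.
    import torch
    import torch.nn as nn

    class RopeLike(nn.Module):
        def __init__(self):
            super().__init__()
            inv_freq = torch.arange(4, dtype=torch.float32)
            self.register_buffer("inv_freq", inv_freq, persistent=False)
            # Plain attribute, not a buffer: it keeps pointing at the CPU tensor.
            self.original_inv_freq = self.inv_freq

    module = RopeLike()
    if torch.cuda.is_available():
        module.to("cuda")
        print(module.inv_freq.device)           # cuda:0
        print(module.original_inv_freq.device)  # cpu -> source of the mismatch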

trevor-m avatar Oct 07 '24 23:10 trevor-m

Thanks for the nice report! It seems that there is indeed a device mismatch here. However, one point I don't get is why original_inv_freq would still stay on the CPU if we move the whole model to cuda. Could you share a reproducer? That would be very helpful! If the model is split across different devices thanks to accelerate hooks, we shouldn't have issues. The only problematic case is when the rope module is still on the CPU while the rest of the model is on cuda without accelerate hooks, and in that case a device mismatch is expected.

SunMarc avatar Oct 08 '24 14:10 SunMarc

Let me try to make a small reproducer. I wonder if using register_buffer for the original_inv_freq would allow it to move alongside the whole model when we change devices.
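
As a quick illustration of that register_buffer idea (a sketch of the suggestion, not the actual patch): if original_inv_freq is itself a non-persistent buffer, nn.Module.to() moves it along with the rest of the module.

    import torch
    import torch.nn as nn

    class RopeLikeFixed(nn.Module):
        def __init__(self):
            super().__init__()
            inv_freq = torch.arange(4, dtype=torch.float32)
            self.register_buffer("inv_freq", inv_freq, persistent=False)
            # Registered as a buffer, so it follows the module across devices.
            self.register_buffer("original_inv_freq", inv_freq.clone(), persistent=False)

    module = RopeLikeFixed()
    if torch.cuda.is_available():
        module.to("cuda")
        print(module.original_inv_freq.device)  # cuda:0 -> no mismatch on reset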

trevor-m avatar Oct 10 '24 19:10 trevor-m

Please ignore my earlier comment (I deleted it to avoid confusion) -- it turns out there are multiple issues with AutoAWQ that are complicating my testing of relevant fixes, and I need to more thoroughly evaluate what's going on to figure out the best solution(s).

davedgd avatar Oct 13 '24 20:10 davedgd

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Nov 07 '24 08:11 github-actions[bot]

+1, same issue with qwen2.5-72b-instruct.

ArlanCooper avatar Dec 03 '24 13:12 ArlanCooper

Has this been solved?

ArlanCooper avatar Dec 03 '24 13:12 ArlanCooper

Is it the same as #35505? 🤗

ArthurZucker avatar Jan 13 '25 08:01 ArthurZucker