DataCollatorForCompletionOnlyLM not working with MPS
Hi there,
amazing work :)
I just encountered an error while trying to run the library on an Apple M3 Max. Below is an MWE to reproduce it. The example itself doesn't make much sense, but it consistently triggers the error.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling
from trl import DataCollatorForCompletionOnlyLM, SFTConfig, SFTTrainer
import torch
from datasets import Dataset

if __name__ == "__main__":
    torch.set_default_device("mps")
    print(f"MPS is available: {torch.backends.mps.is_available()}.")

    train_dataset = Dataset.from_dict(
        {
            "text": [
                "This is a test~",
                "This is another test~",
            ]
        }
    )
    train_dataset.set_format("torch", device="mps")

    model_name = "microsoft/Phi-3-mini-128k-instruct"
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        device_map="mps",
    )
    model.config.use_cache = False

    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True,
        add_eos=True,
        return_tensor="pt",
        padding=True,
    )
    tokenizer.pad_token = tokenizer.bos_token

    training_arguments = SFTConfig(
        output_dir="output",
        optim="adamw_torch",
        group_by_length=True,
        dataset_text_field="text",
        remove_unused_columns=True,
    )

    response_token_id = tokenizer.convert_tokens_to_ids("~")
    collator = DataCollatorForCompletionOnlyLM(
        response_template=[response_token_id], tokenizer=tokenizer
    )
    # collator = DataCollatorForLanguageModeling(
    #     tokenizer=tokenizer,
    #     mlm=False,
    # )

    trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset,
        tokenizer=tokenizer,
        data_collator=collator,
        args=training_arguments,
    )
    trainer.train()
```
Error:

```
TypeError: can't convert mps:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
```
If you comment out the `DataCollatorForCompletionOnlyLM` and use the `DataCollatorForLanguageModeling` instead, it works.
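In case it helps, here is a standalone sketch that isolates the collator (my guess at where it breaks, not verified against the trl source): collation is fine with the default device left on cpu, while with mps as the global default the padded batch lands on MPS and the response-template search seems to trip over a numpy conversion.

```python
import torch
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", trust_remote_code=True
)
tokenizer.pad_token = tokenizer.bos_token
collator = DataCollatorForCompletionOnlyLM(
    response_template=[tokenizer.convert_tokens_to_ids("~")], tokenizer=tokenizer
)
example = tokenizer("This is a test~")

# With the default device left on cpu, collation goes through fine.
print(collator([example])["labels"])

# With mps as the global default device (as in the MWE above), the padded batch
# ends up on MPS and the collator raises the TypeError reported above.
torch.set_default_device("mps")
print(collator([example])["labels"])  # TypeError: can't convert mps:0 device type tensor to numpy ...
```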
Versions:
transformers==4.44.0
trl==0.9.6
Thank you very much in advance!
Best, Giulia
Hi @giuliabaldini, thank you for reporting this issue. Can you share the output of `transformers-cli env`?

I can't reproduce the error locally. Instead, I get:
```
Traceback (most recent call last):
  File "/Users/quentingallouedec/trl/trl/mps.py", line 22, in <module>
    model = AutoModelForCausalLM.from_pretrained(
  File "/usr/local/Caskroom/miniforge/base/envs/trl/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/Caskroom/miniforge/base/envs/trl/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3677, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/Caskroom/miniforge/base/envs/trl/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4104, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/usr/local/Caskroom/miniforge/base/envs/trl/lib/python3.10/site-packages/transformers/modeling_utils.py", line 852, in _load_state_dict_into_meta_model
    param = param.to(old_param.dtype)
  File "/usr/local/Caskroom/miniforge/base/envs/trl/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
TypeError: Trying to convert BFloat16 to the MPS backend but it does not have support for that dtype.
```
Hi there, thank you for the quick answer!
This is the output:
- `transformers` version: 4.44.0
- Platform: macOS-14.5-arm64-arm-64bit
- Python version: 3.11.0
- Huggingface_hub version: 0.24.5
- Safetensors version: 0.4.4
- Accelerate version: 0.33.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: No
About the error you get: the last line of your traceback says MPS has no bfloat16 support, so what if you change the model loading to this?
```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="mps",
    torch_dtype=torch.float16,
)
```
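Also, in case it's useful, here's a quick probe (just a sketch) to check whether your local MPS build accepts bfloat16 at all:

```python
import torch

# If this raises, loading the model with an explicit torch_dtype=torch.float16
# (as in the snippet above) should sidestep the bfloat16 conversion error.
try:
    torch.zeros(1, dtype=torch.bfloat16, device="mps")
    print("bfloat16 on MPS: supported")
except (TypeError, RuntimeError) as exc:
    print(f"bfloat16 on MPS: not supported ({exc})")
```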
Hi, I just want to clarify the status of this issue:
TRL is primarily optimized for CUDA, and while it may work on MPS, we don't officially support it at this time, nor is it a current priority. What you're encountering could indeed be a bug, but due to the reasons mentioned, it's unlikely to be resolved soon. However, we're leaving this open to signal that TRL is open to contributions that would extend support to MPS.
Hopefully the solution below works for you. When I run the code above, the error I get is `RuntimeError: Placeholder storage has not been allocated on MPS device!`.

For me this was resolved by removing the line `torch.set_default_device("mps")`.

Before that call the default device is cpu. Setting it to mps says "do everything on MPS by default, and I'll specify when to use the CPU", an approach that may not align with how trl is implemented. The OP's error message and mine both make sense in this context; @qgallouedec, yours looks less clearly related.
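Concretely, a minimal sketch of the load step after that change. The only difference from the MWE is that the global default device is never set; whether you also need an explicit dtype depends on whether your MPS build supports bfloat16 (see the float16 suggestion above).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3-mini-128k-instruct"

# No torch.set_default_device("mps"): the model is placed on MPS explicitly,
# and everything else (dataset formatting, collation) stays on the CPU default.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="mps",
    # torch_dtype=torch.float16,  # uncomment if your MPS build rejects bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.bos_token
```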
I'm on an Apple M2 Max, and I hope it works on your M3 as well. I'm using trl==0.11.1; system details below.
- `transformers` version: 4.45.0
- Platform: macOS-14.7-arm64-arm-64bit
- Python version: 3.12.0
- Huggingface_hub version: 0.25.1
- Safetensors version: 0.4.3
- Accelerate version: 0.34.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.6.0.dev20241020 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Phi-3's model files on the Hugging Face Hub also include some sample code with extra info and settings that might help with training.