device_map='auto' doesn't use MPS backend on Apple M2
With the following program:
```python
import os
import time
import readline  # enables line editing for input()
import textwrap

os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
os.environ["HF_ENDPOINT"] = "https://huggingface.co"
os.environ["ACCELERATE_USE_MPS_DEVICE"] = "True"

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
from accelerate import init_empty_weights, load_checkpoint_and_dispatch, Accelerator


def main():
    print('Pytorch version', torch.__version__)
    if torch.backends.mps.is_available():
        active_device = torch.device('mps')
    elif torch.cuda.is_available():
        active_device = torch.device('cuda', 0)
    else:
        active_device = torch.device('cpu')

    accelerator = Accelerator()
    print('Accelerator device: ', accelerator.device)

    checkpoint = "bigscience/bloom"
    tm_start = time.time()
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        device_map="auto",
        offload_folder="offload",
        offload_state_dict=True,
    )
    tm_end = time.time()
    print(f'Loaded in {tm_end - tm_start} seconds.')

    while True:
        prompt = input('Request to LLM: ')

        tm_start = time.time()
        inputs = tokenizer.encode(prompt, return_tensors="pt").to(active_device)
        tm_end = time.time()
        print(f'Encoded in {tm_end - tm_start} seconds.')

        tm_start = time.time()
        outputs = model.generate(
            inputs, max_new_tokens=2048, pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.2)
        tm_end = time.time()
        print(f'Generated in {tm_end - tm_start} seconds.')

        tm_start = time.time()
        response = tokenizer.decode(outputs[0])
        tm_end = time.time()
        print(f'Decoded in {tm_end - tm_start} seconds.')

        print("\n".join(textwrap.wrap(response, width=120)))


if __name__ == '__main__':
    main()
```
the CPU backend is used by transformers/accelerate, even though it prints `Accelerator device: mps`.
I know this because it's slow (below NVMe bandwidth) and the following warning is printed:
```
/Users/serge/PycharmProjects/macLLM/venv/lib/python3.9/site-packages/transformers/generation/utils.py:1359:
UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device.
`input_ids` is on mps, whereas the model is on cpu. You may experience unexpected behaviors or slower generation.
Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cpu')
before running `.generate()`.
  warnings.warn(
```
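For reference, here is a quick way to confirm where Accelerate actually dispatched the weights (a hypothetical diagnostic, not part of the program above, using its `model` variable):

```python
# Hypothetical check: where did the weights end up after device_map="auto"?
print(model.hf_device_map)              # per-module placement, e.g. 'cpu'/'disk' entries rather than 'mps'
print(next(model.parameters()).device)  # device of the first parameter
```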
Environment: transformers v4.26.1, accelerate v0.17.0, PyTorch v1.13.1, macOS 13.2.1 (22D68), Python 3.9.6
MPS devices are indeed not supported with `device_map="auto"` yet. As a workaround you should just move your model to that device manually.
How do I move the model to that device manually? Will I lose CPU and disk offload in that case?
Yes, CPU and disk offload are not supported with the MPS device either for now. To move your model to the MPS device, you just do `model = model.to("mps")`.
Manually moving a model to MPS does not seem to work. Below is a minimal example:
```
Python 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:26:08) [Clang 14.0.6 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.11.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from transformers import T5ForConditionalGeneration, AutoTokenizer

In [2]: tokenizer = AutoTokenizer.from_pretrained('t5-small', model_max_length=512)
   ...: model = T5ForConditionalGeneration.from_pretrained('t5-small', device_map='auto')

In [3]: model.device
Out[3]: device(type='cpu')

In [4]: input_string = 'translate English to German: The house is wonderful."'
   ...: inputs = tokenizer(input_string, return_tensors='pt').input_ids
   ...: outputs = model.generate(inputs, max_length=200)
   ...: print(tokenizer.decode(outputs[0]))
<pad> Das Haus ist wunderbar."</s>

In [5]: model = model.to('mps')

In [6]: model.device
Out[6]: device(type='mps', index=0)

In [7]: inputs = inputs.to('mps')
   ...: outputs = model.generate(inputs, max_length=200)
   ...: print(tokenizer.decode(outputs[0]))
RuntimeError: Placeholder storage has not been allocated on MPS device!
```

Transformers version: 4.27.1, Accelerate version: 0.17.1, Torch version: 2.0.0, macOS 13.2.1 (22D68)
Yes, you need to load it without `device_map="auto"`.
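For example, a minimal sketch of that workaround with the same t5-small checkpoint (no `device_map` at load time, then model and inputs both moved to MPS; this assumes the T5 ops involved are covered by the MPS backend or its CPU fallback):

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

# Load normally (no device_map), then move the whole model to the MPS device.
tokenizer = AutoTokenizer.from_pretrained('t5-small', model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained('t5-small').to('mps')

# Inputs must live on the same device as the model.
inputs = tokenizer('translate English to German: The house is wonderful.',
                   return_tensors='pt').input_ids.to('mps')
outputs = model.generate(inputs, max_length=200)
print(tokenizer.decode(outputs[0]))
```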
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi, I am on an M2 Max (macOS) with 12 CPU cores and 38 GPU cores. I am having issues with every modification of this code snippet. Would you please tell me how I can correct it?

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b-instruct", trust_remote_code=True)
model = model.to('mps')

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    # device = torch.device('mps'),
    # device_map="auto",
)
```
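One apparent problem in the snippet above, independent of MPS, is that the tokenizer is loaded from the model object instead of the checkpoint name. A hedged sketch of a corrected version (assuming the 40B checkpoint actually fits in memory on this machine, and dropping the bfloat16 dtype since bfloat16 support on MPS varies by PyTorch version) might look like:

```python
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "tiiuae/falcon-40b-instruct"

# Load the tokenizer from the checkpoint name, not from the model instance.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)
model = model.to("mps")  # move the whole model to the MPS device

# The model is already on MPS, so no device/device_map argument is passed here.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
```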
I'm also running into this problem.
Any solution yet?
Should the issue at least stay open as a feature request? This would be very nice to have.
This is solved in the latest version of Accelerate (cc @SunMarc).
@sgugger Is this fix included in the latest https://github.com/huggingface/transformers/releases/tag/v4.30.2 release?
It's in Accelerate, not Transformers. It will be in the version of Accelerate released today.
Any solution for this issue? How can we ask the model to use MPS instead of CPU?
Hi @moradisina, since version v0.20.0 of accelerate, the mps device is supported with `device_map="auto"`. It should automatically map your model to the mps device if you are using an M2 chip:
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", device_map="auto")
# should return {"": "mps"}
print(model.hf_device_map)
```
You can also do it manually by setting `device_map={"": "mps"}`:
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", device_map={"": "mps"})
# should return {"": "mps"}
print(model.hf_device_map)
```
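Once the model is mapped to mps, any tensors passed to `generate` also need to be on that device. A minimal usage sketch with the same opt-350m checkpoint (the prompt text is just an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# Put the input ids on the same device the model was dispatched to.
inputs = tokenizer("The house is wonderful.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```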