
How to run the model on an MPS device?

Open vinbrule opened this issue 2 years ago • 5 comments

How to run the model on an MPS device?

vinbrule avatar Nov 17 '22 22:11 vinbrule

Hi @vinbrule, can you try something like this?

from galai import gal
model = gal.load_model("base", num_gpus=0)
model.model.to("mps")
model.generate("The Transformer architecture [START_REF]")
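(Optionally, a quick sanity check before the .to("mps") call; a minimal sketch using torch.backends.mps, which is available since PyTorch 1.12:)

import torch

# Pick MPS only when the backend is actually usable on this machine,
# otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"
model.model.to(device)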

mkardas avatar Dec 09 '22 11:12 mkardas

Good suggestion, but unfortunately it does not work, due to the PyTorch bug https://github.com/pytorch/pytorch/issues/77764:

File "/opt/homebrew/Caskroom/miniforge/base/envs/galai/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 113, in forward
    positions = (torch.cumsum(attention_mask, dim=1).type_as(attention_mask) * attention_mask).long() - 1
NotImplementedError: The operator 'aten::cumsum.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
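For context, here is a minimal CPU-only illustration of the failing line from modeling_opt.py; the batch and mask values below are made up for the example:

import torch

# Left-padded attention mask for a batch of two sequences (1 = token, 0 = padding).
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [0, 0, 1, 1]])

# The line that raises NotImplementedError on MPS: a running count over the mask
# becomes 0-based position ids for the real tokens (padding positions end up at -1).
positions = (torch.cumsum(attention_mask, dim=1).type_as(attention_mask) * attention_mask).long() - 1
print(positions)
# tensor([[ 0,  1,  2,  3],
#         [-1, -1,  0,  1]])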

cerkut avatar Dec 14 '22 17:12 cerkut

Thanks @cerkut for testing it and for the stack trace. It seems that the cumsum.out operator is implemented in the PyTorch nightly (https://github.com/pytorch/pytorch/pull/88319).

mkardas avatar Dec 14 '22 20:12 mkardas

Also, did you try the PYTORCH_ENABLE_MPS_FALLBACK=1 suggestion? The fallback for the cumsum operator shouldn't cause any significant slowdown in this case.
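A minimal sketch of the fallback route, assuming the flag is read when torch is imported (so it has to be set first, or exported in the shell before starting Python):

import os

os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"  # must be set before importing torch

import torch  # unsupported MPS ops such as aten::cumsum.out now fall back to CPU

# ...then load the model and move it to "mps" as in the snippet above.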

mkardas avatar Dec 14 '22 20:12 mkardas

Thanks, it's almost working now. I created an environment with pytorch-nightly (2.0.0.dev20221215), installed galai with pip (git+https), and tried:

import galai as gal  # note it is "import galai as gal", not "from galai import gal"
model = gal.load_model("base", num_gpus=0)

At this point, when I run the model on CPU, I get the perfect answer:

model.generate("The Transformer architecture [START_REF]")

'The Transformer architecture [START_REF] Attention is All you Need, Vaswani[END_REF] is a popular choice for sequence-to-sequence models. It consists of a stack of encoder and decoder layers, each of which is composed of a multi-head self-attention mechanism and a feed-forward network. The encoder is used to encode the'

But on MPS:

model.model.to("mps")  # this prints the full model

and I get:

'The Transformer architecture [START_REF] following following following following following following following following following following ...' (the word "following" just repeats for the rest of the generation)

This might be a torch-nightly bug, but the CPU indeed seems faster than MPS for inference, so I'll stick to CPU.
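(If anyone wants to compare the two devices more systematically, here is a rough timing sketch; it reuses the generate() call from above and assumes a single run is representative:)

import time

def timed_generate(model, device, prompt="The Transformer architecture [START_REF]"):
    # Move the underlying HF model (as in the snippets above) and time one generation.
    model.model.to(device)
    start = time.perf_counter()
    text = model.generate(prompt)
    return time.perf_counter() - start, text

# cpu_seconds, cpu_text = timed_generate(model, "cpu")
# mps_seconds, mps_text = timed_generate(model, "mps")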

cerkut avatar Dec 15 '22 14:12 cerkut