How to run the model on an MPS device?
Hi @vinbrule, can you try something like this:
from galai import gal
model = gal.load_model("base", num_gpus=0)
model.model.to("mps")
model.generate("The Transformer architecture [START_REF]")
?
Good suggestion, but unfortunately it does not work, due to a PyTorch bug: https://github.com/pytorch/pytorch/issues/77764
File "/opt/homebrew/Caskroom/miniforge/base/envs/galai/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 113, in forward positions = (torch.cumsum(attention_mask, dim=1).type_as(attention_mask) * attention_mask).long() - 1 NotImplementedError: The operator 'aten::cumsum.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable
PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
Thanks @cerkut for testing it and for the stack trace. It seems that the cumsum.out operator is implemented in the PyTorch nightly (https://github.com/pytorch/pytorch/pull/88319). Also, did you try the PYTORCH_ENABLE_MPS_FALLBACK=1 suggestion? The fallback for the cumsum operator shouldn't cause any significant slowdown in this case.
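For reference, the variable is best set before torch is initialized, so something along these lines (a rough sketch reusing the calls from above; I haven't tried it on an M-series machine myself):
import os
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"  # must be set before torch is imported
import galai as gal
model = gal.load_model("base", num_gpus=0)
model.model.to("mps")  # move the underlying HF model to the Apple GPU
model.generate("The Transformer architecture [START_REF]")
Setting PYTORCH_ENABLE_MPS_FALLBACK=1 in the shell before launching Python should work just as well.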
Thanks, it's almost working now. I created an environment with pytorch-nightly (2.0.0.dev20221215), installed galai with pip (git+https), and tried:
import galai as gal # note it is not from galai import gal
model = gal.load_model("base", num_gpus=0)
At this point, when I run the model on CPU, I get the perfect answer:
model.generate("The Transformer architecture [START_REF]")
'The Transformer architecture [START_REF] Attention is All you Need, Vaswani[END_REF] is a popular choice for sequence-to-sequence models. It consists of a stack of encoder and decoder layers, each of which is composed of a multi-head self-attention mechanism and a feed-forward network. The encoder is used to encode the'
But on MPS, after
model.model.to("mps")  # this prints the full model
I get
'The Transformer architecture [START_REF] following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following following'
This might be a torch-nightly bug, but CPU inference does indeed seem faster than MPS here, so I'll stick to the CPU.
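For anyone who wants to reproduce the comparison, something along these lines (just a sketch, not the exact code I ran; the timing helper is only illustrative):
import time
import galai as gal

model = gal.load_model("base", num_gpus=0)
prompt = "The Transformer architecture [START_REF]"

def timed_generate(m, text):
    # wall-clock time for a single generate() call
    start = time.perf_counter()
    out = m.generate(text)
    return out, time.perf_counter() - start

out_cpu, cpu_s = timed_generate(model, prompt)  # CPU run
model.model.to("mps")                           # move to the Apple GPU
out_mps, mps_s = timed_generate(model, prompt)  # MPS run
print(f"CPU: {cpu_s:.1f}s  MPS: {mps_s:.1f}s")
print(out_cpu)
print(out_mps)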