Metal backend not properly loading large models with 16 GB of RAM

Open bayedieng opened this issue 1 year ago • 15 comments

Ran the Phi 2 model with the metal feature enabled and it seems to hang, with about 7% GPU usage in Activity Monitor. This seems to be recent, as it ran at an adequate speed with some earlier commit, though I'm not sure which. I also ran Stable Diffusion Turbo and I'm getting these results:

[Screenshot: Stable Diffusion Turbo results, 2024-01-11 at 5:55:56 PM]

The accelerate feature seems to be twice as fast as the metal one, whereas this was not the case before.

bayedieng avatar Jan 11 '24 17:01 bayedieng

+1 for Phi 2 on my MacBook Pro M1:

  • CPU: 2.29 tokens/s
  • Metal: 1.46 tokens/s

snehmehta avatar Jan 12 '24 20:01 snehmehta

I have an M1 Pro with 32 GB. Metal: 7.30 tokens/s vs accelerate: 2.28 tokens/s. Is it still slow for you?

ivarflakstad avatar Jan 15 '24 07:01 ivarflakstad

Thanks! Which branch are you referring to? When would you be merging it into the main branch?

okpatil4u avatar Jan 15 '24 07:01 okpatil4u

I'm on the main branch. That's why I'm asking if it is still slow for you :)

ivarflakstad avatar Jan 15 '24 08:01 ivarflakstad

I haven't tried it for a really long time. I was waiting for this issue to be closed as an indication of MPS usability. Is it ready to be used now?

okpatil4u avatar Jan 15 '24 09:01 okpatil4u

There is experimental Metal support. We're not using MPS right now; we might add it as a fallback for compatibility reasons at some point. So yes, there is Mac GPU support, and you can definitely play around with it, but don't expect insane speeds yet as we haven't started optimizing. It'll get there though.
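
For anyone who wants to try it, here is a minimal sketch of exercising the Metal backend (assuming a recent candle-core built with the metal feature; Device::new_metal, Tensor::randn, and matmul are candle APIs, the rest is just a toy harness):

```rust
use candle_core::{Device, Tensor};

fn main() -> candle_core::Result<()> {
    // Pick the first Metal GPU; fall back to the CPU if it isn't available.
    let device = Device::new_metal(0).unwrap_or(Device::Cpu);

    // A small matmul to confirm the backend works end to end.
    let a = Tensor::randn(0f32, 1f32, (64, 64), &device)?;
    let b = Tensor::randn(0f32, 1f32, (64, 64), &device)?;
    let c = a.matmul(&b)?;
    println!("ran on {:?}, output shape {:?}", device, c.shape());
    Ok(())
}
```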

ivarflakstad avatar Jan 15 '24 09:01 ivarflakstad

I have an M1 Pro with 32 GB. Metal: 7.30 tokens/s vs accelerate: 2.28 tokens/s. Is it still slow for you?

Just pulled the latest commit. I'm getting less than a token/s, if the Phi model manages to fully load at all (sometimes it just hangs), on my base 16-inch M2 MacBook Pro. The accelerate feature still outperforms it on my machine, with 1.38 tokens/s vs 0.64 on metal when it managed to run. Perhaps there is some large loading of memory going on which favors your 32 GB of RAM? I also tried Stable Diffusion Turbo and accelerate is still faster.

bayedieng avatar Jan 15 '24 10:01 bayedieng

Memory could be the issue, but then I would expect your computer to show signs of that while you are running the model. Is it?

For comparison, could you run phi-1 and see if the issue persists?
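
(For a rough sense of why 16 GB could be tight: resident weight memory is roughly parameter count × bytes per element, before the KV cache and everything else. A back-of-envelope sketch, using the published approximate parameter counts:)

```rust
// Back-of-envelope weight-memory estimate; parameter counts are approximate.
fn main() {
    let models = [("phi-1", 1.3e9_f64), ("phi-1.5", 1.3e9), ("phi-2", 2.7e9)];
    for (name, params) in models {
        let f32_gb = params * 4.0 / 1e9;
        let f16_gb = params * 2.0 / 1e9;
        println!("{name}: ~{f32_gb:.1} GB in f32, ~{f16_gb:.1} GB in f16");
    }
}
```

Phi 2 in f32 is ~10.8 GB of weights alone, which leaves little headroom on a 16 GB unified-memory machine once the OS and KV cache are accounted for; swapping would match the hang/lag symptoms.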

ivarflakstad avatar Jan 15 '24 14:01 ivarflakstad

@bayedieng could you try out : https://github.com/huggingface/candle/pull/1523 maybe ?

It's possible for metal to be slower if there's not enough memory available to run the model, I think; otherwise it doesn't make a lot of sense.

Potential culprits: 1/ fences, 2/ SIMD sizes.

1/ The PR I linked removes the fences. They were necessary (still are, technically) to avoid bugs where kernels would run in an unexpected order, leading to differences in logits. However, I tried removing them, and the model still behaves correctly on all platforms I could test on, with a ~2x speedup on M3, so I went for it.

2/ https://github.com/huggingface/candle/blob/main/candle-metal-kernels/src/lib.rs#L1314 Here is the code that chooses the SIMD sizes for the actual matmul. The choice of those values has a very high impact on the overall speed of models, and ideally we need to tune them for each machine (M1, M2, M3, and it depends on the RAM size too). However, I'm not sure how to do that generally enough for now. You could still play around with the numbers and see how well it performs. We might also need a specialized gemv implementation for those A.B^T matmuls (the current matmul is highly optimized for the general case; A.B^T can be optimized further on its own because all those fetches are aligned).
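
If anyone wants to play with those values, a rough timing harness is sketched below (assumptions: candle-core built with the metal feature; the shape mimics a single decode-step gemv using Phi-2's hidden size of 2560 and intermediate size of 10240; reading the result back to the host is used as a crude device sync):

```rust
use candle_core::{Device, Tensor};
use std::time::Instant;

fn main() -> candle_core::Result<()> {
    let device = Device::new_metal(0)?;
    // Roughly a single decode step's MLP up-projection: (1, k) x (k, n).
    let (k, n) = (2560usize, 10240usize);
    let a = Tensor::randn(0f32, 1f32, (1, k), &device)?;
    let b = Tensor::randn(0f32, 1f32, (k, n), &device)?;

    let iters = 100u32;
    let start = Instant::now();
    for _ in 0..iters {
        let c = a.matmul(&b)?;
        // Copying back to the host forces the queued kernels to complete,
        // so the timing covers actual GPU work, not just enqueueing.
        let _ = c.to_vec2::<f32>()?;
    }
    println!("avg matmul + readback: {:?}", start.elapsed() / iters);
    Ok(())
}
```

Re-running this after tweaking the constants at the linked line should show whether a given machine prefers different SIMD sizes.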

Narsil avatar Jan 16 '24 09:01 Narsil

I get a "quantized not covered" error when running quantized Phi 2. However, it does seem to be an issue with memory, as phi-1.5 outperforms accelerate as expected. Perhaps some of the values @Narsil mentioned for SIMD changed at some point, making it less optimal for lower memory, because both Phi 2 and Stable Diffusion Turbo ran faster on metal than accelerate in the past.

bayedieng avatar Jan 16 '24 22:01 bayedieng

because both Phi 2 and Stable Diffusion Turbo ran faster on metal than accelerate in the past.

The SIMD line is this single one: https://github.com/huggingface/candle/blob/main/candle-metal-kernels/src/lib.rs#L1277

Otherwise it's because of MPS usage (which we can't use because it's buggy and doesn't support the arbitrary striding that candle needs to work properly for all models).

in the past.

Do you know when, or which commit/branch/version? It might help narrow it down.

Narsil avatar Jan 17 '24 09:01 Narsil

I tried the quantized Phi 2 model and the Metal backend outperforms the accelerate framework as intended, so it does indeed seem to be a memory issue. I might have been wrong about an older version being faster, as I have just tried Stable Diffusion Turbo on commit 85e568027731e58b72fb2798c525a5d8aff65eb8 and accelerate was still faster than metal. Both inference and loading don't seem optimal for models that use more memory, and Phi 2 only seems to fully complete inference occasionally when using Metal.

bayedieng avatar Jan 19 '24 14:01 bayedieng

@bayedieng I recently reworked the buffer allocator for metal, which is now merged into main - would you mind checking if it has improved the issue? :)

ivarflakstad avatar Mar 07 '24 16:03 ivarflakstad

I've attempted to run it and I'm still dealing with the same issue: the entire system lags and the model still runs inference slowly. The accelerate framework is still outperforming metal.

bayedieng avatar Mar 10 '24 10:03 bayedieng

Ok, thanks. Could you try using cargo-instruments -t Allocations and share what it looks like? :)

ivarflakstad avatar Mar 10 '24 14:03 ivarflakstad