Metal Backend not properly loading large models at 16GB of RAM
Ran the Phi 2 model with the metal feature enabled and it seems to hang, with about 7% GPU usage in Activity Monitor. This seems to be recent, as it ran at an adequate speed with some earlier commit, though I'm not sure which. Also ran Stable Diffusion Turbo and I'm getting these results: the accelerate feature seems to be twice as fast as the metal one, which was not the case before.
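For reference, a typical way to compare the two backends with the candle examples looks something like this (the `phi` example name and `--prompt` flag are assumptions about the current examples and may differ between commits, so treat this as a sketch):

```bash
# Metal backend
cargo run --example phi --release --features metal -- --prompt "Hello"

# Accelerate backend
cargo run --example phi --release --features accelerate -- --prompt "Hello"
```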
+1 for Phi 2 on my MacBook Pro M1:
- CPU: 2.29 token/s
- Metal: 1.46 token/s
I have an M1 Pro 32GB. Metal: 7.30 token/s vs accelerate: 2.28 token/s. Is it still slow for you?
Thanks! Which branch are you referring to? When will it be merged into the main branch?
I'm on the main branch. That's why I'm asking if it is still slow for you :)
I haven't tried for a really long time. I was waiting for this issue to be closed as an indication of MPS usability. Is it ready to be used now?
There is experimental Metal support. We're not using MPS right now - we might add it in the future as a fallback for compatibility reasons at some point. So yes, there is Mac GPU support, and you can definitely play around with it, but don't expect insane speeds yet as we haven't started optimizing. It'll get there though.
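For anyone wanting to play with it from Rust, device selection is the only thing that changes; a minimal sketch, assuming the current candle-core API where `Device::new_metal` creates a Metal device:

```rust
use candle_core::{Device, Tensor};

fn main() -> candle_core::Result<()> {
    // Prefer the experimental Metal backend; fall back to CPU if it fails.
    let device = Device::new_metal(0).unwrap_or(Device::Cpu);

    // Run a trivial matmul to confirm the backend works end to end.
    let a = Tensor::randn(0f32, 1.0, (2, 3), &device)?;
    let b = Tensor::randn(0f32, 1.0, (3, 2), &device)?;
    println!("{}", a.matmul(&b)?);
    Ok(())
}
```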
> I have an M1 Pro 32GB. Metal: 7.30 token/s vs accelerate: 2.28 token/s. Is it still slow for you?
Just pulled the latest commit. I'm getting less than a token/s, if the Phi model manages to fully load at all (sometimes it just hangs), on my base 16-inch M2 MacBook Pro. The accelerate feature still outperforms it on my machine, with 1.38 token/s vs 0.64 on metal when it managed to run. Perhaps there is some large loading of memory going on which favors your 32GB of RAM? Also tried Stable Diffusion Turbo and accelerate is still faster.
Memory could be the issue, but then I would expect your computer to be showing signs of that as you are running the model. Is it?
For comparison, could you run Phi 1 and see if the issue persists?
@bayedieng could you try out https://github.com/huggingface/candle/pull/1523 maybe?
It's possible for metal to be slower if there's not enough memory available to run the model, I think; otherwise it doesn't make a lot of sense.
Potential culprits: 1/ Fences 2/ simd sizes
The PR I linked removes the fences. They were necessary (still are, technically) to avoid bugs where kernels would run in an unexpected order, leading to differences in logits. However, I tried removing them, and the model overall still behaves correctly on all platforms I could test on, with a ~2x speedup on M3, so I went for it.
2/ https://github.com/huggingface/candle/blob/main/candle-metal-kernels/src/lib.rs#L1314 Here is the code that chooses the simd sizes for the actual matmul. The choice of those values has a very high impact on the overall speed of models, and ideally we need to tune them for each machine (M1, M2, M3, and it depends on the RAM size too). However, I'm not sure how to do that generally enough for now. You could still probably play around with the numbers and see how well it performs. Also, we might need a specialized gemv implementation for those A·Bᵀ matmuls (the current matmul is highly optimized for the general case; A·Bᵀ can be optimized further on its own because all those fetches are aligned).
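To illustrate the kind of tuning meant here, a purely hypothetical sketch - this function and these numbers are placeholders, not the actual logic in lib.rs:

```rust
/// Hypothetical tile-size picker for a Metal GEMM dispatch.
/// The real kernels in candle-metal-kernels use different, hardcoded values;
/// these numbers only show the shape of the tuning problem.
fn pick_tile_sizes(m: usize, n: usize, k: usize) -> (u16, u16, u16) {
    if m == 1 {
        // Decode-style "gemv" matmul (one row times a weight matrix):
        // a wide tile along N keeps the aligned weight fetches busy.
        (1, 64, 16)
    } else if m * n * k < 1 << 20 {
        // Small problems: smaller tiles avoid idle threadgroups.
        (16, 16, 8)
    } else {
        // Large problems: bigger tiles amortize memory traffic.
        (32, 32, 16)
    }
}

fn main() {
    // e.g. a decode-time matmul: (1 x 2560) · (2560 x 2560)ᵀ
    let (bm, bn, bk) = pick_tile_sizes(1, 2560, 2560);
    println!("tiles: {bm}x{bn}x{bk}");
}
```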
I get a "quantized not covered" error when running quantized Phi 2. However, it does seem to be an issue with memory, as Phi 1.5 outperforms accelerate as expected. Perhaps some of the values @Narsil mentioned for SIMD changed at some point, making it less optimal for lower memory, because both Phi 2 and Stable Diffusion Turbo ran faster on metal than accelerate in the past.
> because both Phi 2 and Stable Diffusion Turbo ran faster on metal than accelerate in the past.
The SIMD line is this single one: https://github.com/huggingface/candle/blob/main/candle-metal-kernels/src/lib.rs#L1277
Otherwise it's because of MPS usage (which we can't use because it's bugged and doesn't support the arbitrary striding that candle needs to work properly for all models).
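To make "arbitrary striding" concrete, a small sketch with the candle-core API: a transpose is just a stride swap, so the resulting tensor is non-contiguous and the backend has to handle that layout directly.

```rust
use candle_core::{Device, Tensor};

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;
    let t = Tensor::arange(0f32, 6.0, &device)?.reshape((2, 3))?;
    // `t()` swaps the strides without copying data, so the result is a
    // non-contiguous view; kernels must respect its strides (or force a copy).
    let tt = t.t()?;
    println!("contiguous: {}", tt.is_contiguous());
    Ok(())
}
```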
> in the past.
Do you know when, or which commit/branch/version? It might help narrow it down.
I tried the quantized Phi 2 model and the Metal backend outperforms the accelerate framework as intended, so it does indeed seem to be a memory issue. I might have been wrong about an older version being faster, as I just tried Stable Diffusion Turbo at commit hash 85e568027731e58b72fb2798c525a5d8aff65eb8 and accelerate was still faster than metal. Both inference and loading don't seem optimal for models that use more memory, and Phi 2 only manages to fully run inference occasionally when using Metal.
@bayedieng I recently refurbished the buffer allocator for metal, which is now merged in main - would you mind checking if it has improved the issue? :)
I've attempted to run it and I am still dealing with the same issue: the entire system lags and the model still runs inference slowly. The accelerate framework is still outperforming metal.
Ok thanks. Could you try using `cargo-instruments -t Allocations` and share what it looks like? :)
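For reference, the invocation would look roughly like this (assuming the `phi` example as the target; cargo-instruments runs the binary under Xcode's Instruments, but treat the exact flags as a sketch):

```bash
# One-time install (requires Xcode)
cargo install cargo-instruments

# Profile allocations while running the phi example on metal
cargo instruments -t Allocations --example phi --features metal -- --prompt "Hello"
```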