Refactor the metal backend to always reuse command encoders/buffers unless a shared memory access is requested

Open tomsanbear opened this issue 1 year ago • 9 comments

This change replaces the pattern where each tensor operation provisions its own command buffer and encoder for every kernel invocation with a pattern where a shared encoder is provided to each kernel to set up its operations.

Prior to this change, we relied on command buffers executing sequentially to order operations and to ensure that the output of one operation was complete before a downstream operation consumed it. We now leverage Metal's resource-based memory barriers to do this more effectively, blocking an operation only on the completion of the operations that produce its input dependencies.
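
For context, here is a rough sketch (not the PR's actual code) of what encoding two dependent kernels on a single compute encoder looks like when ordering is handled by a per-resource barrier rather than by separate command buffers. The pipeline and buffer names are placeholders, and it assumes the metal-rs bindings expose `memory_barrier_with_resources`, the wrapper around Metal's `memoryBarrier(resources:)`:

```rust
use metal::{BufferRef, ComputeCommandEncoderRef, ComputePipelineStateRef, MTLSize, ResourceRef};

// Illustrative only: `op1`, `op2` and the buffers are placeholders, and the
// barrier call assumes metal-rs exposes `memory_barrier_with_resources`
// (the binding for Metal's memoryBarrier(resources:)).
fn encode_two_dependent_ops(
    encoder: &ComputeCommandEncoderRef,
    op1: &ComputePipelineStateRef,
    op2: &ComputePipelineStateRef,
    input: &BufferRef,
    intermediate: &BufferRef,
    output: &BufferRef,
) {
    let grid = MTLSize { width: 1024, height: 1, depth: 1 };
    let group = MTLSize { width: 64, height: 1, depth: 1 };

    // First kernel reads `input` and writes `intermediate`.
    encoder.set_compute_pipeline_state(op1);
    encoder.set_buffer(0, Some(input), 0);
    encoder.set_buffer(1, Some(intermediate), 0);
    encoder.dispatch_thread_groups(grid, group);

    // Barrier only on the buffer the next kernel depends on, so unrelated work
    // encoded on the same encoder is still free to overlap.
    let deps: [&ResourceRef; 1] = [intermediate];
    encoder.memory_barrier_with_resources(&deps);

    // Second kernel consumes `intermediate` and writes `output`.
    encoder.set_compute_pipeline_state(op2);
    encoder.set_buffer(0, Some(intermediate), 0);
    encoder.set_buffer(1, Some(output), 0);
    encoder.dispatch_thread_groups(grid, group);
}
```

Because the barrier only orders work around the listed resources, dispatches that don't touch them remain free to overlap on the same encoder.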

What are the outcomes in terms of performance for this change? In summary, not much: in theory I would have expected a minor gain from the improved parallelization and the lower overhead of using only a few command buffers and encoders, but in practice we do not see a change in performance on the example models.

Although we don't see a change in performance on our models, we do get a much more stable gputrace output from this change. On main, generating a gputrace would often cause OOM errors due to the number of extra recorded command buffers/encoders; we now get a much more streamlined recording of the models, and the memory required to load these traces has been cut down drastically.

The major changes to note here are the following:

  • command buffers are lazily initialized on calls to MetalDevice::command_buffer
  • command encoders are lazily initialized on calls to MetalDevice::command_encoder
  • a new RwLock is introduced for tracking the command encoder (a simplified sketch of this lazy-initialization pattern follows this list)
  • the device struct is modified to make the command encoder and buffer fields private, ensuring users always go through the helper methods to access them
  • the method for copying data to the CPU has been adjusted to handle closing the current compute encoder (and initializing a fresh one afterwards)
  • reduced usage of the "wait_until_completed" instruction for copy operations; we now only rely on it when resources must be synchronized between CPU and GPU
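
As a rough illustration of the lazy-initialization points above, here is a simplified sketch built on the metal-rs crate (hypothetical names, not the PR's actual code): the buffer and encoder live behind RwLocks, are created on first use, and are only ended, committed, and waited on when data has to be read back on the CPU.

```rust
use std::sync::{Arc, RwLock};

use metal::{CommandBuffer, CommandQueue, ComputeCommandEncoder};

// Hypothetical, simplified device state; not the PR's actual struct.
struct MetalDeviceSketch {
    command_queue: CommandQueue,
    // Kept private so callers must go through the helper methods below.
    command_buffer: Arc<RwLock<Option<CommandBuffer>>>,
    command_encoder: Arc<RwLock<Option<ComputeCommandEncoder>>>,
}

impl MetalDeviceSketch {
    /// Lazily create (and then reuse) the current command buffer.
    fn command_buffer(&self) -> CommandBuffer {
        let mut guard = self.command_buffer.write().unwrap();
        guard
            .get_or_insert_with(|| self.command_queue.new_command_buffer().to_owned())
            .to_owned()
    }

    /// Lazily create (and then reuse) a compute encoder on the current buffer.
    fn command_encoder(&self) -> ComputeCommandEncoder {
        let buffer = self.command_buffer();
        let mut guard = self.command_encoder.write().unwrap();
        guard
            .get_or_insert_with(|| buffer.new_compute_command_encoder().to_owned())
            .to_owned()
    }

    /// Copy-to-CPU path: close the open encoder, commit the buffer, and only
    /// here wait for completion so CPU reads observe finished GPU work.
    fn synchronize_for_cpu_read(&self) {
        if let Some(encoder) = self.command_encoder.write().unwrap().take() {
            encoder.end_encoding();
        }
        if let Some(buffer) = self.command_buffer.write().unwrap().take() {
            buffer.commit();
            buffer.wait_until_completed();
        }
    }
}
```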

tomsanbear · Apr 10 '24 17:04

Hoping this branch will address some portion of https://github.com/huggingface/candle/issues/1939

tomsanbear · Apr 10 '24 17:04

Okay, I'm happy with the PR in its current state!

@ivarflakstad it'd be great to get your eyes on this change as well!

Feel free to reach out on discord if you want to go through parts of the change together and avoid back and forth 👍

tomsanbear · Apr 12 '24 15:04

@LaurentMazare I did some cleanup of the areas commented above, lmk if you have any other ideas on improving docs

tomsanbear · Apr 13 '24 14:04

Just an update: I've run through most examples and they work fine. The exception is stable diffusion (wuerstchen is fine though...), which outputs garbage images compared to main. I'll be looking into this and will tag when I find the root cause; it's not immediately obvious so far.

tomsanbear · Apr 13 '24 20:04

Gave it a quick look and it indeed looks pretty weird. When enabling intermediary-images (to make it use the vae earlier) and comparing with the base version via some print statements, I saw the first difference on the last layer of the vae, but I also got a run where it didn't seem to happen. So it may well be some form of race condition, not sure if it's between kernels or when fetching the data back from the gpu. Also worth pointing out that my mac gets very low on memory at that point (with 16GB), but I would expect an actual error rather than data corruption if that were the issue. Two other things that may be worth trying if easy: disabling the buffer reuse and using freshly allocated buffers each time, and limiting the maximum number of encoded ops in the buffer.
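
For the second suggestion, here is a minimal sketch of capping the number of ops encoded into a single command buffer (hypothetical names and threshold, not candle's actual implementation): after each encoded kernel the counter is bumped, and once the budget is spent the encoder is closed and the buffer committed without blocking the CPU, so the next operation lazily starts a fresh buffer.

```rust
use std::sync::Mutex;

use metal::{CommandBuffer, ComputeCommandEncoder};

// Arbitrary illustrative threshold; a real implementation might make it configurable.
const MAX_OPS_PER_BUFFER: usize = 50;

// Hypothetical bookkeeping for the lazily created buffer/encoder pair.
struct EncoderState {
    buffer: Option<CommandBuffer>,
    encoder: Option<ComputeCommandEncoder>,
    ops_in_buffer: usize,
}

/// Call after each encoded kernel: once the per-buffer budget is spent, close
/// the encoder and commit the buffer without waiting on the CPU side, so the
/// next operation lazily starts a fresh buffer and encoder.
fn maybe_rotate_buffer(state: &Mutex<EncoderState>) {
    let mut s = state.lock().unwrap();
    s.ops_in_buffer += 1;
    if s.ops_in_buffer >= MAX_OPS_PER_BUFFER {
        if let Some(encoder) = s.encoder.take() {
            encoder.end_encoding();
        }
        if let Some(buffer) = s.buffer.take() {
            buffer.commit();
        }
        s.ops_in_buffer = 0;
    }
}
```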

LaurentMazare · Apr 14 '24 07:04

Any luck with this? It would be great to have this to analyze performance on metal, so I'm pretty keen to get it in if it still works with stable diffusion etc.

LaurentMazare · Apr 28 '24 18:04

Unfortunately I haven't had the bandwidth the last two weeks; I should have some more time coming up that I can spend investigating. I went back through the implementation trying to hunt down where the change in behaviour was introduced, but so far nothing seems apparent, so now I'm running through the operations performed in the stable diffusion model to understand what's going on.

tomsanbear · Apr 28 '24 19:04

Updated the branch here against main so we don't drift too far out of sync.

tomsanbear · Apr 28 '24 19:04

Just an update @LaurentMazare

I did finally trace this down to something odd going on in the memory of certain tensors. At some point the buffer for the "encoder_hidden_states" gets overwritten with zeros, and this also seems to be happening with other buffers, hence the generated images coming out black right now. I'm not sure where this is coming from; I initially thought it might be the buffer reuse, but after disabling that the issue is still reproducible.

As for next steps, I'm not quite sure. If anyone has a clue why this might occur for the stable diffusion example but not for any other example, I would love to know; I'm just not familiar enough to have a clear "aha" moment on this yet.

tomsanbear · May 06 '24 15:05