apple: support for MLX quantized linear in diffusers
Is your feature request related to a problem? Please describe.
As Apple MPS users, it often feels like we're second-class citizens with respect to the latest and greatest optimisations, which only land on other platforms. The biggest gap is probably xformers/bitsandbytes, which remain CUDA-only, but the outcome is more important than the codepath used to get there.
Describe the solution you'd like.
I've discovered Apple has some MLX examples for T2I inference on SDXL and other SD models that allow AoT quantization of the unet and text encoders.
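For concreteness, here's a minimal sketch of the kind of AoT quantization those examples rely on, using a toy module in place of the real unet/text encoders (the module name and sizes are made up for illustration; `mlx.nn.quantize` is the actual MLX call, and it swaps `nn.Linear` layers for `nn.QuantizedLinear` in place):

```python
import mlx.core as mx
import mlx.nn as nn

# Stand-in for a transformer block inside the unet/text encoder.
class TinyBlock(nn.Module):
    def __init__(self, dims: int = 1024):
        super().__init__()
        self.proj_in = nn.Linear(dims, dims * 4)
        self.proj_out = nn.Linear(dims * 4, dims)

    def __call__(self, x: mx.array) -> mx.array:
        return self.proj_out(nn.gelu(self.proj_in(x)))

model = TinyBlock()

# Quantize every Linear ahead of time: 4-bit weights, group size 64
# (MLX's defaults). This replaces the layers with nn.QuantizedLinear
# in place.
nn.quantize(model, group_size=64, bits=4)

y = model(mx.random.normal((1, 1024)))
print(y.shape)  # (1, 1024), now computed through quantized matmuls
```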
Describe alternatives you've considered.
There is metal-flash-attention, but it would require integrating custom Metal kernels, which feels out of scope for Diffusers.
We also have a couple of forks of bitsandbytes that aim to improve portability, but there's nothing actionable yet.
Additional context.
I haven't tried to implement it yet; it would probably require a bit of monkeying around, roughly along the lines of the sketch below.
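To make that concrete, here's a hypothetical bridge sketch of what the torch→MLX plumbing for a single layer might look like. Nothing here is an existing diffusers API; the helper name, layer, and sizes are invented for illustration, though `QuantizedLinear.from_linear` is real MLX:

```python
import torch
import mlx.core as mx
import mlx.nn as nn

def torch_linear_to_mlx_quantized(t_linear: torch.nn.Linear,
                                  group_size: int = 64,
                                  bits: int = 4) -> nn.QuantizedLinear:
    """Hypothetical helper: copy a torch Linear's weights into an MLX
    Linear, then quantize it. Both libraries store weight as
    (out_features, in_features), so the copy is direct."""
    m_linear = nn.Linear(t_linear.in_features, t_linear.out_features,
                         bias=t_linear.bias is not None)
    m_linear.weight = mx.array(t_linear.weight.detach().cpu().numpy())
    if t_linear.bias is not None:
        m_linear.bias = mx.array(t_linear.bias.detach().cpu().numpy())
    return nn.QuantizedLinear.from_linear(m_linear, group_size=group_size,
                                          bits=bits)

# Illustrative sizes only; a real integration would have to walk the
# whole diffusers module tree and swap layers systematically.
t_layer = torch.nn.Linear(768, 3072)
q_layer = torch_linear_to_mlx_quantized(t_layer)
out = q_layer(mx.random.normal((1, 768)))
print(out.shape)  # (1, 3072)
```

Presumably the hard part wouldn't be the per-layer conversion so much as keeping the rest of the pipeline (schedulers, attention, VAE) coherent across two array frameworks.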
cc @pcuenca here
Diffusers MLX back-end? 👀
That would be very cool
Yes, especially with larger models, and with Apple themselves kinda dropping the ball repeatedly over multiple years on PyTorch support. MLX is their bread and butter; it's so good that it should probably even be used by default on MPS instead of PyTorch.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.