mlx-examples

Stable Diffusion XL

pbkowalski opened this issue 8 months ago • 4 comments

As per the title, this adds a Stable Diffusion XL class, including:

  1. multiple text encoders (including CLIPTextModelWithProjection)
  2. checking unet up_block_types and down_block_types when generating downsampling and upsampling layers
  3. importing unet transformer_layers_per_block correctly from config.json

The add_embedding layers are not implemented (yet), but the default SDXL model runs fine.
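As a rough illustration of point 3, here is a minimal sketch of reading the per-block transformer depths from a diffusers-style UNet config. The config values below are hardcoded for illustration (they roughly mirror the stock SDXL layout, where a real loader would `json.load` the repo's config.json), and the helper name is hypothetical:

```python
# Illustrative SDXL-style UNet config fields; a real loader would read
# these from config.json in the model repo.
config = {
    "down_block_types": [
        "DownBlock2D",
        "CrossAttnDownBlock2D",
        "CrossAttnDownBlock2D",
    ],
    "transformer_layers_per_block": [1, 2, 10],
}

def layers_per_block(config):
    """Normalize transformer_layers_per_block to one entry per down block."""
    layers = config.get("transformer_layers_per_block", 1)
    n = len(config["down_block_types"])
    # Older SD configs use a single int; SDXL configs may use a per-block list.
    if isinstance(layers, int):
        layers = [layers] * n
    # Blocks without cross-attention get no transformer layers.
    return [
        l if "CrossAttn" in t else 0
        for t, l in zip(config["down_block_types"], layers)
    ]

print(layers_per_block(config))  # [0, 2, 10]
```

The key point is that the field may be either an int (SD 1.x/2.x style) or a list (SDXL style), so the loader has to handle both.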

pbkowalski avatar Dec 20 '23 22:12 pbkowalski

Looks nice! I requested a review from @angeloskath as I'm not that familiar with this example.

awni avatar Dec 21 '23 03:12 awni

@pbkowalski this looks fantastic! Thanks for the contribution. However, from a first experiment something does not seem to work properly yet, so it will take me some time to track down the bug. If you have tracked it down and it works for you, let me know.

As far as more general comments go:

  • Replacing SD with SD-XL is probably not what we want to do, so the changes should not break SD loading and evaluation.
  • The way the example is set up, SD should be the same class regardless of the underlying tokenizer, text model, etc. In this particular case, SD-XL is the same as SD except that it uses one more text model. I think the way to go is to define a text model and tokenizer that perform this job as a single entity and work with the original SD class.
  • Since we initially had only one model I didn't add the option, but there should be a way to choose a model by name in the txt2image script rather than hardcoding a specific one.

Let me know what you think of the above, and thanks again for adding SD-XL. This is gonna be awesome :-D
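The "single entity" idea for the two text encoders could be sketched roughly as below. The class, the stand-in tokenizer/encoders, and the hidden sizes are hypothetical illustrations, not code from mlx-examples (numpy stands in for mlx arrays):

```python
import numpy as np

class DualTextEncoder:
    """Hypothetical wrapper presenting SDXL's two text encoders as a single
    tokenizer+encoder entity, so the original SD class needs no changes."""

    def __init__(self, tokenizer_1, encoder_1, tokenizer_2, encoder_2):
        self.parts = [(tokenizer_1, encoder_1), (tokenizer_2, encoder_2)]

    def __call__(self, text):
        # Run both tokenizer/encoder pairs and concatenate the hidden states
        # along the feature axis, as SDXL does. (The pooled projection from
        # CLIPTextModelWithProjection would also be needed for add_embedding
        # conditioning, omitted here.)
        hidden = [enc(tok(text)) for tok, enc in self.parts]
        return np.concatenate(hidden, axis=-1)

# Toy stand-ins: "tokens" are just the character count, encoders emit zeros.
tok = lambda text: len(text)
enc1 = lambda n: np.zeros((1, n, 768))   # CLIP ViT-L hidden size
enc2 = lambda n: np.zeros((1, n, 1280))  # OpenCLIP bigG hidden size

dual = DualTextEncoder(tok, enc1, tok, enc2)
print(dual("a very happy donkey").shape)  # (1, 19, 2048)
```

With this shape, the rest of the SD pipeline sees one conditioning tensor and stays encoder-agnostic.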

angeloskath avatar Dec 21 '23 06:12 angeloskath

@angeloskath Could you tell me more about the issue you encountered? I have some issues with generation quality, but in general it seems to work, and given the subjectivity it is difficult to tell whether the low quality is an implementation bug or a setting I missed (although I did get a segfault once for no apparent reason). The following is a generation result for "a very happy donkey":

[image: out]

From my experiments (not many), there sometimes seem to be 'grid-like' artefacts (as in the 4th image), which I will look into.

Otherwise I agree with your comments and I will probably implement some of this over the weekend.

I wonder if it would also be worthwhile to fix the way the models are loaded: either offload to a 'model_library' file or have a method for fetching the entire model dict (the values in the current _MODELS dict) just from the Hugging Face model name.
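A lookup keyed by Hugging Face repo name could look roughly like the sketch below. The registry contents, file paths, and helper name are hypothetical (the actual _MODELS dict in the example differs):

```python
# Hypothetical registry keyed by Hugging Face repo name; each entry lists
# the component files a loader would fetch from the hub.
_MODELS = {
    "stabilityai/stable-diffusion-2-1-base": {
        "unet_config": "unet/config.json",
        "unet": "unet/diffusion_pytorch_model.safetensors",
        "text_encoder": "text_encoder/model.safetensors",
    },
    "stabilityai/stable-diffusion-xl-base-1.0": {
        "unet_config": "unet/config.json",
        "unet": "unet/diffusion_pytorch_model.safetensors",
        "text_encoder": "text_encoder/model.safetensors",
        "text_encoder_2": "text_encoder_2/model.safetensors",
    },
}

def model_files(repo: str) -> dict:
    """Look up a model's file map by repo name, with a clear error."""
    try:
        return _MODELS[repo]
    except KeyError:
        raise ValueError(
            f"Unknown model {repo!r}; known models: {sorted(_MODELS)}"
        ) from None
```

A txt2image script could then accept a `--model` argument and pass the repo name straight to the lookup, rather than hardcoding one model.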

pbkowalski avatar Dec 21 '23 13:12 pbkowalski

I am from the diffusers team :)

Happy to review and help if needed. Just a ping away.

sayakpaul avatar Dec 29 '23 04:12 sayakpaul

@pbkowalski quite some time passed before I could take a close look at porting SD-XL. Several things were needed, so I made a new PR. I will close this one, but thanks a lot for starting it!

angeloskath avatar Mar 02 '24 08:03 angeloskath