[Feature Request] (Long Term): Detailed documentation of stable-diffusion.cpp and its features

Open · MrSnichovitch opened this issue 2 months ago • 22 comments

At some point when the addition of new features/model support and addressing bugs are at a lull, it would be of great benefit to this project and its users if the documentation got an overhaul.

Best case scenario would be to "ELI5" it; basically providing in-depth instructions and definitions of how stable-diffusion.cpp can be used with each supported model and all their features that would teach the "lowest common denominator" (of which I am a card-carrying member... I can be as dense as a bag of hammers sometimes) how to use the software. Believe me, I have ~15 years of aggregate experience writing technical process documentation, so I know how tedious it can be, but getting it done means productive users and fewer issues being filed. I'd love to do this for you, but I don't know anywhere near enough about any of this stuff--neither the programming aspect nor the technical functions of generative models--to be of any real use, otherwise I'd be bombarding you with .md files instead of posting this long-winded whatnot.

Please do not consider this as any sort of complaint about stable-diffusion.cpp. I really like this software and will continue using it as long as it's available. The reason behind this write up is because I don't have any real understanding of its full capabilities and would like to know what can be done with it. Some problems I'm having may even be due to unreported bugs, but I don't have enough information to make that determination.

A major peeve I have with a lot of indie open-source projects is their lack of documentation, or worse, documentation that assumes the end-users of their software already know enough about the intended function to not require instructions. The second scenario is pretty common, especially when it comes to generative AI-related projects. When it comes to image gen usage in general, the most common instance of this is people saying, "Just import my workflow," expecting the end-user to be running ComfyUI, which is of no use to anyone who either can't or won't run ComfyUI for whatever reason. Of course, even those who do run Comfy can run into problems when faced with a "switchboard" full of presets: they want to make a change, but have no idea what the rat's nest of interconnections before them does without a lot of reckless experimentation and frustrating time in a search engine. The worst part is that, due to the rise of AI-generated tutorial websites and "hallucinatory" answers coming from ChatGPT and the like, finding good answers to a complete newbie's questions is that much more difficult.

Since stable-diffusion.cpp could be considered the antithesis of ComfyUI and doesn't have anywhere near the adoption rate, much more robust documentation geared toward those who are completely new to GenAI is really necessary, even if one considers this software to be "simple" by comparison to any other.

Arch Linux provides a prime example of the right way to go about software documentation. The entire philosophy of that particular OS is to "keep it simple," and yet they have the most in-depth and useful wiki of any Linux distro out there today. And that wiki is critical for them, because "simple" doesn't mean "easy." There's an expectation that Arch users need to learn how to use it properly, and they provide all the resources necessary for such self-education. (And no, I don't use Arch, BTW.)

As far as this project goes, the addition of the WAN video set is an example of where documentation is sorely lacking. Considering that WAN is the newest addition and pretty complicated in its own right, the lack of details is understandable, but plans to correct that should be part of the overall project.

While the example scripts by model are a good starting point for some folks, they don't explain what any of the settings do, whether or not they're necessary, or what other options might be of benefit. There's also no indication of resource usage per model, which would have been good to know before diving in. E.g., the standard "a lovely cat" T2V example for Wan2.2 TI2V 5B (specifically, Wan2.2-TI2V-5B-Q8_0.gguf was tested in my case) at 832x480 has a VAE compute buffer size of >20 GB(!). --vae-tiling doesn't appear to be an option for WAN, at least it ignored the option when set, which sort of makes sense considering it's producing sequential frame data for video and not a single static image, so --vae-on-cpu is necessary for my 16 GB GPU. It takes ~30 minutes to output a 2 second video because of this, so longer videos are out of the question if I want to use my PC for anything else.
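
For reference, here's a sketch of the kind of invocation described above (file names/paths are placeholders from my setup; I've omitted any video-mode/frame-count flags, so check sd --help for those on your build):

    # Sketch: Wan2.2 TI2V 5B "a lovely cat" T2V at 832x480; file names are placeholders.
    # Video-mode/frame-count flags are omitted here; check sd --help for your build.
    ./sd --diffusion-model models/Wan2.2-TI2V-5B-Q8_0.gguf \
         --vae models/wan2.2_vae.safetensors \
         --t5xxl models/umt5-xxl-encoder-Q8_0.gguf \
         -p "a lovely cat" \
         -W 832 -H 480 \
         --vae-on-cpu    # keeps the >20 GB VAE compute buffer off a 16 GB GPU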

Just to further illustrate the point, here's a few example questions I personally have about WAN:

  1. In general, which stable-diffusion.cpp options are relevant to WAN, and which ones are irrelevant? Which ones are WAN-specific and which ones are shared with other models?
    • E.g., as mentioned, --vae-tiling doesn't appear to work. What other settings are ignored?
  2. What back-ends support WAN?
    • E.g., Vulkan appears to be pretty weak for WAN 2.1 (requires both --clip-on-cpu and --vae-on-cpu) and doesn't seem to work at all for WAN 2.2. (Could be something similar to the problems with Vulkan in #851). ROCm works for both 2.1/2.2, but requires --clip-on-cpu to avoid black output videos.
  3. What does each of the WAN-relevant options do, exactly?
    • E.g., what does --flow-shift control? What effect does setting a value have, what's the range, and what happens when you adjust the value higher or lower?
    • What about --vace-strength for the 2.1 VACE models? Links to external explainers/tutorials that don't leave the end-user scratching their head in confusion would be beneficial here.
  4. What output sizes does each WAN model support?
    • E.g., on WAN 2.1, I can only seem to get viable results with 416x240 or 832x480. Any other resolution in between produces gibberish video or throws errors. 416x240 requires high --steps (50 on average), though, to avoid garbage output (body horror).
    • On WAN 2.2 TI2V 5B, 832x480 works (if I'm willing to wait 30 minutes for the VAE step to trudge along on my Ryzen 7 3700X). 624x360 almost works, at least it produces recognizable images in the output video, but requires high --steps like WAN 2.1. Any other resolution in between produces junk.

Of course, the lack of documentation isn't limited to WAN.

Chroma appears to have its own set of command options that aren't explained anywhere...

  --chroma-disable-dit-mask          disable dit mask for chroma
  --chroma-enable-t5-mask            enable t5 mask for chroma
  --chroma-t5-mask-pad  PAD_SIZE     t5 mask pad size of chroma

...but I've discovered that using both --guidance 0 and --chroma-disable-dit-mask when generating with Vulkan greatly reduces instances of body horror and increases prompt adherence. --chroma-disable-dit-mask is present in the example script for the model (not that I know what a DiT mask is or why turning it off is important), but I came across --guidance 0 in the discussion chain of PR #696 and, on my own, found adding it to be beneficial. Could be an issue of correlation not equaling causation, but if it appears to work, I'm using it.
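
For illustration, those flags just drop into a normal invocation like this (a sketch; the model/encoder/VAE file names are placeholders for whatever Chroma files you already use):

    # Sketch of a Vulkan Chroma run with the two flags discussed above;
    # file names are placeholders.
    ./sd --diffusion-model Chroma1-HD-Flash-Q8_0.gguf \
         --t5xxl t5xxl_q8_0.gguf \
         --vae ae.safetensors \
         -p "your prompt here" \
         --guidance 0 --chroma-disable-dit-mask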

Control Net: Outside of "Control Net support with SD 1.5" in the Features description, there's really no further mention of it. As far as I'm aware, one should be able to use Control Net masks with SDXL, Chroma and other models, but no guidance is provided on what files need to be downloaded or how to set its options.

Okay... I'm sure if you've read through all this by now, you get the point. Thanks for sparing the time.

MrSnichovitch · Sep 29 '25

On CUDA I can't see a difference with --guidance 0. It should be set to 0 in code anyway. https://github.com/leejet/stable-diffusion.cpp/blob/35843c77ea57d16d26ef0b61780c79bcb19c6040/flux.hpp#L976

Green-Sky · Oct 02 '25

In practice, it appears to default to 3.5 if unset. This is the sample_params line from the log output of a recent Chroma run (QuantStack version of Chroma1-HD-Flash-Q8_0.gguf from huggingface). Note the distilled_guidance value:

    sample_params:                     (txt_cfg: 1.00, img_cfg: 1.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: smoothstep, sample_method: euler, sample_steps: 8, eta: 0.00, shifted_timestep: 0)

...so this may be a bug, unless that default value is getting silently changed to 0 somewhere in the process.

Which kinda hints at the overall gist of this FR. It's not in the docs, so I don't know.

MrSnichovitch · Oct 02 '25

I can confirm that disabling the line @Green-Sky pointed to results in bad (pure noise) images, unless --guidance is kept very low (under 0.001 or so).

And it looks like it's really not supposed to be supported in Chroma: huggingface/diffusers#12359.

Image metadata is still being set, though, since that's on an upper code layer. Maybe we should have a way to query the library about the actual parameters used for each generation.

wbruna · Oct 02 '25

So it looks like my use of --guidance 0 had a placebo effect. This is very good to know, and makes for a good point to demo the type of info additions that would be useful in Chroma.md:


Usage Tips:

  • Although Chroma is based on the FLUX.1 architecture, it is its own unique foundational model that functions differently than FLUX.1. Image generation with Chroma is known to be slower than FLUX.1 models, so longer generation times with Chroma vs. FLUX.1 are normal.
  • Unlike FLUX.1, Chroma makes use of the T5XXL text encoder only. There is no need to set the --clip_l flag/file path when using Chroma.
  • Unlike FLUX.1, Chroma does not use a "distilled guidance" scale. stable-diffusion.cpp silently sets the sample parameter for distilled_guidance to zero during use of Chroma models, and any related setting of the --guidance flag is ignored.
  • Certain backends may require the use of --clip-on-cpu and/or --chroma-disable-dit-mask flags to avoid black output images.
    • NOTE: As of the release of master-320-1c32fa0, the use of --clip-on-cpu should no longer be necessary to avoid black output.
  • Users with low-VRAM GPUs may require the use of --diffusion-fa, --vae-conv-direct, and/or --vae-tiling to lower VRAM usage. These may also be necessary if increasing the output resolution of a generated image.
    • Note: Some backends may require --diffusion-fa to be run with --chroma-disable-dit-mask also set to avoid black output images. This is known to be true when using the ROCm backend.
  • Chroma models work best with --scheduler set to either simple or smoothstep. Note that smoothstep can be considered a substitute for the "beta" scheduler, which has not been added to stable-diffusion.cpp (...yet #811 ).
  • Chroma-Flash models -- fine-tuned versions of the Chroma Base model designed for speed -- are recommended to be run with --cfg-scale 1 --steps 8 --scheduler smoothstep. Increasing the --steps value can be used to improve output quality/prompt adherence.
    • --sampling-method euler is commonly used with Chroma models, but experimenting with other samplers may yield better outputs.
    • Chroma-Flash models noted as having been trained with a specific sampler should use that same sampler for the best quality output. E.g., Silveroxides' v47-flash-heun model should be run with --sampling-method heun.
  • If using a Chroma-Flash model, the --cfg-scale 1 setting will result in the negative prompt text being ignored. Negative prompt usage can be restored by setting the value to any number >1. e.g., --cfg-scale 1.01 will trigger negative prompt use with no negative CFG impact on the image generation, but doing so will double the process time for both the text encoding and latent image generation steps.
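
To tie the tips above together, a baseline Chroma-Flash invocation might look like this sketch (file names are placeholders; add the backend-specific flags from the notes above as your setup requires):

    # Baseline Chroma-Flash sketch per the Usage Tips; file names are placeholders.
    # Add --chroma-disable-dit-mask (and/or --clip-on-cpu) if your backend
    # produces black output.
    ./sd --diffusion-model Chroma1-HD-Flash-Q8_0.gguf \
         --t5xxl t5xxl_q8_0.gguf \
         --vae ae.safetensors \
         -p "a lovely cat" \
         --cfg-scale 1 --steps 8 \
         --scheduler smoothstep --sampling-method euler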

Please check my work. Did I get anything completely wrong here?

Edit: Added a couple of extra points that may not be common knowledge.
Edit 2: Listed more options for low-VRAM users.
Edit... uh, n?: Simplified the low-VRAM whatnot, clarified the Chroma-Flash sampler whatnot + other sundries.

MrSnichovitch · Oct 03 '25

  • Users with low-VRAM GPUs may require the use of the --vae-tiling flag to lower VRAM usage during the latent decoding (VAE) stage. This flag may also be necessary if increasing the output resolution of a generated image.

Prefer using --vae-conv-direct to reduce memory usage. Depending on the backend, it can also speed things up (or make it much slower, e.g., CUDA). It can also be combined with --vae-tiling, but you're probably reaching other memory limits before you have to resort to that.
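
E.g., just the relevant flags (a sketch; <your usual arguments> stands in for the model/prompt flags):

    # conv-direct VAE decode plus tiling; everything else as usual.
    ./sd <your usual arguments> --vae-conv-direct --vae-tiling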

Green-Sky · Oct 03 '25

Nice! Forgot about that since it seems to slow Vulkan down too, so I don't use it.

I've updated the "low-VRAM" info above to be more robust and included --diffusion-fa as well. How's that grab you?

MrSnichovitch · Oct 03 '25

This option might crash if not supported by the backend.

Right, this is no longer true, as it now checks if it is supported. Is the help text outdated?

Also flash attention speeds things up, depending on backend.

Nice! Forgot about that since it seems to slow Vulkan down too, so I don't use it.

Ah right, on my NVIDIA card, Vulkan with cm2 is actually faster than the current non-conv-direct CUDA version...

Green-Sky · Oct 03 '25

Right, this is no longer true, as it now checks if it is supported. Is the help text outdated?

Yes, it is. Do the support checks run for both --diffusion-fa and --vae-conv-direct?

Also flash attention speeds things up, depending on backend.

Noted. Thank you. Will update.

Oh, yeah... Can you explain --chroma-enable-t5-mask and --chroma-t5-mask-pad PAD_SIZE? If they're relevant options, I'd like to describe them here too. Do they work if --chroma-disable-dit-mask is set, or not?

MrSnichovitch · Oct 03 '25

Right, this is no longer true, as it now checks if it is supported. Is the help text outdated?

Yes, it is. Do the support checks run for both --diffusion-fa and --vae-conv-direct?

No, conv-direct has no support check. But I think every major backend has it now? (vulkan, opencl and slow cuda)

Oh, yeah... Can you explain --chroma-enable-t5-mask and --chroma-t5-mask-pad PAD_SIZE? If they're relevant options, I'd like to describe them here too.

pad tells the ... diffusion model(?) how many pad tokens to attend to in the attention part. Functionally this is similar to an attention sink, or the <bos> hack Facebook discovered with LLaMA(?). The value chroma uses is 1, which is also what we default to. This is different than what flux does (0).

@stduhpf I'm not sure the pad count is actually for t5, is it not for the dit mask? or was it for both? (it clearly affects the dit mask)

Do they work if --chroma-disable-dit-mask is set or ?

They should be separate, and as you can see by the naming, one is enabled by default, the other not.

Green-Sky · Oct 03 '25

No, conv-direct has no support check. But I think every major backend has it now? (vulkan, opencl and slow cuda)

Okay. I'm leaning toward leaving the crash warning in for --vae-conv-direct just to cover the edge-case, but will remove it from --diffusion-fa

Oh, yeah... Can you explain --chroma-enable-t5-mask and --chroma-t5-mask-pad PAD_SIZE? If they're relevant options, I'd like to describe them here too.

pad tells the ... diffusion model(?) how many pad tokens to attend to in the attention part. Functionally this is similar to an attention sink, or the <bos> hack facebook discovered with llama(?). The value chroma uses is 1, which is also what we default to. This is different than what flux does (0).

I have zero clue as to what an attention sink is. You wouldn't happen to have a link to a reliable source that would explain this, or specifically the t5-mask, to a layman, would you? (Remember... I can be as dense as a bag of hammers sometimes.)

MrSnichovitch · Oct 03 '25

Most of the information about cfg-scale and negatives is not specific to Chroma; perhaps a better place would be a 'general guidelines' file or section?

The cfg-scale info applies to any model that uses CFG, like SD1.5 and SDXL. And many models distilled with other techniques (LCM, PCM, Hyper, Lightning, DMD2, Turbo...) behave the same way as Chroma-Flash: requiring low or minimal cfg-scale, and thus ignoring the negative prompt. It's just that low cfg-scale typically only gives good results for distilled models, and negatives are still pretty much ignored with low (but above one) cfg-scale values, although the higher value can affect the image in other ways.

You may want to mention that 8-step Heun takes as long as 15-step Euler. Again, not specific to Chroma: in general, n-step second-order samplers (Heun, DPM2) take as long as (2n-1)-step first-order ones (Euler, DDIM, DPM++2M,...), so any speed/quality comparison should take that into account.

wbruna · Oct 03 '25

Forgot to mention: for Chroma, diffusion-fa doesn't work without chroma-disable-dit-mask. I don't know if there's a backend that always requires chroma-disable-dit-mask (plain Vulkan works fine for me; ROCm needs clip-on-cpu, but that's a T5 issue, not specific to Chroma).

wbruna · Oct 03 '25

Tackling this one first:

Forgot to mention: for Chroma, diffusion-fa doesn't work without chroma-disable-dit-mask. I don't know if there's a backend that always requires chroma-disable-dit-mask (plain Vulkan works fine for me; ROCm needs clip-on-cpu, but that's a T5 issue, not specific to Chroma).

Yes, ROCm requires --chroma-disable-dit-mask to be set when using --diffusion-fa. I completely forgot about this, since I primarily use Vulkan for Chroma, but I had --diffusion-fa set as an option when I copied the base command from a WAN test run over to a Chroma run and discovered my output was black.
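
In other words, on ROCm these two travel together (a sketch; <your usual Chroma arguments> stands in for the rest):

    # ROCm: flash attention without disabling the dit mask gave me black output.
    ./sd <your usual Chroma arguments> --diffusion-fa --chroma-disable-dit-mask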

Most of the information about cfg-scale and negatives is not specific to Chroma; perhaps a better place would be a 'general guidelines' file or section?

A "general guidelines" .md is definitely a goal, but if there's one important thing I've learned about process documentation, it's that end-users are going to focus in on the instructions for the feature(s) they want to use, and have a tendency to avoid cross-links to other docs as either confusing or annoying. Redundancy is part of the tedium of the work, but it pays off in the long run by reducing the number of questions already answered, just on another page they didn't see. Basic human nature can be a PITA. ;-)

In my mind, "general guidelines" would be a much more detailed version of the current help text + some basics about downloading and storing models, just so the command line text and parameter flags get a solid, fundamental explanation. From there, each model page would get its own "specific" guidelines/tips like I'm doing with Chroma here.

You may want to mention that 8-step Heun takes as long as 15-step Euler. Again, not specific to Chroma: in general, n-step second-order samplers (Heun, DPM2) take as long as (2n-1)-step first-order ones (Euler, DDIM, DPM++2M,...), so any speed/quality comparison should take that into account.

Let me see if I understand this... heun, as an "n-step second-order" sampler type at --steps 8, takes the same amount of time to process as a "first-order" sampler like euler at --steps 15 (because heun effectively does (2n-1) steps). Does this second-order sampling roughly translate to the output quality of --sampling-method heun --steps 8 being close or equivalent to --sampling-method euler --steps 15, or am I on drugs? I know that dpm++2m at --steps 8 with Chroma1-HD-Flash-Q8_0.gguf looks like garbage compared to both heun and euler, so I don't use it.

MrSnichovitch · Oct 03 '25

I have zero clue as to what an attention sink is. You wouldn't happen to have a link to a reliable source that would explain this, or specifically the t5-mask, to a layman, would you? (Remember... I can be as dense as a bag of hammers sometimes.)

Chroma readme: https://huggingface.co/lodestones/Chroma#mmdit-masking

https://www.evanmiller.org/attention-is-off-by-one.html and other variants (like what openai released with their recent "oss"-gpt) are all doing similar things.

Green-Sky · Oct 03 '25

Let me see if I understand this... heun, as an "n-step second-order" sampler type at --steps 8, takes the same amount of time to process as a "first-order" sampler like euler at --steps 15 (because heun effectively does (2n-1) steps).

Yes; they call the model (which is the expensive part) twice for each step (except for the last one).
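
In model calls, that works out to calls(n) = 2n - 1: Heun at n = 8 steps makes 15 calls, the same cost as 15 steps of Euler.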

Does this second-order sampling roughly translate to the output quality of --sampling-method heun --steps 8 being close or equivalent to --sampling-method euler --steps 15, or am I on drugs? I know that dpm++2m at --steps 8 with Chroma1-HD-Flash-Q8_0.gguf looks like garbage compared to both heun and euler, so I don't use it.

It is a bit of a general trend, but far from a rule. That heavily depends on each model+sampler combination. Even "more steps -> more quality" isn't always true.

But my main point was: when comparing quality for different samplers, you need to consider resource usage, and although that is proportional to the number of steps, it isn't the same thing. If e.g. someone says "with this model, Heun produces a good image in just 8 steps, while Euler requires 13 for the same quality", it sounds like Heun is better, but Euler reaches that same quality in ~15% less time 🙂

And a model trained for 8-step Heun (like is the case for some Chroma releases) could produce with it a quality level that Euler wouldn't reach in any number of steps. It... depends.

end-users are going to focus in on the instructions for the feature(s) they want to use, and have a tendency to avoid cross-links to other docs as either confusing or annoying

That's not an issue with the availability of detailed documentation, is it? 😉 And I'm sure some people could consider the current level of detail as annoying already...

Remember those docs need to be maintained too: four copies of the same info across four different model families will be incomplete and/or out-of-sync. Perhaps it'd be best not to dwell too much on each model page: mention straight to the point what is recommended, a few workarounds, and leave a link for another page or section explaining why is it like this, comparing and contrasting models, more detailed descriptions, etc.

wbruna · Oct 04 '25

It is a bit of a general trend, but far from a rule. That heavily depends on each model+sampler combination. Even "more steps -> more quality" isn't always true.

But my main point was: when comparing quality for different samplers, you need to consider resource usage, and although that is proportional to the number of steps, it isn't the same thing. If e.g. someone says "with this model, Heun produces a good image in just 8 steps, while Euler requires 13 for the same quality", it sounds like Heun is better, but Euler reaches that same quality in ~15% less time 🙂

And a model trained for 8-step Heun (like is the case for some Chroma releases) could produce with it a quality level that Euler wouldn't reach in any number of steps. It... depends.

Excellent! This is exactly the kind of information I was completely oblivious to, and you've explained it beautifully. Thank you!

Now if I could just get my head around T5 masking... Right now I'm running 8 image batches with the --chroma-t5-mask-pad value doubling every run. Just started a batch with a value of 64, and for the life of me, I'm not seeing any reason for these options to exist at all.

end-users are going to focus in on the instructions for the feature(s) they want to use, and have a tendency to avoid cross-links to other docs as either confusing or annoying

That's not an issue with the availability of detailed documentation, is it? 😉 And I'm sure some people could consider the current level of detail as annoying already...

Heh... annoying is my forte. 😉 Seriously, though, think of it this way: You have to gather your sand before you can make a sandcastle, and the unwanted twigs and rocks get pulled out as you build. My usual tactic is to detail everything and then simplify as I go along.

On the flip side of that coin, what I have down for Chroma may come off as annoyingly overblown to someone who already understands the software and what the options do, but to a newbie who isn't familiar with any of this stuff, it's all gold if it helps them get up and running.

When it comes to docs, you're not writing them for yourself or anyone who understands the product.

You're writing them for everyone who doesn't understand it, in any capacity.

You're writing for those PC users who only see the Windows OS as "the computer" and their web browser as "the internet" (or as "Google," which is worse.)

Remember those docs need to be maintained too: four copies of the same info across four different model families will be incomplete and/or out-of-sync. Perhaps it'd be best not to dwell too much on each model page: mention straight to the point what is recommended, a few workarounds, and leave a link for another page or section explaining why is it like this, comparing and contrasting models, more detailed descriptions, etc.

...And you've just described a wiki. Is that what y'all want to shoot for? Branch-linked docs like wikis are great, but can easily spiral out of editorial control to degrees much higher than your worries about maintaining "four docs across four model families." Explaining model-relevant options per model page can be more useful to end users, but if a well-written, from-the-ground-up wiki is an option, and all the bases can be covered from soup to nuts, I'd be more than happy to contribute what and when I can.

MrSnichovitch · Oct 04 '25

Okay... So I finished running a bunch of experiments with --chroma-enable-t5-mask with pad size values ranging from 2 to 512, and there doesn't appear to be any benefit to using the T5 mask options at all. No observed differences in prompt adherence, RAM/VRAM usage, generation speed, image quality... big fat goose egg. I tested with both Flash and non-Flash models, random seeds and a couple of runs with a fixed seed. Nothing.

Do these parameters need to exist?

MrSnichovitch · Oct 04 '25

Now if I could just get my head around T5 masking... Right now I'm running 8 image batches with the --chroma-t5-mask-pad value doubling every run. Just started a batch with a value of 64, and for the life of me, I'm not seeing any reason for these options to exist at all.

Chroma started off as flux, which means in the beginning the new behavior was not as required, since it still knew the "old behavior". Disabling the dit mask also still worked fine into the 30s and maybe early 40s range of releases.

Green-Sky · Oct 04 '25

Chroma started off as flux, which means in the beginning the new behavior was not as required, since it still knew the "old behavior". Disabling the dit mask also still worked fine into the 30s and maybe early 40s range of releases.

So... these parameters are artifacts and can be ignored? If that's the case, you could have said so from the beginning. Regardless, no function = no documentation.

I've made a few changes to the Usage Tips. Please review as time allows.

MrSnichovitch · Oct 04 '25

Regarding the behavior of the guidance parameter with Chroma, I forced it to 0 because it's unsupported. In the reference implementation I used to make it work here, the guidance parameter was implemented on the library, but also forced to 0. I guess the creator of Chroma planned on maybe using distilled guidance at some point but never actually did it.

stduhpf · Oct 05 '25

Regarding the behavior of the guidance parameter with Chroma, I forced it to 0 because it's unsupported. In the reference implementation I used to make it work here, the guidance parameter was implemented on the library, but also forced to 0. I guess the creator of Chroma planned on maybe using distilled guidance at some point but never actually did it.

Hmm... Well, unless that's changed for Chroma1-HD/Chroma1-Base and sd.cpp needs to be updated to reflect that, the Usage Tip can stand as-is. Through user comments on Civitai, I know folks have been using distilled_guidance values with Chroma, so it's definitely worth noting the forced-zero behavior here.

MrSnichovitch · Oct 05 '25

Adding this in here as part of the overall "sand pile" for now. Will most likely edit/add to it later for clarification.


On Output Image Resolutions vs. Aspect Ratios

For Convolutional U-Net Models (UNet), such as Stable Diffusion 1.x, 2.x and SDXL:

UNet model types require that generated image resolutions be evenly divisible by 64 in each dimension.

While any combination of -H and -W values where each is set to a multiple of 64 can be used, trying to generate an image with a specific aspect ratio -- e.g., a landscape image at 4:3 or 16:9 -- can run into the limitations of this x64 requirement. The following table shows viable aspect ratios where each dimension is a multiple of 64:

Base Res   3:2 Ratio   4:3 Ratio   16:9 Ratio   16:10 Ratio
192        128         ---         ---          ---
256        ---         192         ---          ---
384        256         ---         ---          ---
512        ---         384         ---          320
576        384         ---         ---          ---
768        512         576         ---          ---
960        640         ---         ---          ---
1024       ---         768         576          640
1152       768         ---         ---          ---
1280       ---         960         ---          ---
1344       896         ---         ---          ---
1536       1024        1152        ---          960
1728       1152        ---         ---          ---
1792       ---         1344        ---          ---
1920       1280        ---         ---          ---
2048       ---         1536        1152         1280

For Diffusion Transformer (DiT) Models, such as SD3.x, FLUX.1 and Chroma:

Per #742, DiT model types have a less-restrictive requirement that generated image resolutions be evenly divisible by 16 in each dimension.

This allows a greater number of usable output resolutions that fit specific aspect ratios. The following table shows viable aspect ratios where each dimension is a multiple of 16:

Base Res   3:2 Ratio   4:3 Ratio   16:9 Ratio   16:10 Ratio
64         ---         48          ---          ---
128        ---         96          ---          80
192        128         144         ---          ---
256        ---         192         144          160
320        ---         240         ---          ---
384        256         288         ---          240
448        ---         336         ---          ---
512        ---         384         288          320
576        384         432         ---          ---
640        ---         480         ---          400
704        ---         528         ---          ---
768        512         576         432          480
832        ---         624         ---          ---
896        ---         672         ---          560
960        640         720         ---          ---
1024       ---         768         576          640
1088       ---         816         ---          ---
1152       768         864         ---          720
1216       ---         912         ---          ---
1280       ---         960         720          800
1344       896         1008        ---          ---
1408       ---         1056        ---          880
1472       ---         1104        ---          ---
1536       1024        1152        864          960
1600       ---         1200        ---          ---
1664       ---         1248        ---          1040
1728       1152        1296        ---          ---
1792       ---         1344        1008         1120
1856       ---         1392        ---          ---
1920       1280        1440        ---          1200
1984       ---         1488        ---          ---
2048       ---         1536        1152         1280

It should be noted that the common "Full HD" 16:9 resolution of 1920x1080 can't be set because 1080 isn't evenly divisible by 16. Setting the short dimension to 1088 (the nearest valid multiple of 16) will work as the closest alternative.
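
If you need to snap an arbitrary dimension to a valid value, plain shell arithmetic covers it (a sketch; use div=64 for U-Net models, div=16 for DiT):

    # Round 1080 up to the next valid multiple; prints 1088 with div=16.
    div=16; dim=1080
    echo $(( (dim + div - 1) / div * div ))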


Edit: Expanded to include U-Net model info and additional aspect ratios

MrSnichovitch · Oct 26 '25