[Feature Request] (Long Term): Detailed documentation of stable-diffusion.cpp and its features
At some point when the addition of new features/model support and addressing bugs are at a lull, it would be of great benefit to this project and its users if the documentation got an overhaul.
Best case scenario would be to "ELI5" it; basically providing in-depth instructions and definitions for how stable-diffusion.cpp can be used with each supported model and all of its features, written so that the "lowest common denominator" (of which I am a card-carrying member... I can be as dense as a bag of hammers sometimes) can learn how to use the software. Believe me, I have ~15 years of aggregate experience writing technical process documentation, so I know how tedious it can be, but getting it done means productive users and fewer issues being filed. I'd love to do this for you, but I don't know anywhere near enough about any of this stuff--neither the programming aspect nor the technical functions of generative models--to be of any real use, otherwise I'd be bombarding you with .md files instead of posting this long-winded whatnot.
Please do not consider this as any sort of complaint about stable-diffusion.cpp. I really like this software and will continue using it as long as it's available. The reason behind this write up is because I don't have any real understanding of its full capabilities and would like to know what can be done with it. Some problems I'm having may even be due to unreported bugs, but I don't have enough information to make that determination.
A major peeve I have with a lot of indie open-source projects is their lack of documentation, or worse, documentation that assumes the end-users of their software already know enough about the intended function to not require instructions. The second scenario is pretty common, especially when it comes to generative AI-related projects. When it comes to image gen usage in general, the most common instance of this is people saying, "Just import my workflow," expecting the end-user to be running ComfyUI, which is of no use to anyone who either can't or won't run ComfyUI for whatever reason. Of course, even those who do run Comfy can run into problems when faced with a "switchboard" full of presets: they want to make a change, but have no idea what the rat's nest of interconnections before them does without a lot of reckless experimentation and frustrating time in a search engine. The worst part is that, due to the rise of AI-generated tutorial websites and "hallucinatory" answers coming from ChatGPT and the like, trying to find good answers to a complete newbie's questions is made that much more difficult.
Since stable-diffusion.cpp could be considered the antithesis of ComfyUI and doesn't have anywhere near the adoption rate, much more robust documentation geared toward those who are completely new to GenAI is really necessary, even if one considers this software to be "simple" by comparison to any other.
Arch Linux provides a prime example of the right way to go about software documentation. The entire philosophy of that particular OS is to "keep it simple," and yet they have the most in-depth and useful wiki of any Linux distro out there today. And that wiki is critical for them, because "simple" doesn't mean "easy." There's an expectation that Arch users need to learn how to use it properly, and they provide all the resources necessary for such self-education. (And no, I don't use Arch, BTW.)
As far as this project goes, the addition of the WAN video set is an example of where documentation is sorely lacking. Considering that WAN is the newest addition and pretty complicated in its own right, the lack of details is understandable, but plans to correct that should be part of the overall project.
While the example scripts by model are a good starting point for some folks, they don't explain what any of the settings do, whether or not they're necessary, or what other options might be of benefit. There's also no indication of resource usage per model, which would have been good to know before diving in. E.g., the standard "a lovely cat" T2V example for Wan2.2 TI2V 5B (specifically, Wan2.2-TI2V-5B-Q8_0.gguf was tested in my case) at 832x480 has a VAE compute buffer size of >20 GB(!). `--vae-tiling` doesn't appear to be an option for WAN (at least the option was ignored when set), which sort of makes sense considering it's producing sequential frame data for video and not a single static image, so `--vae-on-cpu` is necessary for my 16 GB GPU. It takes ~30 minutes to output a 2-second video because of this, so longer videos are out of the question if I want to use my PC for anything else.
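For reference, here's roughly the command shape I'm describing (a sketch; the text encoder/VAE arguments and any video-mode/frame-count flags are omitted because they vary by build -- copy them from the model's example script or check `sd --help`):

```sh
# Wan2.2 TI2V 5B text-to-video, 832x480, "a lovely cat"
# --vae-on-cpu moves the >20 GB VAE compute buffer to system RAM,
# since --vae-tiling appears to be ignored for WAN
./sd --diffusion-model ./Wan2.2-TI2V-5B-Q8_0.gguf \
     -p "a lovely cat" \
     -W 832 -H 480 \
     --vae-on-cpu
```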
Just to further illustrate the point, here are a few example questions I personally have about WAN:

- In general, which stable-diffusion.cpp options are relevant to WAN, and which ones are irrelevant? Which ones are WAN-specific and which ones are shared with other models?
  - E.g., as mentioned, `--vae-tiling` doesn't appear to work. What other settings are ignored?
- What back-ends support WAN?
  - E.g., Vulkan appears to be pretty weak for WAN 2.1 (requires both `--clip-on-cpu` and `--vae-on-cpu`) and doesn't seem to work at all for WAN 2.2. (Could be something similar to the problems with Vulkan in #851.) ROCm works for both 2.1/2.2, but requires `--clip-on-cpu` to avoid black output videos.
- What does each of the WAN-relevant options do, exactly?
  - E.g., what does `--flow-shift` control? What effect does setting a value have, what's the range, and what happens when you adjust the value higher or lower?
  - What about `--vace-strength` for the 2.1 VACE models? Links to external explainers/tutorials that don't leave the end-user scratching their head in confusion would be beneficial here.
- What output sizes does each WAN model support?
  - E.g., on WAN 2.1, I can only seem to get viable results with 416x240 or 832x480. Any other resolution in between produces gibberish video or throws errors. 416x240 requires high `--steps` though (50 on average) to avoid garbage output (body horror).
  - On WAN 2.2 TI2V 5B, 832x480 works (if I'm willing to wait 30 minutes for the VAE step to trudge along on my Ryzen 7 3700X). 624x360 almost works, at least it produces recognizable images in the output video, but requires high `--steps` like WAN 2.1. Any other resolution in between produces junk.
Of course, the lack of documentation isn't limited to WAN.
Chroma appears to have its own set of command options that aren't explained anywhere...
```
--chroma-disable-dit-mask       disable dit mask for chroma
--chroma-enable-t5-mask         enable t5 mask for chroma
--chroma-t5-mask-pad PAD_SIZE   t5 mask pad size of chroma
```
...but I've discovered that using both --guidance 0 and --chroma-disable-dit-mask when generating with Vulkan greatly reduces instances of body horror and increases prompt adherence. --chroma-disable-dit-mask is present in the example script for the model (doesn't mean I know what a DiT mask is or why turning it off is important), but I found --guidance 0 in the discussion chain of PR #696 and found adding it to be beneficial on my own. Could be an issue of correlation not equaling causation, but if it appears to work, I'm using it.
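To make that concrete, this is the shape of the Vulkan invocation I've been using (a sketch; the model and encoder file names are hypothetical placeholders):

```sh
# Chroma on Vulkan: --chroma-disable-dit-mask reduces body horror for me,
# and --guidance 0 seemed to help (though see the discussion below --
# it may be forced to 0 internally anyway)
./sd --diffusion-model ./chroma.gguf \
     --t5xxl ./t5xxl_fp16.safetensors \
     --vae ./ae.safetensors \
     -p "a lovely cat" \
     --chroma-disable-dit-mask \
     --guidance 0
```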
Control Net: Outside of "Control Net support with SD 1.5" in the Features description, there's really no further mention of it. As far as I'm aware, one should be able to use Control Net masks with SDXL, Chroma and other models, but no guidance is provided on what files need to be downloaded or how to set its options.
Okay... I'm sure if you've read through all this by now, you get the point. Thanks for sparing the time.
On CUDA I can't see a difference with `--guidance 0`. It should be set to 0 in code anyway.
https://github.com/leejet/stable-diffusion.cpp/blob/35843c77ea57d16d26ef0b61780c79bcb19c6040/flux.hpp#L976
In practice, it appears to default to 3.5 if unset. This is the sample_params line from the log output of a recent Chroma run (QuantStack version of Chroma1-HD-Flash-Q8_0.gguf from huggingface). Note the distilled_guidance value:
```
sample_params: (txt_cfg: 1.00, img_cfg: 1.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: smoothstep, sample_method: euler, sample_steps: 8, eta: 0.00, shifted_timestep: 0)
```
...so this may be a bug, unless that default value is getting silently changed to 0 somewhere in process.
Which kinda hints at the overall gist of this FR. It's not in the docs, so I don't know.
I can confirm that disabling the line @Green-Sky pointed to results in bad (pure noise) images, unless --guidance is kept very low (under 0.001 or so).
And looks like it's really not supposed to be supported in Chroma: huggingface/diffusers#12359 .
Image metadata is still being set, though, since that's on an upper code layer. Maybe we should have a way to query the library about the actual parameters used for each generation.
So it looks like my use of --guidance 0 had a placebo effect. This is very good to know, and makes for a good point to demo the type of info additions that would be useful in Chroma.md:
Usage Tips:
- Although Chroma is based on the FLUX.1 architecture, it is its own unique foundational model that functions differently than FLUX.1. Image generation with Chroma is known to be slower than FLUX.1 models, so longer generation times with Chroma vs. FLUX.1 are normal.
- Unlike FLUX.1, Chroma makes use of the T5XXL text encoder only. There is no need to set the `--clip_l` flag/file path when using Chroma.
- Unlike FLUX.1, Chroma does not use a "distilled guidance" scale. stable-diffusion.cpp silently sets the sample parameter for `distilled_guidance` to zero during use of Chroma models, and any related setting of the `--guidance` flag is ignored.
- Certain backends may require the use of the `--clip-on-cpu` and/or `--chroma-disable-dit-mask` flags to avoid black output images.
  - NOTE: As of the release of master-320-1c32fa0, the use of `--clip-on-cpu` should no longer be necessary to avoid black output.
- Users with low-VRAM GPUs may require the use of `--diffusion-fa`, `--vae-conv-direct`, and/or `--vae-tiling` to lower VRAM usage. These may also be necessary if increasing the output resolution of a generated image.
  - Note: Some backends may require `--diffusion-fa` to be run with `--chroma-disable-dit-mask` also set to avoid black output images. This is known to be true when using the ROCm backend.
- Chroma models work best with `--scheduler` set to either `simple` or `smoothstep`. Note that `smoothstep` can be considered a substitute for the "beta" scheduler, which has not been added to stable-diffusion.cpp (...yet #811).
- Chroma-Flash models -- fine-tuned versions of the Chroma Base model designed for speed -- are recommended to be run with `--cfg-scale 1 --steps 8 --scheduler smoothstep`. Increasing the `--steps` value can be used to improve output quality/prompt adherence.
- `--sampling-method euler` is commonly used with Chroma models, but experimenting with other samplers may yield better outputs.
  - Chroma-Flash models specified to have been trained with specific samplers should use those same samplers for the best quality output. E.g., Silveroxides' v47-flash-heun model should be run with `--sampling-method heun`.
- If using a Chroma-Flash model, the `--cfg-scale 1` setting will result in the negative prompt text being ignored. Negative prompt usage can be restored by setting the value to any number >1. E.g., `--cfg-scale 1.01` will trigger negative prompt use with no negative CFG impact on the image generation, but doing so will double the process time for both the text encoding and latent image generation steps. (See the example command after this list.)
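To tie the tips above together, a Chroma-Flash run might look like this (a sketch; file names are hypothetical placeholders):

```sh
# Chroma-Flash: low steps, cfg-scale just above 1 to keep the negative prompt active
./sd --diffusion-model ./chroma-flash.gguf \
     --t5xxl ./t5xxl_fp16.safetensors \
     --vae ./ae.safetensors \
     -p "a lovely cat" \
     -n "blurry, deformed" \
     --cfg-scale 1.01 --steps 8 \
     --scheduler smoothstep --sampling-method euler \
     --chroma-disable-dit-mask
```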
Please check my work. Did I get anything completely wrong here?
Edit: Added a couple of extra points that may not be common knowledge. Edit 2: Listed more options for low-VRAM users Edit... uh n?: Simplified the low-VRAM whatnot, clarified the Chroma-Flash sampler whatnot + other sundries.
- Users with low-VRAM GPUs may require the use of the --vae-tiling flag to lower VRAM usage during the latent decoding (VAE) stage. This flag may also be necessary if increasing the output resolution of a generated image.
Prefer using --vae-conv-direct to reduce memory usage. Depending on the backend it can also speed things up (or make it much slower (cuda)). Also can be combined with --vae-tiling, but you are probably reaching other memory limits before you have to resort to that.
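A minimal sketch of combining the two (hypothetical model path and prompt):

```sh
# stack the VAE memory savers; try --vae-conv-direct first,
# add --vae-tiling only if you still hit memory limits
./sd -m ./model.safetensors -p "a lovely cat" --vae-conv-direct --vae-tiling
```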
Nice! Forgot about that since it seems to slow Vulkan down too, so I don't use it.
I've updated the "low-VRAM" info above to be more robust and included --diffusion-fa as well. How's that grab you?
This option might crash if not supported by the backend.
Right, this is no longer true, as it now checks if it is supported. Is the help text outdated?
Also flash attention speeds things up, depending on backend.
Nice! Forgot about that since it seems to slow Vulkan down too, so I don't use it.
Ah right, on my nvidia, vulkan with cm2 is actually faster than the current non conv-direct cuda version...
Right, this is no longer true, as it now checks if it is supported. Is the help text outdated?
Yes, it is. Do the support checks run for both --diffusion-fa and --vae-conv-direct?
Also flash attention speeds things up, depending on backend.
Noted. Thank you. Will update.
Oh, yeah... Can you explain --chroma-enable-t5-mask and --chroma-t5-mask-pad PAD_SIZE? If they're relevant options, I'd like to describe them here too. Do they work if --chroma-disable-dit-mask is set or ?
Right, this is no longer true, as it now checks if it is supported. Is the help text outdated?
Yes, it is. Do the support checks run for both `--diffusion-fa` and `--vae-conv-direct`?
No, conv-direct has no support check. But I think every major backend has it now? (vulkan, opencl and slow cuda)
Oh, yeah... Can you explain `--chroma-enable-t5-mask` and `--chroma-t5-mask-pad PAD_SIZE`? If they're relevant options, I'd like to describe them here too.
pad tells the ... diffusion model(?) how many pad tokens to attend to in the attention part. Functionally this is similar to an attention sink, or the <bos> hack facebook discovered with llama(?). The value chroma uses is 1, which is also what we default to. This is different than what flux does (0).
@stduhpf I'm not sure the pad count is actually for t5, is it not for the dit mask? or was it for both? (it clearly affects the dit mask)
Do they work if `--chroma-disable-dit-mask` is set or ?
They should be separate, and as you can see by the naming, one is enabled, the other not by default.
No, conv-direct has no support check. But I think every major backend has it now? (vulkan, opencl and slow cuda)
Okay. I'm leaning toward leaving the crash warning in for `--vae-conv-direct` just to cover the edge case, but will remove it from `--diffusion-fa`.
Oh, yeah... Can you explain `--chroma-enable-t5-mask` and `--chroma-t5-mask-pad PAD_SIZE`? If they're relevant options, I'd like to describe them here too.

`pad` tells the ... diffusion model(?) how many pad tokens to attend to in the attention part. Functionally this is similar to an attention sink, or the `<bos>` hack facebook discovered with llama(?). The value chroma uses is 1, which is also what we default to. This is different than what flux does (0).
I have zero clue as to what an attention sink is. You wouldn't happen to have a link to a reliable source that would explain this or specifically the t5-mask to a layman, would you? (Remember... I can be dense as a bag of hammers sometimes.)
Most of the information about `cfg-scale` and negatives is not specific to Chroma; perhaps a better place would be a 'general guidelines' file or section?
The cfg-scale info applies to any model that uses CFG, like SD1.5 and SDXL. And many models distilled with other techniques (LCM, PCM, Hyper, Lightning, DMD2, Turbo...) behave the same way as Chroma-Flash: requiring low or minimal cfg-scale, and thus ignoring the negative prompt.
By the way, the info about cfg-scale and negatives is actually true for any model that uses CFG; it's just that low cfg-scale typically only gives good results for distilled models. And negatives are still pretty much ignored with low (but above one) cfg-scale values, although the higher value can affect the image in other ways.
You may want to mention that 8-step Heun takes as long as 15-step Euler. Again, not specific to Chroma: in general, n-step second-order samplers (Heun, DPM2) take as long as (2n-1)-step first-order ones (Euler, DDIM, DPM++2M,...), so any speed/quality comparison should take that into account.
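Spelled out in model calls (the expensive part), an $n$-step run costs:

$$
\text{calls}_{\text{Euler}}(n) = n, \qquad \text{calls}_{\text{Heun}}(n) = 2n - 1
$$

so 8-step Heun makes $2 \cdot 8 - 1 = 15$ model calls, exactly the cost of 15-step Euler.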
Forgot to mention: for Chroma, diffusion-fa doesn't work without chroma-disable-dit-mask. I don't know if there's a backend that always requires chroma-disable-dit-mask (plain Vulkan works fine for me; ROCm needs clip-on-cpu, but that's a T5 issue, not specific to Chroma).
Tackling this one first:
Forgot to mention: for Chroma, diffusion-fa doesn't work without chroma-disable-dit-mask. I don't know if there's a backend that always requires chroma-disable-dit-mask (plain Vulkan works fine for me; ROCm needs clip-on-cpu, but that's a T5 issue, not specific to Chroma).
Yes, ROCm requires --chroma-disable-dit-mask to be set when using --diffusion-fa. Completely forgot about this, since I primarily use Vulkan for Chroma, but had --diffusion-fa set as an option when I copied the base cmd from a WAN test run over to a Chroma run and discovered my output was black.
Most of the information about `cfg-scale` and negatives is not specific to Chroma; perhaps a better place would be a 'general guidelines' file or section?
A "general guidelines" .md is definitely a goal, but if there's one important thing I've learned about process documentation, it's that end-users are going to focus in on the instructions for the feature(s) they want to use, and have a tendency to avoid cross-links to other docs as either confusing or annoying. Redundancy is part of the tedium of the work, but it pays off in the long run by reducing the number of questions already answered, just on another page they didn't see. Basic human nature can be a PITA. ;-)
In my mind, "general guidelines" would be a much more detailed version of the current help text + some basics about downloading and storing models, just so the command line text and parameter flags get a solid, fundamental explanation. From there, each model page would get its own "specific" guidelines/tips like I'm doing with Chroma here.
You may want to mention that 8-step Heun takes as long as 15-step Euler. Again, not specific to Chroma: in general, n-step second-order samplers (Heun, DPM2) take as long as (2n-1)-step first-order ones (Euler, DDIM, DPM++2M,...), so any speed/quality comparison should take that into account.
Let me see if I understand this... `heun`, as an "n-step second-order" sampler type at `--steps 8`, takes the same amount of time to process as a first-order sampler like `euler` at `--steps 15` (because `heun` effectively does (2n-1) steps). Does this second-order sampling roughly translate to the output quality of `--sampling-method heun --steps 8` being close or equivalent to `--sampling-method euler --steps 15`, or am I on drugs? I know that `dpm++2m` at `--steps 8` with Chroma1-HD-Flash-Q8_0.gguf looks like garbage compared to both `heun` and `euler`, so I don't use it.
I have zero clue as to what an attention sink is. You wouldn't happen to have a link to a reliable source that would explain this or specifically the t5-mask to a layman, would you? (Remember... I can be dense as a bag of hammers sometimes.)
Chroma readme: https://huggingface.co/lodestones/Chroma#mmdit-masking
https://www.evanmiller.org/attention-is-off-by-one.html and other variants (like what openai released with their recent "oss"-gpt) are all doing similar things.
Let me see if I understand this... `heun`, as an "n-step second-order" sampler type at `--steps 8`, takes the same amount of time to process as a first-order sampler like `euler` at `--steps 15` (because `heun` effectively does (2n-1) steps).
Yes; they call the model (which is the expensive part) twice for each step (except for the last one).
Does this second-order sampling roughly translate to the output quality of `--sampling-method heun --steps 8` being close or equivalent to `--sampling-method euler --steps 15`, or am I on drugs? I know that `dpm++2m` at `--steps 8` with Chroma1-HD-Flash-Q8_0.gguf looks like garbage compared to both `heun` and `euler`, so I don't use it.
It is a bit of a general trend, but far from a rule. That heavily depends on each model+sampler combination. Even "more steps -> more quality" isn't always true.
But my main point was: when comparing quality for different samplers, you need to consider resource usage, and although that is proportional to the number of steps, it isn't the same thing. If e.g. someone says "with this model, Heun produces a good image in just 8 steps, while Euler requires 13 for the same quality", it sounds like Heun is better, but Euler reaches that same quality in ~15% less time 🙂
And a model trained for 8-step Heun (like is the case for some Chroma releases) could produce with it a quality level that Euler wouldn't reach in any number of steps. It... depends.
end-users are going to focus in on the instructions for the feature(s) they want to use, and have a tendency to avoid cross-links to other docs as either confusing or annoying
That's not an issue with the availability of detailed documentation, is it? 😉 And I'm sure some people could consider the current level of detail as annoying already...
Remember those docs need to be maintained too: four copies of the same info across four different model families will be incomplete and/or out-of-sync. Perhaps it'd be best not to dwell too much on each model page: mention straight to the point what is recommended, a few workarounds, and leave a link for another page or section explaining why is it like this, comparing and contrasting models, more detailed descriptions, etc.
It is a bit of a general trend, but far from a rule. That heavily depends on each model+sampler combination. Even "more steps -> more quality" isn't always true.
But my main point was: when comparing quality for different samplers, you need to consider resource usage, and although that is proportional to the number of steps, it isn't the same thing. If e.g. someone says "with this model, Heun produces a good image in just 8 steps, while Euler requires 13 for the same quality", it sounds like Heun is better, but Euler reaches that same quality in ~15% less time 🙂
And a model trained for 8-step Heun (like is the case for some Chroma releases) could produce with it a quality level that Euler wouldn't reach in any number of steps. It... depends.
Excellent! This is exactly the kind of information I was completely oblivious to, and you've explained it beautifully. Thank you!
Now if I could just get my head around T5 masking... Right now I'm running 8 image batches with the --chroma-t5-mask-pad value doubling every run. Just started a batch with a value of 64, and for the life of me, I'm not seeing any reason for these options to exist at all.
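For anyone curious, the sweep looks something like this (a sketch; paths and prompt are hypothetical placeholders, and `-s` pins the seed so runs are comparable):

```sh
# double --chroma-t5-mask-pad each run, fixed seed for apples-to-apples output
for pad in 1 2 4 8 16 32 64 128 256 512; do
  ./sd --diffusion-model ./chroma.gguf --t5xxl ./t5.safetensors --vae ./ae.safetensors \
       -p "a lovely cat" -s 42 \
       --chroma-enable-t5-mask --chroma-t5-mask-pad "$pad" \
       -o "out_pad_${pad}.png"
done
```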
end-users are going to focus in on the instructions for the feature(s) they want to use, and have a tendency to avoid cross-links to other docs as either confusing or annoying
That's not an issue with the availability of detailed documentation, is it? 😉 And I'm sure some people could consider the current level of detail as annoying already...
Heh... annoying is my forte. 😉 Seriously, though, think of it this way: You have to gather your sand before you can make a sandcastle, and the unwanted twigs and rocks get pulled out as you build. My usual tactic is to detail everything and then simplify as I go along.
On the flip side of that coin, what I have down for Chroma may come off as annoyingly overblown to someone who already understands the software and what the options do, but to a newbie who isn't familiar with any of this stuff, it's all gold if it helps them get up and running.
When it comes to docs, you're not writing them for yourself or anyone who understands the product.
You're writing them for everyone who doesn't understand it, in any capacity.
You're writing for those PC users who only see the Windows OS as "the computer" and their web browser as "the internet" (or as "Google," which is worse.)
Remember those docs need to be maintained too: four copies of the same info across four different model families will be incomplete and/or out-of-sync. Perhaps it'd be best not to dwell too much on each model page: mention straight to the point what is recommended, a few workarounds, and leave a link for another page or section explaining why is it like this, comparing and contrasting models, more detailed descriptions, etc.
...And you've just described a wiki. Is that what y'all want to shoot for? Branch-linked docs like wikis are great, but can easily spiral out of editorial control to degrees much higher than your worries about maintaining "four docs across four model families." Explaining model-relevant options per model page can be more useful to end users, but if a well-written, from-the-ground-up wiki is an option, and all the bases can be covered from soup to nuts, I'd be more than happy to contribute what and when I can.
Okay... So I finished running a bunch of experiments with --chroma-enable-t5-mask with pad size values ranging from 2 to 512, and there doesn't appear to be any benefit to using the T5 mask options at all. No observed differences in prompt adherence, RAM/VRAM usage, generation speed, image quality... big fat goose egg. I tested with both Flash and non-Flash models, random seeds and a couple of runs with a fixed seed. Nothing.
Do these parameters need to exist?
Now if I could just get my head around T5 masking... Right now I'm running 8 image batches with the `--chroma-t5-mask-pad` value doubling every run. Just started a batch with a value of 64, and for the life of me, I'm not seeing any reason for these options to exist at all.
Chroma started off as flux, which means that early on the new behavior wasn't strictly required, since the model still knew the "old behavior". Disabling the dit mask was also still fine into the 30s and maybe early 40s range of releases.

Chroma started off as flux, which means that early on the new behavior wasn't strictly required, since the model still knew the "old behavior". Disabling the dit mask was also still fine into the 30s and maybe early 40s range of releases.
So... these parameters are artifacts and can be ignored? If that's the case, you could have said so from the beginning. Regardless, no function = no documentation.
I've made a few changes to the Usage Tips. Please review as time allows.
Regarding the behavior of the guidance parameter with Chroma, I forced it to 0 because it's unsupported. In the reference implementation I used to make it work here, the guidance parameter was implemented on the library, but also forced to 0. I guess the creator of Chroma planned on maybe using distilled guidance at some point but never actually did it.
Regarding the behavior of the `guidance` parameter with Chroma, I forced it to 0 because it's unsupported. In the reference implementation I used to make it work here, the guidance parameter was implemented on the library, but also forced to 0. I guess the creator of Chroma planned on maybe using distilled guidance at some point but never actually did it.
Hmm... Well, unless that's changed for Chroma1-HD/Chroma1-Base and sd.cpp needs to be updated to reflect that, the Usage Tip can stand as-is. Through user comments on Civit.ai, I know folks have been using distilled_guidance values with Chroma, so it's definitely worth noting the zero function here.
Adding this in here as part of the overall "sand pile" for now. Will most likely edit/add to it later for clarification.
On Output Image Resolutions vs. Aspect Ratios
For Convolutional U-Net Models (UNet), such as Stable Diffusion 1.x, 2.x and SDXL:
UNet model types require that generated image resolutions be evenly divisible by 64 in each dimension.
While any combination of -H and -W values where each is set to a multiple of 64 can be used, trying to generate an image with a specific aspect ratio -- e.g., a landscape image at 4:3 or 16:9 -- can run into limitations of this x64 requirement. The following table shows viable aspect ratios where each dimension is a multiple of 64 (the "Base Res" column is the long side; each ratio column gives the matching short side):
| Base Res | 3:2 Ratio | 4:3 Ratio | 16:9 Ratio | 16:10 Ratio |
|---|---|---|---|---|
| 192 | 128 | --- | --- | --- |
| 256 | --- | 192 | --- | --- |
| 384 | 256 | --- | --- | --- |
| 512 | --- | 384 | --- | 320 |
| 576 | 384 | --- | --- | --- |
| 768 | 512 | 576 | --- | --- |
| 960 | 640 | --- | --- | --- |
| 1024 | --- | 768 | 576 | 640 |
| 1152 | 768 | --- | --- | --- |
| 1280 | --- | 960 | --- | --- |
| 1344 | 896 | --- | --- | --- |
| 1536 | 1024 | 1152 | --- | 960 |
| 1728 | 1152 | --- | --- | --- |
| 1792 | --- | 1344 | --- | --- |
| 1920 | 1280 | --- | --- | --- |
| 2048 | --- | 1536 | 1152 | 1280 |
For Diffusion Transformer (DiT) Models, such as SD3.x, FLUX.1 and Chroma:
Per #742, DiT model types have a less-restrictive requirement that generated image resolutions be evenly divisible by 16 in each dimension.
This allows for a greater number of usable output resolutions that fit specific aspect ratios. The following table shows viable aspect ratios where each dimension is a multiple of 16 (same layout as above: long side in the "Base Res" column, matching short side per ratio):
| Base Res | 3:2 Ratio | 4:3 Ratio | 16:9 Ratio | 16:10 Ratio |
|---|---|---|---|---|
| 64 | --- | 48 | --- | --- |
| 128 | --- | 96 | --- | 80 |
| 192 | 128 | 144 | --- | --- |
| 256 | --- | 192 | 144 | 160 |
| 320 | --- | 240 | --- | --- |
| 384 | 256 | 288 | --- | 240 |
| 448 | --- | 336 | --- | --- |
| 512 | --- | 384 | 288 | 320 |
| 576 | 384 | 432 | --- | --- |
| 640 | --- | 480 | --- | 400 |
| 704 | --- | 528 | --- | --- |
| 768 | 512 | 576 | 432 | 480 |
| 832 | --- | 624 | --- | --- |
| 896 | --- | 672 | --- | 560 |
| 960 | 640 | 720 | --- | --- |
| 1024 | --- | 768 | 576 | 640 |
| 1088 | --- | 816 | --- | --- |
| 1152 | 768 | 864 | --- | 720 |
| 1216 | --- | 912 | --- | --- |
| 1280 | --- | 960 | 720 | 800 |
| 1344 | 896 | 1008 | --- | --- |
| 1408 | --- | 1056 | --- | 880 |
| 1472 | --- | 1104 | --- | --- |
| 1536 | 1024 | 1152 | 864 | 960 |
| 1600 | --- | 1200 | --- | --- |
| 1664 | --- | 1248 | --- | 1040 |
| 1728 | 1152 | 1296 | --- | --- |
| 1792 | --- | 1344 | 1008 | 1120 |
| 1856 | --- | 1392 | --- | --- |
| 1920 | 1280 | 1440 | --- | 1200 |
| 1984 | --- | 1488 | --- | --- |
| 2048 | --- | 1536 | 1152 | 1280 |
It should be noted that the common "Full HD" 16:9 resolution of 1920x1080 can't be set because 1080 isn't evenly divisible by 16. Setting the short dimension to 1088 will work as the nearest valid alternative (at a slightly off-16:9 ratio).
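For other ratios or size caps, the tables above can be regenerated with a small script along these lines (a sketch; set `div=64` for U-Net models, `div=16` for DiT models, and `rw`/`rh` to the target ratio):

```sh
# print all exact-ratio sizes up to 2048 where both sides are multiples of $div
div=16; rw=16; rh=9
for (( w = div; w <= 2048; w += div )); do
  h=$(( w * rh / rw ))
  (( w * rh % rw == 0 && h % div == 0 )) && echo "${w}x${h}"
done
# with div=16 and 16:9 this prints 256x144 ... 2048x1152; 1920x1080 never
# appears because 1080 % 16 != 0, hence the 1920x1088 near-ratio fallback
```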
Edit: Expanded to include U-Net model info and additional aspect ratios