Add support for Zero Terminal SNR models
I would greatly appreciate it if this plugin had support for zero terminal SNR models. ComfyUI already has good support for these models, and supporting them in this plugin would be relatively straightforward. From my findings, zero terminal SNR models are also inherently better at inpainting without task-specific training, and having them available here would make it MUCH easier to take advantage of that.
The requirements for this are as follows:
- Ability to load a model in v_prediction mode with zsnr=True. This is done through the same `ModelSamplingDiscrete` node that is already implemented and used alongside LCM. On that note, I don't believe this is compatible with the LCM LoRA, which quite handily solves the issue of how to deal with that conflict.
- One or more other nodes implemented which clamp latents in order to deal with the overexposure problem that happens with zero terminal SNR models (see the sketch after this list):
  - The first option is the traditional `RescaleCFG` node from the original Lin et al. paper that proposed zero terminal SNR as a fix for Stable Diffusion. It is very simple, and at the recommended setting of 0.7 it is practically guaranteed to make an image at least coherent. However, I often find that it leaves images less exposed than I would like, and it seems to reintroduce artifacting similar to what the learned mean brightness bias adds.
  - The second option, which I find can work better in a number of circumstances, is mcmonkey's SD Dynamic Thresholding node. Ignore the stated description of the extension for this purpose (although it would still be somewhat useful for its intended purpose). The `DynamicThresholdFull` node has a number of settings that can be useful for controlling exposure. It is rather complicated compared to CFG Rescale and it can be hard to find ideal settings for a given generation (if you wish to experiment, setting both modes to Half Cosine Up with min=4.0 is a good starting point, and Linear Up with min=1.0 can work for samplers where that doesn't). However, you can control overexposure much more directly with it, which I think makes it a desirable option for this plugin especially, since that can mean the difference between having a seam while inpainting or not. As a bonus, you can also use the extension for its intended purpose; it gives great results with higher CFG scales that are quite coherent. I understand that this would mean another dependency, but I would highly recommend implementing both.
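To make the requirements concrete, here is a minimal sketch of the model patch chain as a ComfyUI API prompt fragment; the node class names and input fields are as I remember them from current ComfyUI, and the checkpoint file name is a placeholder. The `DynamicThresholdFull` node from the Dynamic Thresholding extension would patch the model at the same point, in place of `RescaleCFG`.

```python
# Minimal sketch of the model patch chain as a ComfyUI API prompt fragment.
# Node class names and input fields are as I remember them from current
# ComfyUI; double-check against your version. The checkpoint name is a placeholder.
prompt_fragment = {
    "1": {  # load the zero terminal SNR / v-prediction checkpoint
        "class_type": "CheckpointLoaderSimple",
        "inputs": {"ckpt_name": "zsnr-model.safetensors"},
    },
    "2": {  # switch sampling to v_prediction with a zero terminal SNR sigma schedule
        "class_type": "ModelSamplingDiscrete",
        "inputs": {"model": ["1", 0], "sampling": "v_prediction", "zsnr": True},
    },
    "3": {  # clamp the CFG output to counter overexposure (0.7 is the recommended setting)
        "class_type": "RescaleCFG",
        "inputs": {"model": ["2", 0], "multiplier": 0.7},
    },
    # ["3", 0] then feeds the KSampler's "model" input as usual.
}
```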
If you need a zero terminal SNR model for testing these changes, I do have a very basic one that should be adequate: https://huggingface.co/drhead/ZeroDiffusion
I remember this paper/discovery, but somehow not much happened afterwards. By now there is such a huge amount of checkpoints trained on the original base, it seems difficult to get ZSNR/Vpred into the ecosystem. As long as there are no models that compete with popular checkpoints, there is no strong motivation to support them.
I tried it out a bit; the images look fine with RescaleCFG, but I had a hard time actually generating something darker. Tweaking DynamicThreshold works better, but it is not particularly intuitive. It feels like you already have to know what kind of image is going to be generated to find a good value?
I have not tried inpaint yet. I noticed that Control Lora (the lightweight version of the full ControlNet models) seems to have no effect, but the full model works. Currently the plugin ships with the Lora version for most control modes.
It would be nice to automatically detect these kinds of models and add the required nodes appropriately (and also disable things that are not supported). Is there a reliable way to distinguish these models from other SD1.5 checkpoints?
There are definitely more zero terminal SNR models than just mine. It doesn't take unreasonably long to adapt SD1.5 to zero terminal SNR; I have seen people do it on 3090s, and I have also seen people get somewhat passable results by training with zero terminal SNR on epsilon prediction, which takes much less time (though I haven't personally tested many of these and can't speak with any confidence on how to use them properly). I actually don't know much about the state of the model ecosystem overall. I know that at least most of the current furry models have been trained with zero terminal SNR, I know there is a fair number of realistic models trained with zero terminal SNR on either epsilon or velocity, and I strongly suspect that anime models are still mostly cargo-culting NAI model merges, but I'm sure at least someone has made one. I have noticed that a lot of people use regular LoRAs trained on epsilon prediction just fine with v-prediction models; TI embeddings are hit or miss because the text encoder seems to diverge a bit too much if it is unfrozen when retraining the model to v-prediction; and I have seen people use merge techniques like train difference to merge regular models with zero terminal SNR ones. So zero terminal SNR models are far from a walled garden separate from regular models. There is also more attention being given to zero terminal SNR noise schedules among researchers lately from what I have seen, and you will likely see more of it as text2vid and img2vid models get more developed (see Meta's Emu model paper: they used zero terminal SNR because the brightness bias accumulates over entire videos in T2V models and causes horrendous levels of brownout if you don't use ZSNR). Progress on rolling out more models trained on zero terminal SNR has been slow, partly because it does require explicit support and some retraining, and the fact that the authors never released their model weights really didn't help either, but I can assure you that they are very usable right now and do have concrete benefits even compared to offset noise models.
Yes, CFG Rescale is a bit messy and has in fact been the bane of my existence since I started working with zero terminal SNR models. I have long suspected that it was never an optimal implementation, and I'm currently trying to improve it, since I believe it should be possible to make a parameterless version. It does often cause apparent brownout or greyout, especially at higher settings, but it is still noticeable to some degree even at more modest ones. I have discovered that the likely cause is that it relies on an image-wide global normalization (the scaling is derived mainly from the standard deviation of all channels of every pixel taken together -- check out page 28 of this paper for a primer on how this type of thing creates problems throughout the whole model, and after you recover from learning how utterly fucked Stable Diffusion's architecture is, note that the same thing applies here on a smaller level). I have been testing alternatives to that, trying to figure out a way to derive an appropriate scaling factor from the sigma value for the current step and have the scaling fall off over time. I have had successes so far in vastly reducing brownout while improving coherency of both macro features and details and reducing artifacts/hallucinations, all without requiring any parameter inputs -- if I add FreeU into the loop as well (which would be a nice thing to add), the results I get are usually nearly flawless. I'm trying to see exactly how much I can get out of it, and I will probably pull request it to ComfyUI directly when I am satisfied with the results.
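For reference, the rescale step I'm describing looks roughly like this (a sketch of the formula from the Lin et al. paper as applied to the guided output; the exact tensor layout is from memory, so treat it as illustrative):

```python
import torch

def rescale_cfg(cond, uncond, cfg_scale, multiplier=0.7):
    # ordinary classifier-free guidance on the model outputs
    x_cfg = uncond + cfg_scale * (cond - uncond)
    # the global normalization I'm talking about: one standard deviation per
    # image, taken over every channel and every pixel together (dims C, H, W)
    std_pos = torch.std(cond, dim=(1, 2, 3), keepdim=True)
    std_cfg = torch.std(x_cfg, dim=(1, 2, 3), keepdim=True)
    x_rescaled = x_cfg * (std_pos / std_cfg)
    # blend the rescaled and plain CFG results by the multiplier (0.7 recommended)
    return multiplier * x_rescaled + (1.0 - multiplier) * x_cfg
```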
I have done a fair amount of research on ways to migrate existing tools and models over to ZSNR/vpred. While a number of tools do have broken compatibility (like merging with the stable-diffusion-inpainting model for the best possible inpainting control), a lot of the important ones still work fine (like most ControlNet models that I have tested). I haven't used Control Lora models personally and actually hadn't heard of them until now, so I can't advise on that; if performance is a major concern with the full ControlNet models, maybe consider the t2iadapter versions? I would also be concerned that if ControlLoRA fails on v-prediction models where vanilla ControlNet does not, it probably also has degraded performance (compared to regular ControlNet) on SD models that have diverged too much from the base model. I recall being less than satisfied with ControlNet Inpaint on ZSNR models when I tested it (the only ControlNet model with which I have had any real issues), but then again I have never held a high opinion of it, because it has always been worse than merging a model with the stable-diffusion-inpainting model (last time I tested, the performance gap between ControlNet inpainting and sd-inpainting was about the same as that between the base model and ControlNet inpainting, and the model merge has no performance overhead either, unlike the ControlNet). That is why I trained a zero terminal SNR inpainting model, but I suspect the overexposure issues are holding it back somewhat, and I haven't had time to test it in a while.
There is no way to detect these models automatically -- usually we just include a .yaml config alongside the model checkpoint for A1111 users that tells it to use v-prediction, and ComfyUI users just use the ModelSamplingDiscrete node. With the way the plugin is currently structured, though, I see no reason why it couldn't be configured on a per-style basis. For any given model there is only one correct answer for whether it uses v-prediction or not, and while the original noise schedule is close enough that it usually generates coherent results on zero terminal SNR models, the output is almost always better with the ZSNR schedule (not using it often results in failure modes like a thick black border around generated images, for example). Since styles are defined around a single model checkpoint, there's not much overhead from adding two checkboxes for this. ComfyUI's CFG rescale node also seems to assume v-prediction based on its comments, so it is useless without it, if that helps reduce UI clutter.
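To illustrate the two checkboxes, a hypothetical sketch of how per-style flags could gate the extra nodes (the StyleSampling fields, the workflow.add helper, and the wiring are illustrative, not the plugin's actual API):

```python
from dataclasses import dataclass

@dataclass
class StyleSampling:
    v_prediction: bool = False  # checkbox 1: checkpoint was trained with v-prediction
    zsnr: bool = False          # checkbox 2: use the zero terminal SNR sigma schedule

def apply_style_sampling(workflow, model, style: StyleSampling):
    # Only patch the model when the style asks for it. RescaleCFG is only
    # meaningful together with v-prediction, so it is gated on the same flag.
    if style.v_prediction:
        model = workflow.add("ModelSamplingDiscrete", model=model,
                             sampling="v_prediction", zsnr=style.zsnr)
        model = workflow.add("RescaleCFG", model=model, multiplier=0.7)
    return model
```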
The reason I ask for auto-detection, and also why I like ControlNet inpaint, is that users are going to download checkpoints, drop them into the folder, and expect it to work. They don't necessarily know what v-prediction and zero terminal SNR are. And really, why should they? It's an implementation detail. Nor is everyone aware of inpaint models; lots of checkpoints don't provide one, and inpainting is such a basic feature it should work out of the box. (File size bloat is actually an issue too: installer download size, docker images for ad-hoc GPU cloud deployment; separate inpaint models do not scale.)
So yes I can put a checkbox. But are people going to know they need to use it, or are they going to report a bug? Or worse, just conclude the model is no good.
Anyway, if it's not possible to detect, there is probably not much we can do about it. I tested a bit more directly in the plugin, and nothing really breaks, even the pipeline with inpaint CN + IP-adapter. Results look sketchy in some cases, but it's hard to compare because I'm not sure how much is attributable to the checkpoint and how much to e.g. IP-adapter not working super well with it. For an initial integration it's good enough though.
Understandable, we do get people practically every day who forget to download the config for zero terminal SNR models even with prominent warnings and wonder why the output doesn't look right.
A tangent since you mentioned it: the stable-diffusion-inpainting model is actually not something that needs to be provided for every model. Normally you just add the inpainting model to whatever SD1.5 model you have after subtracting the base SD1.5 weights from it. I only provided one with my model because I haven't been successful in getting inpaint merges to work with models retrained on v-prediction, and thought that retraining would be more reliable. Here is an example inpainting workflow (from a blank image and mask all) that does this:
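A rough sketch of that merge chain, written out in ComfyUI API form (the node names follow ComfyUI's built-in merge nodes as I remember them, and the checkpoint file names are placeholders, so treat it as an approximation of the actual graph):

```python
# Sketch of the add-difference merge as a ComfyUI API prompt, using the built-in
# model merging nodes. Checkpoint file names are placeholders.
merge_workflow = {
    "1": {  # any SD1.5 checkpoint you want to turn into an inpainting model
        "class_type": "CheckpointLoaderSimple",
        "inputs": {"ckpt_name": "my-model.safetensors"},
    },
    "2": {  # the base SD1.5 weights that get subtracted out
        "class_type": "CheckpointLoaderSimple",
        "inputs": {"ckpt_name": "v1-5-pruned-emaonly.safetensors"},
    },
    "3": {  # the stable-diffusion-inpainting checkpoint
        "class_type": "CheckpointLoaderSimple",
        "inputs": {"ckpt_name": "sd-v1-5-inpainting.safetensors"},
    },
    "4": {  # model - base: isolate what the custom checkpoint learned on top of SD1.5
        "class_type": "ModelMergeSubtract",
        "inputs": {"model1": ["1", 0], "model2": ["2", 0], "multiplier": 1.0},
    },
    "5": {  # inpainting + (model - base); inpainting goes first so its larger conv_in is kept
        "class_type": "ModelMergeAdd",
        "inputs": {"model1": ["3", 0], "model2": ["4", 0]},
    },
    # ["5", 0] then feeds the sampler, with inpainting conditioning built from
    # a blank image and a mask that covers everything.
}
```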
The first checkpoint can be any Stable Diffusion checkpoint. ComfyUI seems to manage memory for this well enough that, as long as you don't expect disk speed to be a major issue, you won't need to cache these models.
There is a small issue in that ComfyUI's merge nodes do not account for the shape mismatch on the conv_in layer like A1111 does (the inpainting model adds more channels to the input for the mask and masked image, so that layer is larger). Fortunately, that layer doesn't change much during training anyway, and if you use the inpainting model as the first input to the add operation it will just take the inpainting model's conv_in layer as-is and work fine, although it would probably work better if they were merged. I will look into pull requesting a fix for that in ComfyUI so it works better, but it should be fairly safe to implement now if you're inclined.
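To illustrate what a proper fix could look like, a hypothetical sketch of merging the mismatched conv_in weights directly (the function and the exact arithmetic are illustrative, not ComfyUI's actual code):

```python
import torch

# The inpainting UNet's conv_in takes 9 input channels (4 latent, 4 masked-image
# latent, 1 mask) while a regular SD1.5 conv_in takes 4, so the tensors can't be
# added directly. A fix could apply the add-difference merge only to the shared
# channels and keep the extra ones from the inpainting model.
def merge_conv_in(inpaint_w, custom_w, base_w):
    # inpaint_w: [320, 9, 3, 3]; custom_w and base_w: [320, 4, 3, 3]
    merged = inpaint_w.clone()
    # add-difference on the 4 channels all three models share
    merged[:, :4] += custom_w - base_w
    return merged
```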
> we do get people practically every day who forget to download the config for zero terminal SNR models even with prominent warnings and wonder why the output doesn't look right.
Something in the safetensors metadata would be enough to do it automatically. I'm not sure how realistic it is to get that into even some models, but not all of them need to support it for it to be useful: it would allow enabling the alternative sampling mode automatically for those that have it.
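As a rough illustration, reading such metadata would only take a few lines with the safetensors library; the key names below are hypothetical, since no such convention is established here:

```python
from safetensors import safe_open

# Sketch of reading checkpoint-embedded metadata. The key names are hypothetical
# examples of what such a convention could look like.
def detect_sampling(path: str):
    with safe_open(path, framework="pt", device="cpu") as f:
        meta = f.metadata() or {}
    v_prediction = meta.get("modelspec.prediction_type") == "v"
    zsnr = meta.get("modelspec.zsnr", "").lower() in ("true", "1", "yes")
    return v_prediction, zsnr
```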
A basic option is available in 1.11.0.