Merging Stable diffusion pipelines just makes sense
Following the Philosophy, it has been decided to keep separate Stable Diffusion pipelines for txt-to-img, img-to-img and inpainting.
Here is the result:
- PR #549 : code duplicated 4 times (onnx included)
- PR #537 : modification only applied to txt-to-img pipeline
- PR #521 : code duplicated 4 times (onnx included)
- PR #511 : modification only applied to txt-to-img pipeline
- PR #506 : modification only applied to txt-to-img pipeline
- PR #472 : code duplicated 4 times (onnx included)
- PR #371 : modification only applied to txt-to-img pipeline
It's getting even worse! Now with onnx and the obvious request to have img-to-img and inpainting pipelines with onnx (#510), that would mean that every modification will need to be duplicated 6 times in total, with 6 times the number of tests.
This is untenable...
I just noticed that in PR #549, the code was not duplicated exactly the same way. Sometimes multiplying by `batch_size`, sometimes not.
Proof again that duplicating code is never good and introduces bugs.
I'm new here, but I also noticed that most of the code is just duplicated, which makes it very hard when you want to make changes to all three :(
Good suggestion. I agree with you.
Hi @leszekhanusz, you raise an important question, as the number of Stable Diffusion pipelines is rising and we don't have a robust copy mechanism like the one in `transformers` yet.
Inviting @patil-suraj @patrickvonplaten @pcuenca to discuss possible solutions here :)
In the meantime, I'll address the mismatching pipeline issues manually, thank you for a thorough compilation!
We can just add the copy mechanism we have in Transformers.
Quite strongly against creating a unified pipeline as it makes it impossible down the road to define classes like `AutoPipelineForText2Image`.
Also note that default values cannot be coherently defined anymore. In text2image we use 50 inference steps, in image2image just 30 => I don't want to clutter the code with `if type == ...` statements.
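For illustration, this is the kind of branching a unified pipeline would force just to keep today's defaults (a hypothetical sketch, not actual diffusers code):

```python
# Hypothetical sketch: task-dependent defaulting inside a unified __call__
# (illustrative only, not actual diffusers code)
def __call__(self, prompt, init_image=None, num_inference_steps=None, **kwargs):
    if num_inference_steps is None:
        # 50 steps is the text2image default, 30 the image2image default
        num_inference_steps = 50 if init_image is None else 30
```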
We'll surely not merge the onnx and non-onnx pipelines, because one should not need onnx installed to be able to use a pipeline, and optimizations might very well be pipeline specific.
For more general arguments regarding this design see also here: https://huggingface.co/blog/transformers-design-philosophy
Overall:
- Happy to add a "one-size-fits-it-all" community pipeline which we'll also support in `pip install diffusers` in 1-2 weeks by mirroring community pipelines on the Hub. If people use this extensively, we can also add it to `src/diffusers`.
- User experience is much more important for us than maintenance. We should not write code that is only concerned with improving our maintenance, we should mostly write code that is easy to use and won't have us run into big backwards breaking problems in the future.
What do you think @leszekhanusz ?
I've read your blog post and you make some good points but some of it seems to contradict you:
we decided to adopt a different design principle which we like to call the single model file policy. The single model file policy states that all code necessary for the forward pass of a model is in one and only one file.
Here we have a single model (Stable diffusion) which has 3 different pipelines (ignoring onnx for now) in different files doing more or less the same thing. We could argue that having all the code in a single pipeline is actually following your single model file policy.
Did you look at what the code looks like in the unified pipeline? You have `if init_image is None:` and `if mask_image is not None:` conditions and it's really not complicated to follow those branches.
On the contrary, now it's getting easier to see the difference between img-to-img and inpainting than to try to compare code in different files.
For someone like me, who is quite new to diffusion models, what's complicated is understanding the model: what classifier-free guidance is, schedulers, torch broadcasting and the shapes of tensors. Tuning out a block behind `if mask_image is not None:` is a piece of cake in comparison, and it allows you to see more easily the difference between the different usages.
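Schematically, the unified `__call__` reads roughly like this (a sketch; the helper names `prepare_noise`, `encode_and_noise`, `denoise_step`, `apply_mask` and `decode` are made up for illustration):

```python
# Rough skeleton of a unified Stable Diffusion __call__ (illustrative only;
# every helper method here is a hypothetical name, not a real diffusers API)
def __call__(self, prompt, init_image=None, mask_image=None, **kwargs):
    if init_image is None:
        latents = self.prepare_noise(**kwargs)           # text-to-image: pure noise
    else:
        latents = self.encode_and_noise(init_image)      # image-to-image: noised init image
    for t in self.scheduler.timesteps:
        latents = self.denoise_step(latents, t, prompt)  # denoising loop, shared by all tasks
        if mask_image is not None:
            # inpainting: restore the unmasked region after every step
            latents = self.apply_mask(latents, mask_image, t)
    return self.decode(latents)
```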
We can just add the copy mechanism we have in Transformers.
To be honest, from an external eye, this seems like a hack to me.
Quite strongly against creating a unified pipeline as it makes it impossible down the road to define classes like `AutoPipelineForText2Image`.
I don't see why. If you set `init_image` and `mask_image` to None, then the pipeline is the same as the text-to-img pipeline.
Also note that default values cannot be coherently defined anymore. In text2image we use 50 inference steps, in image2image just 30 => I don't want to clutter the code with `if type == ...` statements.
First of all, for most use cases the user will provide the number of inference steps themselves, as it is quite an important parameter which they would probably like to make configurable.
But if you want to keep backward compatibility and allow custom defaults for each pipeline, I propose creating a `StableDiffusionCommonPipeline` which would contain the unified pipeline. Then the current `StableDiffusionPipeline`, used only for text-to-image, could be replaced by something like:
```python
class StableDiffusionPipeline(StableDiffusionCommonPipeline):
    def __call__(
        self,
        prompt: Union[str, List[str]],
        height: Optional[int] = 512,
        width: Optional[int] = 512,
        num_inference_steps: Optional[int] = 50,
        guidance_scale: Optional[float] = 7.5,
        eta: Optional[float] = 0.0,
        generator: Optional[torch.Generator] = None,
        latents: Optional[torch.FloatTensor] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = True,
        **kwargs,
    ):
        super().__call__(
            prompt=prompt,
            height=height,
            ...
            init_image=None,
            mask_image=None,
            ...
        )
```
We'll surely not merge the onnx and non-onnx pipelines, because one should not need onnx installed to be able to use a pipeline, and optimizations might very well be pipeline specific.
Sure, I agree.
Happy to add a "one-size-fits-it-all" community pipeline which we'll also support in `pip install diffusers` in 1-2 weeks by mirroring community pipelines on the Hub. If people use this extensively, we can also add it to `src/diffusers`.
That's great! From the user's point of view that is really important. That "one-size-fits-all" pipeline should really be inside `src/diffusers` so that it can be integrated easily in other projects and automatically updated with an upgrade of diffusers.
User experience is much more important for us than maintenance. We should not write code that is only concerned with improving our maintenance, we should mostly write code that is easy to use and won't have us run into big backwards breaking problems in the future.
That's good to hear, but the user experience could be better right now.
Another aspect to this, from a user experience perspective, is that right now naively loading all pipelines via `from_pretrained` takes three times the GPU memory. If you want to use all three in the same script or gradio interface, you have to either load the models separately and init the pipelines manually via the initializer, or juggle the pipelines in and out of your GPU memory.
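For illustration, a sketch of the naive approach (the weights end up loaded three times) and the juggling workaround:

```python
from diffusers import (
    StableDiffusionPipeline,
    StableDiffusionImg2ImgPipeline,
    StableDiffusionInpaintPipeline,
)

model_id = "CompVis/stable-diffusion-v1-4"

# Naive: each from_pretrained call loads its own full copy of the weights
text2img = StableDiffusionPipeline.from_pretrained(model_id).to("cuda")
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(model_id).to("cuda")
inpaint = StableDiffusionInpaintPipeline.from_pretrained(model_id).to("cuda")

# ...or the juggling workaround: keep only one pipeline on the GPU at a time
img2img.to("cpu")
inpaint.to("cpu")
image = text2img("a photo of an astronaut riding a horse").images[0]
```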
@okalldal it's possible to merge all the existing pipelines into a single one. Available here.
Very good points here! Thanks a lot for writing all of this down. I still don't agree with most of it here though :sweat_smile: Writing my thoughts down a final time before giving in haha.
We could argue that having all the code in a single pipeline is actually following your single model file policy.

Clear no. The single model file policy means exactly not to build one model that fits it all, but to have multiple modeling files. This was misunderstood. It was written for `transformers`, so it cannot be applied one-to-one here.
On the contrary, now it's getting easier to see the difference between img-to-img and inpainting than to try to compare code in different files.

Don't agree either. We now cannot set sensible defaults anymore, and we clutter the `__call__` with input args (like `strength`, `latents`, `mask_image`, ...). How would one know what guidance scale should be used for the pipelines, what num inference steps to set? It's not true that most people understand `num_inference_steps` well enough that it doesn't require a good default. People will mix functionality for different tasks and we'll get issues about it, I believe. Let's see!
A point that is not taken into account a lot here IMO: users of `diffusers` include both "end-users" and "packages". "Packages" are the ones that build on top of `diffusers`, and many good "packages" are more important than direct "end-users". PyTorch is so highly used because every ML package can super easily build on it and it has an intuitive design, ... not because they started adding a stable diffusion model to their core library.
- "end-users" usually very much prefer the "one-size-fits-it-all" solutions, which makes a lot of sense! Writing the same code every time for your use case is annoying! But note that we will never be able to fulfill everyone's use case!
- "packages" are libraries that integrate this library into other libraries & Python packages and thus favor a clean logical API: a clear distinction between different tasks, a unified API, no backwards breaking problems. The unified API will be broken with this pipeline because it's not a "task" pipeline anymore but a "diffusion model specific" pipeline class. It won't fit in this table: https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines#pipelines-summary -> we'd have to give it multiple task names. This will have more consequences in the future. It won't fit in an `AutoText2ImagePipeline` class. E.g. having all of image-to-image, in-painting and text2image in one class is very stable diffusion specific. Having everything in one pipeline makes us prone to future breaking changes.

Most arguments here fall away for the "library" users:
- I can build a 10-line one-size-fits-it-all pipeline that uses no more memory:
```python
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline, StableDiffusionInpaintPipeline
from diffusers import DiffusionPipeline

stable_diffusion = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
sub_models = {k: v for k, v in vars(stable_diffusion).items() if not k.startswith("_")}

text2img = StableDiffusionPipeline(**sub_models)
img2img = StableDiffusionImg2ImgPipeline(**sub_models)
inpaint = StableDiffusionInpaintPipeline(**sub_models)

def all_in_one(type, *args, **kwargs):
    if type == "text2img":
        return text2img(*args, **kwargs)
    ...
```
Note: `{k: v for k, v in vars(stable_diffusion).items() if not k.startswith("_")}` is very hacky, but we could very well provide a nice function for this in the downstream `DiffusionPipeline` class (cc @patil-suraj @anton-l). I think this would be useful anyways (a sketch of what it could look like follows this list).
- Why is this better? This way the `diffusers` repo can focus on making clean task-specific pipelines and doesn't have to put the burden of maintaining higher-level functionality on us. E.g. now imagine stable diffusion gains two more tasks that can be done with the same model => having an official one-fits-it-all pipeline puts us in a tough spot: "where do we draw the line?", "can we add the new features without making the code unreadable? Do we have to break existing behavior?" => it becomes obvious that the one-size-fits-it-all will quickly break.
- I think a big problem is that we don't advertise well enough how `diffusers` is intended to be used. E.g. it's really not that hard to build something on top of `diffusers` that doesn't require 3x the memory usage IMO.
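Such a helper could look roughly like this (a hypothetical sketch; the name and exact filtering are up for discussion):

```python
# Hypothetical `components` helper on DiffusionPipeline (illustrative only,
# not an existing API)
class DiffusionPipeline:
    @property
    def components(self) -> dict:
        # Expose the registered sub-models so sibling pipelines can be
        # constructed without reloading any weights.
        return {k: v for k, v in vars(self).items() if not k.startswith("_")}

# Hypothetical usage:
# img2img = StableDiffusionImg2ImgPipeline(**stable_diffusion.components)
```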
Nevertheless, I do want this library to be community driven, but I think the above aspects and the long-term vision for the library are not taken into account enough.
=> So if you want, happy to follow your advice, but I'd add it as an "experimental" pipeline. However I won't maintain the pipeline, and I'll ping you quite a bit on issues I think @leszekhanusz haha ;-). Thanks for taking the time to write everything down though - "ease-of-use" for such a young library might indeed be the winning argument at the end of the day.
Think this is a cool discussion, so I will also ask the Twitter community for their opinion :-)
Also cc @keturn @shirayu @johnowhitaker @jonatanklosko @joaoLages @pcuenca @patil-suraj @anton-l @kashif @sgrigory very interested in hearing your final thoughts!
Also, here is a Twitter poll: https://twitter.com/PatrickPlaten/status/1572980566579941381
I think I'm on team 'three classes'. If you want to do a quick demo of one application, it's easy to just load, say, text2img. If you want multiple, your example is very helpful; it'd be nice to document what the `sub_models` are and how to avoid loading duplicates like you do there. I also think a lot of people trying more complex things won't be loading three pipelines; they'll be defining their own sampling process anyway, where they can add all the extra features they want. That being said, I'm not the one having to maintain the code ;)
I understand the '3 classes' arguments but, as @leszekhanusz mentioned, having the code duplicated for all of them is very error-prone.
IMO, the solution proposed by @leszekhanusz of having the backbone class is what I thought of.
To exemplify:
```python
class CoreStableDiffusionPipeline:
    ...
    def _forward(self, prompt, init_image: Optional = None, mask_image: Optional = None, ...):
        ...

class StableDiffusionPipeline(CoreStableDiffusionPipeline):
    ...
    def __call__(self, prompt, ...):
        return self._forward(prompt, ...)

class StableDiffusionImg2ImgPipeline(CoreStableDiffusionPipeline):
    ...
    def __call__(self, prompt, init_image, ...):
        assert init_image is not None  # avoid incorrect class usage
        return self._forward(prompt, init_image, ...)

class StableDiffusionInpaintPipeline(CoreStableDiffusionPipeline):
    ...
    def __call__(self, prompt, init_image, mask_image, ...):
        assert init_image is not None and mask_image is not None  # avoid incorrect class usage
        return self._forward(prompt, init_image, mask_image, ...)
```
This way we would still keep the 3 classes separated and the code would be way easier to maintain. Do you guys see any disadvantages in this solution?
Yes! People now need to switch between multiple files, `CoreStableDiffusionPipeline` and `StableDiffusionInpaintPipeline`. Maintaining different files is really not that big of a problem and much less of a concern in OSS than in industry settings
Yes! People now need to switch between multiple files, `CoreStableDiffusionPipeline` and `StableDiffusionInpaintPipeline`.
If that's a problem, the `CoreStableDiffusionPipeline` subclasses could all be in the same file.
This solution is better for avoiding adding/removing functionality in one of the pipelines but not in the others; that's the way I see it.
As you say @patrickvonplaten, it is not really difficult to get the pipelines to share the same models now either. The non-hack way that already works is to just load each model separately via its subdirectory instead of from a pipeline. Maybe a simple example in the documentation is enough for most end-users in this regard, and merging the pipelines is overkill. It's just that one tends to get spoiled by the "one liner that just works" huggingface experience that we all know and love 😉
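Concretely, something like the following sketch, assuming the standard subfolder layout of the CompVis/stable-diffusion-v1-4 repo (the scheduler settings are the ones used elsewhere in this thread):

```python
from transformers import CLIPFeatureExtractor, CLIPTextModel, CLIPTokenizer
from diffusers import (
    AutoencoderKL,
    PNDMScheduler,
    StableDiffusionImg2ImgPipeline,
    StableDiffusionPipeline,
    UNet2DConditionModel,
)
from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker

model_id = "CompVis/stable-diffusion-v1-4"

# Load every sub-model exactly once, straight from its subfolder
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
safety_checker = StableDiffusionSafetyChecker.from_pretrained(model_id, subfolder="safety_checker")
feature_extractor = CLIPFeatureExtractor.from_pretrained(model_id, subfolder="feature_extractor")
scheduler = PNDMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear")

# Both pipelines share the very same module instances -> no duplicated memory
shared = dict(
    vae=vae, text_encoder=text_encoder, tokenizer=tokenizer, unet=unet,
    scheduler=scheduler, safety_checker=safety_checker, feature_extractor=feature_extractor,
)
text2img = StableDiffusionPipeline(**shared)
img2img = StableDiffusionImg2ImgPipeline(**shared)
```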
Maintaining different files is really not that big of a problem and much less of a concern in OSS than in industry settings
That hasn't been my experience. But there is a lot in this thread I could potentially ramble about; let me see if I can find a few things to focus on.
As for txt2img, img2img, and inpaint: I wouldn't want the interface for those three different tasks to all be squished into one function. But classes are collections of data and methods, and
- these three tasks have the same set of data that needs to be prepped beforehand (the tokenizer & text model, the autoencoder, and the diffusion model).
- these tasks have a very high degree of overlap in their implementation.
That suggests to me that however we define the interface to those three tasks, they should share almost all the same data structures and implementations, leaving us better able to focus on what the distinguishing factors between the tasks really are.
Side note: this practice of having a class with basically one method (`__call__` in this case) is not a super common pattern in my experience. I don't hate it, not at all, but you know you are allowed to define more than one public method on a class, right? 😉
Your example here:
```python
stable_diffusion = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
sub_models = {k: v for k, v in vars(stable_diffusion).items() if not k.startswith("_")}
text2img = StableDiffusionPipeline(**sub_models)
```
is super important! This is the thing that most of us who are new to the library do not understand at first. By now I understand it probably works like that, but there are still some aspects of it I'm wary of. How safe is it to share those models between tasks? Are they immutable? Do the "sub-models" depend on their siblings at all? E.g. if I want a different scheduler or feature extractor in my second pipeline, is anything going to be mis-configured or accidentally cross-referenced to the one from the first pipeline?
In any case, I think emphasizing that with a prominent example (with that function to replace the dict comprehension) would go a long way to resolving the "multiple pipeline instances are too expensive in load time + RAM" concerns.
having an official one-fits-it-all pipeline puts us then in a tough spot: "where do we draw the line", […]
This is a big thing I'm not clear on too. What's the intended scope of 🤗🧨diffusers?
There's obvious demand for an image-synthesis engine to provide the implementation for web apps, other gRPC clients, desktop apps like Krita/GIMP/Blender/Photoshop, etc. There's a bunch of common tasks they all want to do, and they're all going to be built by people who don't know which tasks share which submodels with the pipelines for which other tasks, and they don't really care. They just want some memory-efficient implementation that gives them access to the whole toolbox, with maybe a couple knobs to turn if there are significant speed/memory trade-offs available.
(I'm talking about image synthesis here, but I expect the same will apply for voice and music as soon as those audio models drop.)
Does 🧨diffusers want to be that? Or should we plan for that to be a different library instead? (and who is hiring me to build it? 😄)
There are several topics being discussed here.
If we want to prevent the problem that people may load modules in inefficient ways, I think the best course of action would be to document how to do it with a snippet similar to @patrickvonplaten's (loved it lol). And fully address the very reasonable concerns that were raised about potential interdependencies. I think this would also make it clearer how pipelines are made up of pluggable components, which would be very helpful and instructive.
Another debate seems to revolve around code repetition, which is mostly about maintainability in my opinion. I agree that from a pure software practice perspective it feels a little bit "cringy", but as Patrick explained it's a conscious compromise. Maintainers take the hit of having to repeat portions of the code in exchange for code clarity and, very importantly, future evolution. I'd rather work towards something like `AutoPipelineForText2Image` than trying to come up with perfect abstractions for the current code, and having to revise them every time a new task, feature or model is introduced. Inconsistencies should arise in tests, and that should be another place to put effort into.
There could exist an intermediate abstraction layer in the blocks and data structures used by pipelines, as @keturn pointed out, even if pipelines are still independent. This is a good point that is worth exploring, also keeping in mind that other modalities and tasks may exist in the future.
I may be mistaken, but I think it could be beneficial to advertise the existing pipelines as examples of how to use the building blocks, and work towards those "one liners that just work" that @okalldal mentioned. I'm not sure that a single complex pipeline object with a lot of parameters would help towards that goal.
+1 for factoring out shared functionality. Duplicate code is an issue, and it means there is work to be done in order to achieve good code quality here. DRY is a core best practice in software engineering.
I wanted to better understand where the current approach is coming from, so I went back and re-read the blog post in more detail. There is a reasonable motivation behind the original philosophy (making model code easily understandable for developers new to the library), but this case highlights two core issues:
- The philosophy does not adequately emphasize the drawbacks of duplicate code. It reads like "we don't have to follow DRY at all" whereas it should at most say "we are breaking DRY under very specific circumstances, and otherwise our codebase must still follow this principle."
- As a result, the philosophy is being applied far too broadly. The original article refers to shared logic for different models after development is complete, but here we see duplicate code for three different pipelines running the same model, which are still under development. This repo is not following the single model file policy anyways (e.g. U-Net is spread across multiple files that include implementation details for all three of these pipelines). Instead we just have duplicate pipeline code. The response that SD is not 1-to-1 with Transformers does not provide a concrete rationale for this duplicate code; if anything, that is another reason why the Transformers philosophy is inapplicable in this case.
If you have to keep the duplicate code in these pipelines, please at least add comments that clearly identify which blocks of code are duplicated, and which other files need to be updated when a developer changes one of them. The diffusers copy mechanism and better unit testing should be made priorities in order to support this, but in general, duplicate code is not the optimal solution here.
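For instance, a marker in the style of transformers' copy mechanism (the target path and method name below are hypothetical):

```python
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.encode_text
def encode_text(self, prompt):
    # Duplicated block: a consistency check in CI can verify that this body
    # stays identical to the source referenced in the marker above.
    text_inputs = self.tokenizer(prompt, return_tensors="pt")
    return self.text_encoder(text_inputs.input_ids)[0]
```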
Overall, the primary benefit of this philosophy is being undermined in Stable Diffusion's case. This benefit is to improve the experience of a developer that is new to the library, by making it easier to understand how a model works. However, as we can see from this discussion, it is now being applied too broadly and actually degrading the development experience instead. As to package users, better code quality will always be preferable from the standpoint of a downstream package, because we need to be able to trust that new releases will not introduce new bugs (when we upgrade). Regarding the points raised about the API, this design decision should not be coupled with any aspect of the API. I'm building a downstream package currently and will probably rebuild all of this pipeline functionality to avoid these issues. That is why the decision to sacrifice code quality does not make sense in this case.
Very nice ideas in this thread here!
To me the main concern seems to be about not factoring out shared functionality and code. I also agree that this could help in better understanding the differences. Happy to factor out the code that all pipelines share into a function and then make sure it stays identical across the other pipelines.
Big +1 on @pcuenca that pipelines are rather supposed to be examples. We hope that most downstream applications directly make use of `schedulers` and `models` instead of the very high level pipelines.
Will focus on two things from this PR:
- add a function to retrieve all components, and add better docs for this
- factor out a function that includes the shared common code, and add a copy mechanism so that the function can stay in the pipeline file while we ensure identical code across pipelines
@patrickvonplaten, awesome! Sounds good to me. Thank you for making these updates, and I'll make sure to follow the advice regarding examples / `schedulers` and `models` as well. Looking forward to checking out the refactored pipelines.
I've been away from Python for a while so I hope these two questions aren't too painful to answer.
- Why not a base Pipeline class that the other three can extend, that can contain duplicated code? My reasoning is that you could then look at each of the three pipeline files and the code inside them would be what makes them different to a Base/basic pipeline.
I see there is already a base class of sorts, which leaves me confused. It could just be that you needed to get to this point to see which code can actually be moved to the base class.
- Why does self-contained mean 1 file in this context but the 3 pipelines are already extending the base class (another file)?
PR #700, just before my comment, seems to be someone who has implemented a similar idea, albeit not passing tests quite yet.
Will wait with the refactor here until the schedulers API is updated: https://github.com/huggingface/diffusers/issues/336
It's odd seeing this entire discourse, having gone through the entire thing myself; it is one of the things that kept me away from ever using `diffusers`. Yes, the unified "Pipeline" makes sense. After having gyrated between `diffusers` and `CompVis`, being frustrated by the difficulty of adding new samplers to `diffusers` while many SoTA features/optimizations came out for `CompVis`, I just wrote my own unified "Pipeline" using `CompVis`.
https://github.com/AmericanPresidentJimmyCarter/stable-diffusion/blob/main/src/stable_inference/sampling.py#L358-L457
It's great, it's easy to use, and it's easy to manipulate if I need to play with it. I don't know why there's so much push-back against it.
Hey @leszekhanusz, @keturn, @shirayu and everybody who is interested to review!
1.) - I've made a first draft of a "One-for-all-tasks Stable Diffusion" pipeline here: https://github.com/huggingface/diffusers/pull/821 . I'd be very happy if you could review the PR and give feedback on the naming (think "mega" is not a great name, but couldn't think of anything better).
Note: This is orthogonal to:
2.) - https://github.com/huggingface/diffusers/issues/551#issuecomment-1259415393, which I will try to tackle early next week.
So 2.) is more about more explicit and maintainable code, and 1.) is to satisfy the "all-in-one" pipeline use case.
Keen to hear your thoughts :-)
@leszekhanusz I agree with you completely. It looks like the pipeline is divided and organized, but when you actually use it, the UX is obviously bad.
@kosukekurimoto thanks for the feedback, could you share some more detail on what is bad? Happy to try to make it cleaner next week!
@patrickvonplaten It uses extra memory if the classes are separated😇
e.g.
```python
class Predictor(BasePredictor):
    def setup(self):
        """Load the model into memory to make running multiple predictions efficient"""
        print("Loading pipeline...")
        scheduler = PNDMScheduler(
            beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear"
        )
        self.pipeImg2Img = StableDiffusionImg2ImgPipeline.from_pretrained(
            "CompVis/stable-diffusion-v1-4",
            scheduler=scheduler,
            revision="fp16",
            torch_dtype=torch.float16,
            cache_dir=MODEL_CACHE,
            local_files_only=True,
        ).to("cuda")
        self.pipeInPaint = StableDiffusionInpaintPipeline.from_pretrained(
            "CompVis/stable-diffusion-v1-4",
            scheduler=scheduler,
            revision="fp16",
            torch_dtype=torch.float16,
            cache_dir=MODEL_CACHE,
            local_files_only=True,
        ).to("cuda")

    def predict(〜):
        if predict_type == "text2img":
            # use stable-diffusion model
            self.pipeText2Img.scheduler = scheduler
            output = self.pipeText2Img(
                prompt=[prompt] * num_outputs if prompt is not None else None,
                width=width,
                height=height,
                guidance_scale=guidance_scale,
                generator=generator,
                num_inference_steps=num_inference_steps,
            )
        elif predict_type == 'img2img':
            # use material-stable-diffusion model
            self.pipeImg2Img.scheduler = scheduler
            output = self.pipeImg2Img(
                prompt=[prompt] * num_outputs if prompt is not None else None,
                init_image=init_image,
                strength=prompt_strength,
                num_inference_steps=num_inference_steps,
                guidance_scale=guidance_scale,
                generator=generator,
            )
        〜
```
Hey everybody,
We've now merged the "all-in-one" community pipeline which doesn't require any additional memory to run all pipelines. Please have a look at: https://github.com/huggingface/diffusers/tree/main/examples/community#stable-diffusion-mega
- Components PR to make it easier to save memory between pipelines: https://github.com/huggingface/diffusers/pull/889
Will try to tackle the main "factor out functions" PR tomorrow
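With the components PR, sharing weights across pipelines should reduce to something like this (a sketch based on that PR's description):

```python
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

# Load the weights once...
text2img = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
# ...then build sibling pipelines from the same components, at no extra memory cost
img2img = StableDiffusionImg2ImgPipeline(**text2img.components)
```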