diffusers Let prediction type be a property on unet instead of the sampler

What API design would you like to have changed or added to the library? Why?

I think the prediction type is a property of the unet since it describes what it is outputting. It would be nice if it was part of the unet config rather than the sampler. I am loading the different parts of stable diffusion v2 separately and having the prediction type in the scheduler was unexpected for me.

What use case would this enable or better enable? Can you give us a code example?

Instead of

sampler.config.prediction_type  # "v_prediction" or "eps"

It would be nice if it was a property of the unet e.g.

unet.prediction_type  # "v_prediction" or "eps"

Nov 27 '22 20:11 samedii

What API design would you like to have changed or added to the library? Why?

I think the prediction type is a property of the unet since it describes what it is outputting. It would be nice if it was part of the unet config rather than the sampler. I am loading the different parts of stable diffusion v2 separately and having the prediction type in the scheduler was unexpected for me.

What use case would this enable or better enable? Can you give us a code example?

Instead of
sampler.config.prediction_type  # "v_prediction" or "eps"
It would be nice if it was a property of the unet e.g.
unet.prediction_type  # "v_prediction" or "eps"

Nov 29 '22 07:11 Santanathegreat

Hey @samedii,

I think it could make sense to have prediction_type to be part both of the scheduler and the model. The scheduler needs to know the type to use the correct formula, the unet should know the type since it was trained with this criterion. However the problem with adding it to the unet is that it's not inherently an urgument of the model, rather just the training criterion and the exact same model architecture could be used for both v-prediciton and eps-prediction so I'm worried about people forgetting to update this config argument when fine-tuning etc...

@anton-l @patil-suraj @pcuenca what do you think?

Dec 01 '22 16:12 patrickvonplaten

I'm not sure what a good level of abstraction is. At the moment the Unet is a building block has no identifying information or any idea about how it's been trained. Maybe that's fine.

A positive of if the prediction type was in the Unet is that you could have a method like .predicted_noise that abstracts away the conversion logic from the schedulers.

Good point about people possibly fine tuning with another criterion. I don't think putting the value in the scheduler really fixes that though.

At the moment I solved this internally by wrapping the scheduler and denoiser together in my own class. This feels like a reasonable solution. It's mostly saving and loading that is painful.

I am trying to use the modules separately rather than via the pipelines so the structure felt unintuitive for me at first but maybe it's okay. (i.e. it doesn't make sense to load the denoiser without also loading the scheduler to find out what the denoiser is)

Dec 01 '22 18:12 samedii

At the moment, I cannot think of a better design here either. Note that SD 2 768 was fine-tuned in v-prediction from SD 2 base is epsilon mode (see model card here)

So here it would be problematic if people "forgot" to update the unet criterion after SD 2 base epsilon mode training.

Dec 02 '22 16:12 patrickvonplaten

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Dec 28 '22 15:12 github-actions[bot]