
[SD3] vae.config.shift_factor missing in dreambooth training examples

Open bendanzzc opened this issue 1 year ago • 6 comments

Describe the bug

shift_factor is missing in the training code: https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora_sd3.py#L1617, but it is used in the inference code: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py#L893. Is it reasonable that when training SD3 we do not need to normalize the latents using vae.config.shift_factor and vae.config.scaling_factor?
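For context, the two transforms should be exact inverses of each other. A minimal sketch with illustrative numbers (the real values live in vae.config; the constants below only approximate the SD3 VAE config):

```python
import torch

# Illustrative values only -- the real numbers come from vae.config for the
# SD3 VAE (approximately scaling_factor=1.5305, shift_factor=0.0609).
scaling_factor, shift_factor = 1.5305, 0.0609
latents = torch.randn(1, 16, 64, 64)  # SD3's VAE has 16 latent channels

# Training side: apply the forward normalization after vae.encode(...).
z = (latents - shift_factor) * scaling_factor

# Inference side (the linked pipeline line): invert it before vae.decode(...).
latents_back = z / scaling_factor + shift_factor

assert torch.allclose(latents_back, latents, atol=1e-5)  # round-trip identity
```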

Thanks.

Reproduction

None

Logs

No response

System Info

None

Who can help?

@sayakpaul

bendanzzc avatar Jun 26 '24 07:06 bendanzzc

Good observation! Thank you for bringing this up!

Yeah, ideally, a reversal of the following would be needed: https://github.com/huggingface/diffusers/blob/0f0b531827900d805f8d2d0a42c1040a1e34bf07/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py#L893
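A minimal sketch of that reversal as it might look in the training script, assuming the standard diffusers AutoencoderKL API (encode_for_training is a hypothetical helper name, not something from the example):

```python
import torch
from diffusers import AutoencoderKL

def encode_for_training(vae: AutoencoderKL, pixel_values: torch.Tensor) -> torch.Tensor:
    """Encode images and apply the normalization that the pipeline later inverts.

    The inference line referenced above computes
        latents / scaling_factor + shift_factor
    so training should feed the model
        (latents - shift_factor) * scaling_factor.
    """
    latents = vae.encode(pixel_values).latent_dist.sample()
    return (latents - vae.config.shift_factor) * vae.config.scaling_factor
```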

Do you want to give it a try and open a PR, perhaps? Happy to help you through the process.

sayakpaul avatar Jun 26 '24 07:06 sayakpaul

Thanks, I'd like to try.

bendanzzc avatar Jun 26 '24 07:06 bendanzzc

Lovely. Thanks so much.

sayakpaul avatar Jun 26 '24 08:06 sayakpaul

I've just implemented this in my own training code, which is largely based on the diffusers example, and it does seem to noticeably help with image crispness in some with/without tests on the same training data (though with non-deterministic choice of seed, image ordering, prompt shuffling, etc.).

CodeExplode avatar Jun 29 '24 10:06 CodeExplode

Do you wanna show some comparisons?

sayakpaul avatar Jun 29 '24 10:06 sayakpaul

I only kept one image, sorry. I tried training on the character Ahsoka as the toughest example in my dataset.

The left is training without handling shift; the right is with handling shift, on approximately the same prompt (with some shuffling) at about the same number of steps. Without applying shift, the previews all looked blurry like the left sample (after a few epochs), whereas with shift there was a mix of blurry and crisp previews, so it seemed to be helping. The samples were always generated with shift from the start, as they use different code.

[Image: SD3_AhsokaTrainingExample — side-by-side comparison of training samples without (left) and with (right) shift handling]

This was full finetuning, rather than LoRA training.

CodeExplode avatar Jun 29 '24 10:06 CodeExplode

Fixed in https://github.com/huggingface/diffusers/pull/8917/

sayakpaul avatar Jul 25 '24 07:07 sayakpaul