
[SD3] vae.config.shift_factor missing in dreambooth training examples

Open bendanzzc opened this issue 1 year ago • 6 comments

Describe the bug

shift_factor is missing in the training code: https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora_sd3.py#L1617, but it is used in the inference code: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py#L893. Is it reasonable that when training SD3 we do not need to normalize the latents using vae.config.shift_factor and vae.config.scaling_factor?
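For context, the two transforms should be exact inverses of each other. A minimal sketch with illustrative numbers (the real values live in vae.config; the constants below only approximate the SD3 VAE config):

```python
import torch

# Illustrative values only -- the real numbers come from vae.config for the
# SD3 VAE (approximately scaling_factor=1.5305, shift_factor=0.0609).
scaling_factor, shift_factor = 1.5305, 0.0609
latents = torch.randn(1, 16, 64, 64)  # SD3's VAE has 16 latent channels

# Training side: apply the forward normalization after vae.encode(...).
z = (latents - shift_factor) * scaling_factor

# Inference side (the linked pipeline line): invert it before vae.decode(...).
latents_back = z / scaling_factor + shift_factor

assert torch.allclose(latents_back, latents, atol=1e-5)  # round-trip identity
```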

Thanks.

Reproduction

None

Logs

No response

System Info

None

Who can help?

@sayakpaul

bendanzzc avatar Jun 26 '24 07:06 bendanzzc

Good observation! Thank you for bringing this up!

Yeah, ideally, a reversal of the following would be needed: https://github.com/huggingface/diffusers/blob/0f0b531827900d805f8d2d0a42c1040a1e34bf07/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py#L893
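A minimal sketch of that reversal as it might look in the training script, assuming the standard diffusers AutoencoderKL API (encode_for_training is a hypothetical helper name, not something from the example):

```python
import torch
from diffusers import AutoencoderKL

def encode_for_training(vae: AutoencoderKL, pixel_values: torch.Tensor) -> torch.Tensor:
    """Encode images and apply the normalization that the pipeline later inverts.

    The inference line referenced above computes
        latents / scaling_factor + shift_factor
    so training should feed the model
        (latents - shift_factor) * scaling_factor.
    """
    latents = vae.encode(pixel_values).latent_dist.sample()
    return (latents - vae.config.shift_factor) * vae.config.scaling_factor
```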

Do you want to give it a try and open a PR, perhaps? Happy to help you through the process.

sayakpaul avatar Jun 26 '24 07:06 sayakpaul

Thanks, I'd like to try.

bendanzzc avatar Jun 26 '24 07:06 bendanzzc

Lovely. Thanks so much.

sayakpaul avatar Jun 26 '24 08:06 sayakpaul

I've just implemented this in my own training code, which is largely based on the diffusers example, and it does seem to noticeably help with image crispness in some with/without tests on the same training data (though with non-deterministic choice of seed, image ordering, prompt shuffling, etc.).

CodeExplode avatar Jun 29 '24 10:06 CodeExplode

Do you wanna show some comparisons?

sayakpaul avatar Jun 29 '24 10:06 sayakpaul

I only kept one image, sorry. I tried training on the character Ahsoka as the toughest example in my dataset.

The left is training without handling shift; the right is with handling shift, on approximately the same prompt (with some shuffling) at about the same number of steps. Without applying shift, the previews all looked blurry like the left sample (after a few epochs), whereas with shift there was a mix of blurry and crisp previews, so it seemed to be helping. The samples were always generated with shift from the start, as they use different code.

[Image: SD3_AhsokaTrainingExample — side-by-side comparison of training samples without (left) and with (right) shift handling]

This was full finetuning, rather than LoRA training.

CodeExplode avatar Jun 29 '24 10:06 CodeExplode

Fixed in https://github.com/huggingface/diffusers/pull/8917/

sayakpaul avatar Jul 25 '24 07:07 sayakpaul