[Community] ResAdapter training script

a-r-r-o-w opened this pull request · 16 comments

What does this PR do?

Adds ResAdapter to diffusers community features.

Fixes #7243.

Paper: https://arxiv.org/abs/2403.02084
Code: https://github.com/bytedance/res-adapter
Project page: https://res-adapter.github.io/

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [ ] Did you read the contributor guideline?
  • [ ] Did you read our philosophy doc (important for complex PRs)?
  • [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@jiaxiangc @sayakpaul @DN6 @yiyixuxu @rootonchair

a-r-r-o-w · Mar 16 '24

@jiaxiangc With LoRA rank=8, I get around 0.5M trainable parameters, which seems consistent with the numbers in the paper. I've followed your suggestions from here and believe I've implemented them correctly, so it would be awesome if you could review when you're free.
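For reference, the ~0.5M figure can be sanity-checked by hand: a LoRA adapter on a linear layer of shape (d_out, d_in) adds rank * (d_in + d_out) trainable parameters. A minimal sketch (the layer shapes below are hypothetical, not the actual RV5.1 UNet shapes):

```python
def lora_param_count(layer_shapes, rank):
    """Total trainable parameters added by LoRA adapters.

    For a linear layer of shape (d_out, d_in), LoRA adds two low-rank
    matrices: A with shape (rank, d_in) and B with shape (d_out, rank),
    i.e. rank * (d_in + d_out) parameters per adapted layer.
    """
    return sum(rank * (d_in + d_out) for d_out, d_in in layer_shapes)


# Hypothetical example: four 320x320 attention projections
shapes = [(320, 320)] * 4
print(lora_param_count(shapes, rank=8))  # 4 * 8 * (320 + 320) = 20480
```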

I am currently running a training experiment on an A6000 with 1000 training samples, but the results do not look good. I suspect the dataset is to blame: based on manual inspection, the split I'm using contains images that are very different from, and lower quality than, what RealisticVision5.1 was finetuned on. The resolution of the images also seems somewhat limited:

Script
#!/usr/bin/bash

python3 train_resadapter.py \
  --pretrained_model_name_or_path krnl/realisticVisionV51_v51VAE \
  --dataset_name poloclub/diffusiondb \
  --dataset_config_name 2m_random_1k \
  --image_column image \
  --caption_column prompt \
  --validation_prompt "beautiful face, youthful appearance, ultra focus, face iluminated, face detailed, ultra focus, dreamlike images, pixel perfect precision, ultra realistic;Award-winning photo of a mystical fox girl fox in a serene forest clearing, sunlight" \
  --validation_prompt_sep ";" \
  --num_validation_images 5 \
  --validation_epochs 1 \
  --validation_heights 256 384 768 768 1024 \
  --validation_widths 256 832 768 1280 1024 \
  --validation_inference_steps 40 \
  --output_dir sd-resadapter \
  --cache_dir . \
  --seed 42 \
  --nearest_resolution_multiple 64 \
  --random_flip \
  --train_batch_size 4 \
  --num_train_epochs 20 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-5 \
  --adam_beta1 0.95 \
  --adam_beta2 0.99 \
  --checkpointing_steps 500 \
  --rank 16 \
  --report_to wandb

Relevant logs:

03/16/2024 19:33:13 - INFO - __main__ - Found 31 unique image sizes
03/16/2024 19:33:13 - INFO - __main__ - Keys: [torch.Size([3, 640, 640]), torch.Size([3, 512, 1024]), torch.Size([3, 512, 512]), torch.Size([3, 704, 512]), torch.Size([3, 512, 768]), torch.Size([3, 1024, 1024]), torch.Size([3, 768, 512]), torch.Size([3, 512, 896]), torch.Size([3, 512, 704]), torch.Size([3, 896, 512]), torch.Size([3, 640, 512]), torch.Size([3, 832, 512]), torch.Size([3, 512, 640]), torch.Size([3, 832, 1024]), torch.Size([3, 576, 1024]), torch.Size([3, 384, 832]), torch.Size([3, 768, 768]), torch.Size([3, 512, 832]), torch.Size([3, 1024, 512]), torch.Size([3, 704, 1024]), torch.Size([3, 512, 960]), torch.Size([3, 512, 1088]), torch.Size([3, 832, 832]), torch.Size([3, 512, 1472]), torch.Size([3, 576, 576]), torch.Size([3, 768, 1024]), torch.Size([3, 1024, 704]), torch.Size([3, 768, 1280]), torch.Size([3, 1024, 768]), torch.Size([3, 1024, 832]), torch.Size([3, 832, 640])]

Training logs: https://wandb.ai/aryanvs/text2image-fine-tune/runs/ogmrnvve

a-r-r-o-w · Mar 16 '24

Hi @a-r-r-o-w, thank you for the good work. I also think 1000 images is too small to achieve good results. I would suggest using a larger dataset like CC12M. Besides, have you implemented a probability function to choose different resolutions, as mentioned by the author in Sec 4.3?
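For what it's worth, such a probability function could be as simple as a weighted draw over target resolutions; a rough sketch (the resolution list and weights below are made up, not the paper's scheme):

```python
import random


def sample_resolution(resolutions, weights, rng=random):
    """Draw one target (height, width) with the given probabilities,
    e.g. to oversample the resolutions that matter most for training."""
    return rng.choices(resolutions, weights=weights, k=1)[0]


# Made-up example: mostly train at 512x512, occasionally at larger sizes
resolutions = [(512, 512), (768, 768), (1024, 1024)]
weights = [0.6, 0.3, 0.1]
height, width = sample_resolution(resolutions, weights)
```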

rootonchair · Mar 17 '24

Hey. I have not implemented a probability function for selecting ratios as such. See the SizeBatchSampler implementation, which creates buckets of different image sizes. Before bucketing, images are preprocessed to the nearest resolution multiple of 64 (in my training run above), which helps reduce the number of buckets. From each bucket, batch-size-many images are selected, and once we have all batches, they are randomly shuffled.
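To illustrate the idea (a simplified sketch, not the actual SizeBatchSampler code; the helper names are made up):

```python
import random
from collections import defaultdict


def round_to_multiple(x, multiple=64):
    # Snap a dimension to the nearest multiple, e.g. 509 -> 512, 766 -> 768.
    return max(multiple, round(x / multiple) * multiple)


def make_size_batches(image_sizes, batch_size, multiple=64, seed=42):
    """Group image indices into buckets of identical (rounded) size,
    draw fixed-size batches from each bucket, then shuffle all batches
    so consecutive batches come from random buckets."""
    buckets = defaultdict(list)
    for idx, (h, w) in enumerate(image_sizes):
        key = (round_to_multiple(h, multiple), round_to_multiple(w, multiple))
        buckets[key].append(idx)

    batches = []
    for indices in buckets.values():
        for i in range(0, len(indices), batch_size):
            batch = indices[i:i + batch_size]
            if len(batch) == batch_size:  # drop incomplete trailing batches
                batches.append(batch)

    random.Random(seed).shuffle(batches)
    return batches
```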

Btw, training this on an A6000/RTX 4090 takes quite long tbh, which makes it hard to test with different datasets. I doubt CC12M will have a significant impact here, but feel free to try since I have limited compute. The authors train for 20000 steps over a single epoch of LAION-5B, iirc with batch size 32. I tried a run with 50000 images (49 unique image-size buckets) for 2 epochs, which was about 4000 steps (and took ~2.5 hours), but the results still didn't match the authors' results on RV5.1. It was definitely starting to get better, though, so I think more images should help.

a-r-r-o-w · Mar 17 '24

Can we add this to research_projects folder to begin with?

sayakpaul · Mar 17 '24

Can we add this to research_projects folder to begin with?

Sure! Btw, if this, or any research project in general, is of interest to the HF team, would it be possible to grant compute for finetuning and creating improved checkpoints? AFAIK diffusers is meant to be an accumulation of research that is widely used within the community, and training/finetuning is not as high a priority as replicating inference behavior, so I suspect the answer is no. But if yes, I'd be really happy to get access to higher-end chips for replicating open research and working on training scripts, as it would help me understand how things work in practice :) The reason I ask is that there are many issues on research repos asking for training code, and the requests are simply ignored, possibly due to company policies. I believe many people would like to adapt the unreleased training code for their own use cases, or just use it to openly replicate results and pursue ideas for improvements.

a-r-r-o-w · Mar 20 '24

That is also a bit contextual. For high value-add papers (where we have a certain artifact), we can try providing compute grants.

ADD from Stability AI is a good example of this. They made the model open but probably didn't release the code.

sayakpaul · Mar 20 '24

(quoting the full training setup and logs from the Mar 16 comment above)

Hi, where can I see your code? I can give you some details.

jiaxiangc · Mar 20 '24

@a-r-r-o-w Maybe you can make the group norm layers trainable. Finetune on a general model, such as SD1.5. Make sure your dataset resolutions are right.

Best.

jiaxiangc · Mar 20 '24

I think I've been able to somewhat replicate the effect of ResAdapter with my last training run on 50k images. The problem I'm facing is that it is very difficult to train on a large number of images (>100k) on an A6000/4090 instance for 1 or 2 epochs. But otherwise, I believe the script is ready and only introduces minimal changes.

a-r-r-o-w · Mar 20 '24

I think before that it needs to be moved to the research_projects folder?

sayakpaul · Mar 21 '24

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Apr 26 '24

@a-r-r-o-w is this good to be merged?

sayakpaul · Jun 30 '24

@a-r-r-o-w is this good to be merged?

@sayakpaul I think it should be good to merge now. However, I would like to note that I haven't fully reproduced the paper results because of limited compute, though the effect is clearly noticeable within 1-2 epochs (50k images): structure improves and repetition at large heights/widths decreases over time. The authors train on much, much more data, which is infeasible for me at the moment.

a-r-r-o-w · Jun 30 '24

I also did a quick training run for 1 epoch, which just completed, to verify the latest changes. The results are for runwayml/stable-diffusion-v1-5 on the first 50k images of LAION-coco-aesthetic.

[Image grid: before training vs. after 1st epoch]

Unfortunately, I can't share training logs or checkpoints due to reasons. Would love to see people from the community pick it up if there's interest.

a-r-r-o-w · Jun 30 '24

@sayakpaul hey, addressed the previous comments. could you review again? thanks!

also the old failing tests seemed unrelated

a-r-r-o-w · Jul 01 '24

failing tests seem to be unrelated so far i think

a-r-r-o-w · Jul 05 '24
