Add aspect ratio bucketing to training scripts
Is your feature request related to a problem? Please describe.
When fine-tuning SDXL, images are required to be a fixed size (1024x1024), which involves a lot of cropping that both takes time/resources and often causes important parts of the image to be cropped out, lowering model quality.
Describe the solution you'd like.
The ideal solution would be a simple option for the user to enable aspect ratio bucketing (e.g. a command-line argument --enable-bucketing) that lets them train with multiple image sizes.
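For illustration, the kind of option I have in mind might look something like this in the script's argument parser (just a sketch; --enable-bucketing does not exist in the diffusers scripts today):

```python
# Hypothetical sketch of how the requested flag could be exposed in a
# training script's argument parser. This flag does not exist in diffusers.
import argparse

parser = argparse.ArgumentParser(description="SDXL fine-tuning (sketch)")
parser.add_argument(
    "--enable-bucketing",
    action="store_true",
    help="Group training images into aspect-ratio buckets instead of resizing/cropping everything to one square resolution.",
)
args = parser.parse_args()

if args.enable_bucketing:
    # The dataset/dataloader would be built with per-bucket batching here.
    pass
```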
that's really not something that can be added to these scripts without totally rewriting them.
it is the goal of https://github.com/bghira/simpletuner to provide a Diffusers-centric training toolkit that implements aspect bucketing and other optimisations, including data bucketing, pure-bf16 training, multi-gpu support, pre-training embed caching, and more.
Thanks! I will take a look at your source code for simpletuner, and see if that helps me understand how to do it. I'm still trying to wrap my head around the concepts surrounding bucketing and size/cropping considerations during training.
It is my understanding that aspect ratio bucketing / size conditioning were at the core of how SDXL was trained in the first place. In the SDXL paper, they say:
"Real-world datasets include images of widely varying sizes and aspect-ratios While the common output resolutions for text-to-image models are square images of 512 x 512 or 1024 x 1024 pixels, we argue that this is a rather unnatural choice, given the widespread distribution and use of landscape (e.g., 16:9) or portrait format screens. Motivated by this, we finetune our model to handle multiple aspect-ratios simultaneously: We follow common practice and partition the data into buckets of different aspect ratios, where we keep the pixel count as close to 1024² pixels as possibly, varying height and width accordingly in multiples of 64. [...] During optimization, a training batch is composed of images from the same bucket, and we alternate between bucket sizes for each training step. Additionally, the model receives the bucket size (or, target size) as a conditioning, represented as a tuple of integers C_ar=(h,w) which are embedded into a Fourier space in analogy to the size- and crop-conditionings described above. In practice, we apply multi-aspect training as a finetuning stage after pretraining the model at a fixed aspect-ratio and resolution and combine it with the conditioning techniques"
... so I am very surprised if there is really no way to easily do this with diffusers training for SDXL.
If it's not possible, then I think that incorporating easy aspect ratio bucketing into diffusers would be a huge benefit to users of the library. It would make dataset management massively easier for anyone who has mixed size/resolution images they want to train SDXL on, and would improve model quality by removing noise introduced from cropping errors.
I would be interested to explore what exactly would need to be changed to make this possible, because it seems to me like kind of a core feature needed to work with SDXL, and a lot of model quality/flexibility is lost by forcing users to crop images into squares.
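For reference, here is my rough understanding of how that size/crop/target conditioning tuple from the paper gets wired up during SDXL training (just a sketch with illustrative values, not the exact code from the diffusers scripts):

```python
import torch

# Illustrative sketch of SDXL's additional conditioning: a tuple of
# (original_size, crop_top_left, target_size) that the UNet embeds with
# Fourier/sinusoidal features alongside the timestep.
original_size = (768, 1152)   # size of the source image before any resizing
crop_top_left = (0, 0)        # top-left corner of the crop that was taken
target_size = (832, 1216)     # the bucket resolution this sample is trained at

add_time_ids = torch.tensor(
    [list(original_size + crop_top_left + target_size)], dtype=torch.float32
)
# add_time_ids would then be passed to the UNet (together with the pooled
# text embedding) via added_cond_kwargs={"text_embeds": ..., "time_ids": add_time_ids}.
```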
it's been asked for (by me, even) but the consensus currently is that the example training scripts are just that - examples, and they can be forked and extended to add these features
the problem with aspect bucketing is that it's not trivial. images have to be the same size in a single batch, and for typical (e.g. single-subject dreambooth) finetuning on downstream tasks, the aspect buckets just aren't that important - especially for SDXL, which has additional microconditioning inputs at inference time that specify the aspect ratios you want.
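for example, something like this at inference time (a rough sketch; original_size / target_size / crops_coords_top_left are the microconditioning inputs i mean):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# the microconditioning inputs tell the model what size/crop regime to emulate
image = pipe(
    "a photo of a corgi on the beach",
    height=832,
    width=1216,
    original_size=(832, 1216),
    target_size=(832, 1216),
    crops_coords_top_left=(0, 0),
).images[0]
```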
for very large training tasks where aspect bucketing makes sense, then you begin to run into scale issues which the example script is not designed for.
- it will have to actually read in all of the images in order to generate an aspect bucket list, which is fine for small datasets, but those are exactly the datasets that don't need it.
- it will have to either store this somewhere or re-do it upon startup for every resumed training run
- once you get aspect bucketing for the training loop going, you'll almost invariably want to add batching to the vae embed pre-cache task, and then to the text encoder inputs, because these will slow down the pre-processing of very large training runs.
- all of those caches have to be stored to disk somewhere, and kept track of. you'll have to scan at startup to find the differences. if any new images arrived, we have to check that they are scanned.
- any stored objects on disk will have to have their random crop coordinate augmentation values saved somewhere so that they can be reused effectively (a rough sketch of this kind of bookkeeping follows this list)
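as an illustration of the kind of bookkeeping i mean, something like this per-image manifest (a made-up format, not what any existing trainer actually writes):

```python
import json
from pathlib import Path

# made-up example of per-image state a bucketing pipeline has to persist and
# re-validate on every resumed run: assigned bucket, crop coordinates, and the
# location of any cached vae latents / text-encoder outputs.
manifest = {
    "images/0001.jpg": {
        "bucket": [832, 1216],
        "crop_top_left": [14, 0],
        "vae_latent_cache": "cache/vae/0001.pt",
        "text_embed_cache": "cache/text/0001.pt",
    },
}
Path("bucket_manifest.json").write_text(json.dumps(manifest, indent=2))

# on startup it has to be reloaded, diffed against what's on disk, and anything
# new or missing has to be re-scanned or invalidated.
loaded = json.loads(Path("bucket_manifest.json").read_text())
```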
it's a hard problem and at this point i understand why it's not yet solved in the example training scripts. but it doesn't exclude a future project from the team that would essentially create a transformers-like Trainer module which can do these kinds of data pipeline tasks efficiently and reliably.
I really appreciate you taking the time to explain the reasoning behind this. There is not a lot of information about how aspect ratio bucketing works available online, so it is hard to understand what challenges one would encounter with it. I would be interested in contributing however I can to helping write training scripts that incorporate a simple bucketing scheme.
I feel like some of the issues you mentioned could be mitigated by just operating on the assumption that the user will not modify the training set between runs (making this clear in documentation/comments), and letting the users deal with caching, etc.
The bucketing script would ONLY deal with the bucketing process - simply scanning through image metadata to get resolutions, and placing each image into the fixed set of buckets that were originally used to train SDXL (a rough sketch of what I mean follows).
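Something along these lines is what I have in mind (just a sketch; the bucket list here is a truncated, illustrative set of SDXL-style ~1024² resolutions in multiples of 64, not the exact published table):

```python
from pathlib import Path
from PIL import Image

# Illustrative subset of SDXL-style buckets (~1024^2 pixels, multiples of 64),
# stored as (width, height).
BUCKETS = [
    (1024, 1024), (1152, 896), (896, 1152), (1216, 832),
    (832, 1216), (1344, 768), (768, 1344), (1536, 640), (640, 1536),
]

def assign_bucket(path):
    # Image.open only reads the header here, so getting .size is cheap
    # compared to decoding the full image.
    with Image.open(path) as img:
        w, h = img.size
    aspect = w / h
    # Pick the bucket whose aspect ratio is closest to the image's.
    return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - aspect))

bucket_to_paths = {}
for path in Path("train_data").glob("**/*.jpg"):  # hypothetical dataset directory
    bucket_to_paths.setdefault(assign_bucket(path), []).append(str(path))
```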
So in short, what if we made the following assumptions:
(a) the user cannot modify the training set after buckets are assigned, without breakage
(b) the user handles any caching/state-saving logic on their own, unless they want to redo it with each run
(c) we just use the bucket sizes that were originally used to train SDXL instead of having complex code to determine dataset-specific bucket classes
Would that not make it much simpler to implement a basic bucketing scheme and address your concerns about complexity of handling caching, etc?
You are right that this would still mean running through the entire dataset for each training run, but I don't think it would be that costly to just go through image metadata to extract dimensions. For instance, right now I have a ~750,000 image dataset that I'm trying to use as training data for a full SDXL fine tune. The time that it will take me to scan through and get image dimensions and sort these into buckets will be no more than the cost of having to go through and process all 750k images to convert them into 1024x1024 squares. And if the user wants to, the result of this bucketing process could be easily stored on disk (mapping of files to buckets, cropping/scaling info, etc), with the understanding that if they modify the training dataset afterwards, that the buckets would have to be regenerated.
I am new to this, and am aware that I might be missing something, or that something you're saying might be going over my head. I am just trying to understand the situation better, so that I can hopefully contribute to writing some code that might be helpful to others like myself who just want a very basic bucketing scheme available.
all of the stuff you describe is already in kohya trainer or simpletuner, and i promise you it's really not something the diffusers project is currently interested in working on. @patil-suraj and @sayakpaul can elaborate
all of the assumptions you want to make end up being really difficult to work with. i know this because simpletuner has options to preserve caches and all of that
the square crops can be generated on-the-fly and you don't have to scan the whole dataset to know the true image sizes 🤷 because they are all the same aspect ratio, 1.0
Have you made any progress on the data bucketing? I'm looking forward to learning how to accomplish this complex work.
I'm curious why you want to do this instead of using existing solutions like SimpleTuner or sd-scripts, and if you want a GUI you can use OneTrainer.
If you want to learn how to do it without all the complex stuff, i.e. only for small datasets, the only parts you need to add to the script are (a rough sketch follows this list):
- Add buckets with all the aspect ratios you want.
- Add the images to the corresponding buckets.
- Code a sampler that draws each batch's images from a single bucket.
- Deal with the case where the last batch is not full.
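A minimal sketch of those steps (for learning only; it assumes you already have a bucket assigned to every dataset index, and the class name is just illustrative):

```python
import random
from torch.utils.data import Sampler

class BucketBatchSampler(Sampler):
    """Yields batches of dataset indices where every index in a batch
    belongs to the same aspect-ratio bucket, so the images can be stacked."""

    def __init__(self, bucket_of_index, batch_size, drop_last=True):
        # bucket_of_index[i] is the bucket (e.g. an (h, w) tuple) of dataset item i.
        self.batch_size = batch_size
        self.drop_last = drop_last
        self.buckets = {}
        for idx, bucket in enumerate(bucket_of_index):
            self.buckets.setdefault(bucket, []).append(idx)

    def __iter__(self):
        batches = []
        for indices in self.buckets.values():
            random.shuffle(indices)
            for i in range(0, len(indices), self.batch_size):
                batch = indices[i : i + self.batch_size]
                # Drop (or keep) the last, not-full batch of each bucket.
                if len(batch) == self.batch_size or not self.drop_last:
                    batches.append(batch)
        random.shuffle(batches)  # alternate between buckets across steps
        yield from batches

    def __len__(self):
        if self.drop_last:
            return sum(len(v) // self.batch_size for v in self.buckets.values())
        return sum(-(-len(v) // self.batch_size) for v in self.buckets.values())

# Usage: pass it as batch_sampler so the DataLoader batches within one bucket.
# loader = DataLoader(dataset, batch_sampler=BucketBatchSampler(bucket_of_index, 4))
```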
But this is only useful for learning and for small datasets, like for a LoRA, where aspect ratios don't do much anyway. For big datasets you need to do a lot more than this, as @bghira suggested.
If you look at the code of the other trainers you can see that it is a lot bigger and more complex than the training scripts here; doing the same here would defeat the purpose of the training scripts, which are meant to be learning material and easy to understand.
I'm curious why you want to do this instead of using existing solutions like SimpleTuner or sd-scripts, and if you want a GUI you can use OneTrainer.
Because I don't want to package a full-featured training toolkit with my application when all I need is a very small subset of the functionality (I literally just need what's already available in diffusers + aspect ratio bucketing). Also, my application requires the flexibility of writing custom diffusers training code. These kinds of reasons are exactly why diffusers has training functions available, after all.
Most of the people who are choosing to write their own SDXL training code with diffusers will need aspect ratio bucketing. Generally, when most of the users of a library will need to implement the same code whenever they use the library, that code is a good candidate for inclusion in the library itself.
Thanks for your explanation, I appreciate it; it helps me understand, and I'll take note of it for the future.
As a side note, just so you know, this is your specific requirement; we get other people with their own, like how to train with really large datasets, use attention masking, use 1D images, or encode the images beforehand, among other things. So there are quite a lot of people who have a diffusers + "their own requirement" feature request.
I'm doing a training example for myself (as a learning experience) whenever I have time, and I'm adding bucketing but only for very specific resolutions. When I finish it I intend to share it and maybe write a blog post about it. This will probably take a while, though, as I have other priorities right now.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.