sd-scripts
SDXL 1024 training - images assigned to wrong buckets
I'm training an SDXL LoRA and I don't understand why some of my images end up in the 960x960 bucket. Shouldn't square and square-like images go to the 1024x1024 bucket, provided the image resolution is high enough? This might be a problem with the script, or perhaps I'm misunderstanding how images are assigned to buckets... Can anyone shed some light on this?
I've got 27 high-res images (see below). As you can see, there is only a single image with width < 1024, but it's 988x1756, so it should go to one of the tall portrait buckets, right?
And this is how the buckets are reported (image repeat is set to 5), so 2 images are assigned to the 960x960 bucket:
bucket 0: resolution (768, 1216), count: 5
bucket 1: resolution (768, 1344), count: 5
bucket 2: resolution (832, 1088), count: 5
bucket 3: resolution (896, 1024), count: 10
bucket 4: resolution (896, 1088), count: 5
bucket 5: resolution (960, 960), count: 10
bucket 6: resolution (960, 1024), count: 20
bucket 7: resolution (960, 1088), count: 10
bucket 8: resolution (1024, 896), count: 5
bucket 9: resolution (1024, 960), count: 20
bucket 10: resolution (1088, 832), count: 5
bucket 11: resolution (1152, 832), count: 10
bucket 12: resolution (1152, 896), count: 10
bucket 13: resolution (1216, 832), count: 5
bucket 14: resolution (1280, 768), count: 5
bucket 15: resolution (1344, 768), count: 5
Below is my training command. The training resolution is 1024x1024, buckets are enabled, bucket upscale is disabled, and the bucket resolution step is 64.
accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=512 --max_bucket_reso=2048 --pretrained_model_name_or_path="E:/Automatic1111/stable-diffusion-webui/models/Stable-diffusion/sd/sdXL_v10VAEFix.safetensors" --train_data_dir="E:/Automatic1111/datasets/mona/train_v2" --resolution="1024,1024" --output_dir="E:/Automatic1111/datasets/mona/output" --logging_dir="E:/Automatic1111/datasets/mona/logs" --network_alpha="1" --training_comment="mona" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0003 --unet_lr=0.0003 --network_dim=32 --output_name="sdxl_mona3" --lr_scheduler_num_cycles="10" --cache_text_encoder_outputs --no_half_vae --full_bf16 --learning_rate="0.00035" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="1350" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --caption_extension="txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --flip_aug --xformers --bucket_no_upscale --noise_offset=0.0357 --network_train_unet_only --sample_sampler=k_dpm_2 --sample_prompts="E:/Automatic1111/datasets/mona/output\sample\prompt.txt" --sample_every_n_epochs="1"
I'm using kohya_gui on Windows. I already posted this issue there a week ago, but I didn't receive any answer, except for someone else reporting similar issues. I believe kohya_gui was synchronized with the kohya scripts two days ago and the issue is still there.
Also, when testing on a larger dataset with 160+ images and the same bucket/resolution settings, I get proper assignment and the square bucket is 1024x1024. I have no idea what makes this smaller dataset create such small buckets. Other than the buckets being wrong, everything else seems to be working fine, but of course I'm worried that I'm training at a suboptimal resolution.
Thank you for reporting this. Maybe it is a bug. I will investigate.
Thank you. If you need a dataset to reproduce this issue, I have converted all 27 images to black, stripped the metadata and uploaded them to OneDrive: https://1drv.ms/f/s!AmsXKhKEv4eUmds-fs8XFy8NBVnkEw?e=GCsdzE They are less than 1MB in total as they are all black PNGs.
Thank you for the dataset! It really helps me.
This is an intended behavior currently. Because the images are slightly out of square (e.g. 5874x6142), they are sorted into the 960x960 bucket.
The current process first resizes the image so that its area (width*height) is at or below the specified resolution while maintaining the aspect ratio. The number of pixels on the short side is then rounded down to a multiple of bucket_reso_steps, so the short side of the bucket becomes 960. The long side is then calculated from the aspect ratio and likewise rounded down to a multiple of bucket_reso_steps, which is also 960.
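To make the rule concrete, here is a minimal sketch of it as described above (this is my own illustration, not the actual sd-scripts code; the function and variable names are hypothetical):

```python
import math

def bucket_for_image_no_upscale(width, height, max_reso=(1024, 1024), reso_steps=64):
    # 1) shrink the image (keeping its aspect ratio) until width*height fits the max area
    max_area = max_reso[0] * max_reso[1]
    if width * height > max_area:
        scale = math.sqrt(max_area / (width * height))
        width, height = width * scale, height * scale
    # 2) round the short side down to a multiple of reso_steps
    short, long = min(width, height), max(width, height)
    short_bucket = int(short // reso_steps) * reso_steps
    # 3) derive the long side from the aspect ratio and round it down as well
    long_bucket = int((short_bucket * long / short) // reso_steps) * reso_steps
    return (short_bucket, long_bucket) if width <= height else (long_bucket, short_bucket)

# The slightly-off-square 5874x6142 example shrinks to roughly 1001x1047,
# and both sides then round down to 960:
print(bucket_for_image_no_upscale(5874, 6142))  # (960, 960)
```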
Unfortunately, at the moment I don't know how to calculate the sizes so that this bucket becomes 1024x1024 while keeping the proper area (height*width <= resolution[0]*resolution[1]) for each bucket.
So, although it is not a fundamental solution, it may be a good idea to set the resolution to 1024,1080 for now.
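A quick sanity check of that workaround using the rule sketched above: a 1024x1080 target raises the pixel budget to 1,105,920, so the 5874x6142 example shrinks to roughly 1028x1075 instead of 1001x1047; 1028 now rounds down to 1024, the long side derived from the aspect ratio (about 1071) also rounds down to 1024, and the image lands in the 1024x1024 bucket.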
Thank you for the explanation. I will use the workaround, but it seems that currently the safest way is to prepare the dataset with images that match the bucket aspect ratios exactly, or they might end up in buckets that are too small.
I think you could use a simpler algorithm to fix this issue:
- Generate all possible bucket sizes you can use for training (maybe respecting min_bucketsize, max_bucketsize, bucket_reso_steps)
- For each image, find the best bucket based only on aspect ratio (select the bucket with the aspect ratio most similar to the aspect ratio of the image)
  a. Skip buckets that are bigger than the image in any dimension unless bucket upscaling is enabled
  b. If two or more buckets have the same aspect ratio, use the bucket with the bigger area
- Finally, rescale all images to their corresponding buckets and crop whatever sticks out
I know you also have the random_crop option, which probably needs the image to have some margin in both axes when doing the final crop, but for static center crop the algorithm above should assign the images to the best possible bucket.
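A rough sketch of that assignment logic, for illustration only (my own naming and helper functions, using the same 1024x1024 target, 64-pixel steps, and 512-2048 limits as the command above):

```python
def make_buckets(max_reso=(1024, 1024), min_side=512, max_side=2048, steps=64):
    # enumerate candidate bucket resolutions whose area fits within max_reso
    max_area = max_reso[0] * max_reso[1]
    buckets = set()
    for w in range(min_side, max_side + 1, steps):
        h = min(max_side, (max_area // w) // steps * steps)
        if h >= min_side:
            buckets.update({(w, h), (h, w)})
    return sorted(buckets)

def pick_bucket(image_w, image_h, buckets, allow_upscale=False):
    # choose the bucket whose aspect ratio is closest to the image's;
    # skip buckets larger than the image unless upscaling is allowed,
    # and break ties in favour of the larger area
    img_ar = image_w / image_h
    candidates = [b for b in buckets
                  if allow_upscale or (b[0] <= image_w and b[1] <= image_h)]
    if not candidates:
        candidates = buckets  # image smaller than every bucket: fall back to all
    return min(candidates, key=lambda b: (abs(b[0] / b[1] - img_ar), -(b[0] * b[1])))

print(pick_bucket(5874, 6142, make_buckets()))  # -> (1024, 1024)
```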
Thank you! I agree that is the safest way.
And the algorithm will be helpful. However, the min/max bucket reso is ignored when no upscale is specified, so the number of possible resolutions increases significantly (in principle from 64x16384 to 16384x64). But I will try to think of a better way, using the algorithm as a reference.
Well, you could always NOT ignore min/max bucket reso :)
Anyway, I hope you can figure it out. I think a lot of people might not realize that they are training at suboptimal resolutions. I saw many tutorials on YouTube that say you can crop your images to any resolution you like and the bucketing algo will take care of the rest.
I think I will probably write a script that will crop my images prior to training. In fact, now that I think of it, you might do the same: crop all images to the nearest bucket aspect ratio as a pre-processing step. So, if you have a problematic image like 1050x1080 that gets assigned to the 960x960 bucket, it would first be cropped to 1050x1050, and such an image should be correctly assigned to the 1024x1024 bucket according to the current bucketing algo. Just another idea for consideration...
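A minimal sketch of that pre-cropping idea, assuming Pillow is installed (the function name and file names are hypothetical, and the target ratio would come from whichever bucket aspect ratio is nearest):

```python
from PIL import Image

def center_crop_to_aspect(img: Image.Image, target_ar: float) -> Image.Image:
    # crop the image symmetrically so that width/height == target_ar
    w, h = img.size
    if w / h > target_ar:                 # too wide -> trim the sides
        new_w, new_h = round(h * target_ar), h
    else:                                 # too tall -> trim top and bottom
        new_w, new_h = w, round(w / target_ar)
    left, top = (w - new_w) // 2, (h - new_h) // 2
    return img.crop((left, top, left + new_w, top + new_h))

# e.g. force the problematic 1050x1080 image to square before training
img = Image.open("problem_image.png")                      # placeholder file name
center_crop_to_aspect(img, 1.0).save("problem_image_cropped.png")
```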
Thank you for another idea!
In fact, if the bucket_no_upscale option is not specified, it will be assigned to the appropriate bucket, so the impact on users will be minimal.
make buckets
number of images (including repeats) / number of images in each bucket (including repeats)
bucket 0: resolution (768, 1344), count: 1
bucket 1: resolution (832, 1216), count: 1
bucket 2: resolution (896, 1152), count: 2
bucket 3: resolution (960, 1088), count: 7
bucket 4: resolution (1024, 1024), count: 5
bucket 5: resolution (1088, 960), count: 3
bucket 6: resolution (1152, 896), count: 5
bucket 7: resolution (1216, 832), count: 1
bucket 8: resolution (1280, 768), count: 1
bucket 9: resolution (1344, 768), count: 1
mean ar error (without repeats): 0.030601507204034612
However, as you wrote, if bucket_no_upscale is specified, it is not appropriate, and I will try to improve it.
Thank you, very good to know that.
BTW, is there a way to log which images get assigned to which bucket? I often get a situation where I end up with a single image assigned to a bucket, and it's hard for me to find out which one it is since all the resolutions are different. I was told that having the number of images in a bucket less than batch_size is problematic. Is this true?
On the contrary, in my tutorials I am specifically telling people to use exact resolutions :)
By the way, for one of my subscribers the bucketing system was causing an error and he was not able to train.
After manually cropping all images it started working.
@FurkanGozukara I've seen your tutorials - thanks for making them and educating the community :)
The other benefit of doing the cropping and downscaling on your own is that you can use a high-quality downsampling algorithm that produces sharp results for the best possible training. Photoshop has good image size reduction when using bicubic (sharper), but in most other programs I often have to apply a sharpening filter after downscaling because they produce results on the blurry side.
I have no idea what kind of downsampling filter is used in the kohya scripts; I'm guessing it's bilinear?
Unfortunately, there is currently no way to log or check the assignment. If some batch has fewer images than the batch size, it slows down the training a bit, but I don't think it makes the result much worse.
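As a rough do-it-yourself workaround, one could pre-scan the training folder and print the bucket each image would get under the no-upscale rule sketched earlier (this reuses the hypothetical bucket_for_image_no_upscale function from above; the folder path is the one from the training command):

```python
from pathlib import Path
from PIL import Image

# reuses bucket_for_image_no_upscale() from the earlier sketch
train_dir = Path("E:/Automatic1111/datasets/mona/train_v2")
for path in sorted(train_dir.rglob("*")):
    if path.suffix.lower() in {".png", ".jpg", ".jpeg", ".webp"}:
        w, h = Image.open(path).size
        print(f"{path.name}: {w}x{h} -> bucket {bucket_for_image_no_upscale(w, h)}")
```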
The downsampling filter is INTER_AREA in OpenCV. I believe it is better than LANCZOS4 or bilinear, cubic etc.
Thanks, this is very good to know. From what I found, INTER_AREA seems to be the most blurry of the available filters as it is designed to reduce moire effects. Here is the comparison I found: https://gist.github.com/georgeblck/e3e0274d725c858ba98b1c36c14e2835
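For anyone wanting to compare the filters on their own images, a quick sketch using OpenCV (the file names are placeholders):

```python
import cv2

img = cv2.imread("original.png")  # placeholder input image
target = (1024, 1024)
for name, flag in [("area", cv2.INTER_AREA),
                   ("lanczos4", cv2.INTER_LANCZOS4),
                   ("cubic", cv2.INTER_CUBIC)]:
    cv2.imwrite(f"downscaled_{name}.png", cv2.resize(img, target, interpolation=flag))
```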
If this stops training then this info should be on the main page.
I think this doesn't stop the training. It only has an effect when the --bucket_no_upscale option is specified, and even then it affects only some of the images in the dataset and may slightly worsen the training result.
Hi @kohya-ss and all! I've created a script for resizing and cropping images for bucketing. Given the discussion here, I believe it can be useful as a quick pre-processing step. Note that my defaults are the same as SDXL's (but it is possible to change them). Let me know if it makes sense: Resize and Crop Images for Bucketing.
@kohya-ss have you considered something like this?
def calculate_new_size_by_pixel_area(aspect_ratio: float, megapixels: float):
    if type(aspect_ratio) != float:
        raise ValueError(f"Aspect ratio must be a float, not {type(aspect_ratio)}")
    total_pixels = max(megapixels * 1e6, 1e6)
    W_initial = int(round((total_pixels * aspect_ratio) ** 0.5))
    H_initial = int(round((total_pixels / aspect_ratio) ** 0.5))

    # Ensure divisibility by 64 for both dimensions with minimal adjustment
    def adjust_for_divisibility(n):
        return (n + 63) // 64 * 64

    W_adjusted = adjust_for_divisibility(W_initial)
    H_adjusted = adjust_for_divisibility(H_initial)

    # Ensure the adjusted dimensions meet the megapixel requirement
    while W_adjusted * H_adjusted < total_pixels:
        W_adjusted += 64
        H_adjusted = adjust_for_divisibility(int(round(W_adjusted / aspect_ratio)))

    return (
        W_adjusted,
        H_adjusted,
        MultiaspectImage.calculate_image_aspect_ratio((W_adjusted, H_adjusted)),
    )
e.g. only use the original (rounded) aspect ratio as the input when resizing by area?
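For what it's worth, a quick check of that function with the numbers from this thread: for the 5874x6142 example (aspect ratio roughly 0.956) at 1.0 megapixel, the initial size works out to about 978x1023, which rounds up to 1024x1024, so the image would land in the square bucket. The trade-off is that rounding up means the bucket area can exceed the requested pixel budget, unlike the current rule, which keeps it below. (MultiaspectImage.calculate_image_aspect_ratio appears to be a helper from the commenter's own codebase.)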
I have read the discussion and I have a question regarding "max resolution" and bucket sizes. I am training SDXL (now also Flux) with 1024x1024 as the max resolution. I have three bucket sizes: 1024x1024, 1152x896 and 1216x832. Do I keep "max resolution" at 1024x1024, as this is the optimum training setting? Or do I increase the max resolution to the highest bucket resolution, 1216x1216? Or perhaps downscale the 1216x832 and 1152x896 buckets to 1024x704 and 1024x768 (with a crop) respectively?
I think it downscales according to the lowest size.
So when you set 1024, since it is bigger than 832, 896, 704 and 768, it won't downscale.
Let's also get verification from @kohya-ss - this is also something I wonder about.