[Feat]: Print out filenames of the images that get dropped out of a bucket?
Describe your use-case.
A user in the Discord requested a small change: print to the console which images get dropped as a result of not being able to fill a bucket.
What would you like to see as a solution?
Console printing of the filenames
Have you considered alternatives? List them here.
A larger piece of work, but see #267.
Rather than dropping samples, I've modified my local copy of mgds to repeat samples to fill each bucket. I chose this route because if there were fewer samples than batch_size in a bucket, that bucket effectively gets dropped entirely (and those samples don't get trained at all!)
This does mean that those samples get repeated more often than other samples in the dataset, so caveat emptor. A proper solution would probably warn about buckets which can't be evenly filled, with a hint to balance your dataset a bit better (a sketch of such a warning follows the diff below).
# drop images for full buckets
for bucket_key in bucket_dict.keys():
-    samples = bucket_dict[bucket_key]
-    samples_to_drop = len(samples) % self.batch_size
-    for i in range(samples_to_drop):
-        # print('dropping sample from bucket ' + str(bucket_key))
-        samples.pop()
+    # round the bucket's length up to the next multiple of batch_size, then
+    # repeat the sample list and truncate, so the final batch is always full
+    l = math.ceil(len(bucket_dict[bucket_key]) / self.batch_size) * self.batch_size
+    bucket_dict[bucket_key] = (bucket_dict[bucket_key] * self.batch_size)[:l]
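For what it's worth, keeping the drop behavior but making it visible could be as small as this (a sketch against the original loop above; I don't know mgds's sample structure, so I'm assuming repr() of a popped sample is informative enough to identify the file):

# drop images for full buckets, but say so
for bucket_key in bucket_dict.keys():
    samples = bucket_dict[bucket_key]
    samples_to_drop = len(samples) % self.batch_size
    if samples_to_drop > 0:
        print(f"bucket {bucket_key}: dropping {samples_to_drop} of {len(samples)} samples "
              f"(not enough to fill a batch of {self.batch_size}; consider rebalancing aspect ratios)")
    for _ in range(samples_to_drop):
        dropped = samples.pop()
        print(f"  dropped: {dropped!r}")  # a filename field would be nicer, if samples carry one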
@cheald I was searching for precisely this issue, and I'll try to implement what you did here.
But honestly, I understand why they'd choose to drop the images rather than duplicate them - some people may not want or need the duplication.
But what I cannot understand is why you aren't notified of it. If there's one thing I can't stomach, it's software that silently changes the expected behavior. Printing out which images are getting dropped, or at least the fact that some images are getting dropped, goes a long way compared to suddenly finding I have 14 fewer images than I thought.
Incidentally I believe I've tracked down the bucketing algorithm to AspectBucketing in mgds.
It might be interesting to "reverse engineer" it to find out which images will go into buckets that won't be filled for a particular BS.
However, my knowledge of the whole thing isn't good enough to replicate this; I'm unsure what information is being passed here.
I suppose resolution_in_name may be something like the resolution of the source image, maybe 1024 x 1152, or whatever it gets converted to in latent space, because I've no idea how they change.
I suppose, again, that target resolution is what you type into OneTrainer, normally 1024 for SDXL.
But beyond that, there's just too much I don't know, and I'm not a Python programmer. If this were C#, which I use professionally, I'd know my way around and could very easily debug the whole thing. But with Python I have a lot of difficulty, starting with the fact that the mgds library doesn't even seem to be found in the venv that Python is running in, or anywhere else.
What I do know is that I find it very jarring when the software drops buckets without informing you or giving you information to act on. Just knowing which images were in the dropped bucket would go a long, long way toward, you know, changing their aspect ratio.
And having some kind of script that, given the concept directory, would estimate which buckets will be created and whether any images will need to be dropped would also help a lot.
Edit: from what I see, target resolution in OneTrainer seems to come directly from the settings, probably from the config as given by the GUI, so I'm going to assume it's 1024 (at least in my SDXL case, which I don't change).
In DataLoaderText2ImageMixin I see CalcAspect being used, and from my understanding it queries the property 'original_resolution', which I saw is part of some batch dictionary-like structure. What I think is being returned is a tuple with the dimensions.
And this seems to be what AspectBucketing uses. However, I'm unsure whether I'm capable of reproducing it in some kind of script to actually sieve through my images.
It's very vexing to spend time curating a training set only to find a huge chunk of it silently dropped because I tried to use BS 4.
I asked Copilot to reverse engineer it in a simpler way, at least avoiding the more 'exotic' options such as resolution overrides, etc.
Considering the dropping of the last batch, as per PyTorch's behavior, the numbers DO match what I've seen today during my training (first with BS = 4, then BS = 2). No idea how accurate it is, but this is as much as I can get Copilot to do.
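For anyone unfamiliar: the PyTorch behavior referred to above is presumably the drop_last flag on torch.utils.data.DataLoader, which discards a final incomplete batch. A minimal illustration, separate from the script below:

import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 samples with batch_size=4 and drop_last=True yield only 2 batches;
# the last 2 samples are never seen by the training loop
loader = DataLoader(TensorDataset(torch.arange(10)), batch_size=4, drop_last=True)
print(len(loader))  # 2

The script Copilot produced: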
import math
import numpy as np
import os
import glob
import argparse
from PIL import Image
def simple_aspect_bucketing(image_paths, target_resolution=1024, batch_size=4, quantization=8):
"""
Groups images into buckets based on aspect ratio for efficient batch processing.
Args:
image_paths: List of paths to images
target_resolution: Target resolution dimension (e.g., 1024 for 1024x1024)
batch_size: Size of batches for training
quantization: Value to align dimensions to (e.g., 8 for dimensions divisible by 8)
Returns:
dict: Mapping of bucket_resolution -> list of images
list: Images that were dropped (couldn't form complete batches)
"""
# Define standard aspect ratios (same as in the original implementation)
standard_aspects = [
(1.0, 1.0), # 1:1
(1.0, 1.25), # 4:5
(1.0, 1.5), # 2:3
(1.0, 1.75), # 4:7
(1.0, 2.0), # 1:2
(1.0, 2.5), # 2:5
(1.0, 3.0), # 1:3
(1.0, 3.5), # 2:7
(1.0, 4.0), # 1:4
]
# Create normalized bucket resolutions from the standard aspect ratios
bucket_resolutions = []
for h, w in standard_aspects:
# For each aspect ratio, calculate dimensions with similar pixel area to target_resolution²
norm_h = int(round(h / math.sqrt(h * w) * target_resolution / quantization) * quantization)
norm_w = int(round(w / math.sqrt(h * w) * target_resolution / quantization) * quantization)
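        # sanity check on the math above: norm_h * norm_w ≈ target_resolution**2,
        # e.g. the (1.0, 1.5) aspect at 1024 comes out to roughly 840 x 1256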
bucket_resolutions.append((norm_h, norm_w))
# Add the flipped version too (portrait vs landscape)
bucket_resolutions.append((norm_w, norm_h))
# Remove duplicates
bucket_resolutions = list(set(bucket_resolutions))
# Calculate aspect ratio for each bucket
bucket_aspects = [h/w for h, w in bucket_resolutions]
bucket_aspects = np.array(bucket_aspects)
# Group images by their bucket
buckets = {res: [] for res in bucket_resolutions}
# Process each image and assign to a bucket
for img_path in image_paths:
try:
with Image.open(img_path) as img:
w, h = img.size
aspect = h / w
# Find the closest matching bucket
bucket_idx = np.argmin(abs(bucket_aspects - aspect))
target_res = bucket_resolutions[bucket_idx]
# Add to the appropriate bucket
buckets[target_res].append(img_path)
except Exception as e:
print(f"Error processing {img_path}: {e}")
# Identify completed buckets and dropped images
completed_buckets = {}
dropped_images = []
for bucket_res, images in buckets.items():
# Calculate how many complete batches we can form
complete_batch_count = len(images) // batch_size
if complete_batch_count > 0:
# Keep images that form complete batches
completed_buckets[bucket_res] = images[:complete_batch_count * batch_size]
# Track dropped images (those that don't form complete batches)
if len(images) % batch_size != 0:
dropped_images.extend(images[complete_batch_count * batch_size:])
else:
# All images in this bucket are dropped
dropped_images.extend(images)
return completed_buckets, dropped_images
def main():
"""
Command-line interface to run the aspect bucketing on a directory of images.
"""
parser = argparse.ArgumentParser(description="Group images into aspect ratio buckets")
parser.add_argument("directory", help="Directory containing images to process")
parser.add_argument("--batch_size", type=int, default=4,
help="Batch size for training (default: 4)")
parser.add_argument("--target_resolution", type=int, default=1024,
help="Target resolution dimension (default: 1024)")
parser.add_argument("--quantization", type=int, default=8,
help="Value to align dimensions to (default: 8)")
args = parser.parse_args()
# Find all images in the directory
    image_extensions = ['*.jpg', '*.jpeg', '*.png']  # note: glob is case-sensitive on Linux, so '*.JPG' etc. would need adding
image_paths = []
for ext in image_extensions:
pattern = os.path.join(args.directory, '**', ext)
image_paths.extend(glob.glob(pattern, recursive=True))
print(f"Found {len(image_paths)} images in {args.directory}")
# Process images
buckets, dropped_images = simple_aspect_bucketing(
image_paths,
target_resolution=args.target_resolution,
batch_size=args.batch_size,
quantization=args.quantization
)
# Print results
print(f"\nResults with batch size {args.batch_size}:")
print(f"Images successfully bucketed: {sum(len(imgs) for imgs in buckets.values())}")
print(f"Images dropped: {len(dropped_images)}")
if dropped_images:
print("\nDropped images:")
max_display = 20 # Maximum number to display in console
for i, img_path in enumerate(dropped_images):
if i < max_display:
print(f" {img_path}")
elif i == max_display:
print(f" ... and {len(dropped_images) - max_display} more")
break
print("\nBucket statistics:")
for resolution, images in sorted(buckets.items(), key=lambda x: len(x[1]), reverse=True):
h, w = resolution
batch_count = len(images) // args.batch_size
print(f" {h}x{w} (aspect {h/w:.2f}): {len(images)} images ({batch_count} batches)")
if __name__ == "__main__":
main()
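Assuming the script above is saved as, say, bucket_check.py (the filename is mine), running it against a concept directory looks like:

python bucket_check.py /path/to/concept --batch_size 4 --target_resolution 1024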
I did something fairly similar to this with the "concept statistics" tab, though that doesn't identify the specific filenames in each bucket, and it doesn't account for repeats/samples. With 1 repeat it should be accurate, and any integer multiple would be proportional, but with samples or non-integer repeats it's going to be somewhat random which images end up in an epoch, and thus which buckets are filled. Multiple concepts may also change how that works. Feel free to copy anything in that script or improve on it; I'm probably going to work on some optimizations for it soon.
I personally haven't found much of a speed improvement from increasing the batch size above 2, but that may depend on your GPU. Gradient accumulation should do the same thing as a large batch size, but without the aspect ratio limitation. Also, if training with multiple resolutions, I think images at different resolutions will cause issues similar to the aspect ratio ones in terms of which images can be batched together.
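To make the gradient accumulation point concrete, here is a generic PyTorch-style sketch (not OneTrainer's actual training loop): stepping the optimizer every N micro-batches gives an effective batch of batch_size * N, and each micro-batch can come from a different aspect bucket.

import torch

def train_with_accumulation(model, loader, optimizer, accumulation_steps=4):
    # effective batch size = loader's batch_size * accumulation_steps
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        # placeholder loss; scale it so accumulated gradients average rather than sum
        loss = torch.nn.functional.mse_loss(model(inputs), targets) / accumulation_steps
        loss.backward()  # gradients accumulate across micro-batches
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    # leftover micro-batches at the end are ignored in this sketch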
I think we're not going to implement "printing out filenames of images that get dropped". That's just too verbose for people who are not interested. The underlying topic is valid, but it's already addressed here: https://github.com/Nerogar/OneTrainer/issues/267
Unless anyone strongly disagrees, I'd close this.
No one else has followed up with reasons (nor has the original user I opened this for), so I'm closing.