Alternative batching behavior for mixed-size training
Continuing #6620
In this version, the dataset is batched almost as if there were no bucketing. Internally, a batch is replaced with a superbatch consisting of one or more batches.
I don't have the hardware to test the performance of this, but I expect a slight regression. If needed, I can add back the previous behavior as an option.
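To make the superbatch idea concrete, here is a minimal sketch of the structure described above. This is not the PR's actual code; the names (Batch, SuperBatch, iterate_superbatches, train_step) are hypothetical and only illustrate how a superbatch stands in for a single batch while issuing one or more same-resolution batches.

```python
from typing import Callable, List, Sequence

Batch = List[int]          # indices of same-resolution samples
SuperBatch = List[Batch]   # one or more batches issued back-to-back

def iterate_superbatches(superbatches: Sequence[SuperBatch],
                         train_step: Callable[[Batch], None]) -> None:
    # A superbatch takes the place of what used to be a single batch:
    # it may issue several smaller batches, one per resolution it contains.
    for superbatch in superbatches:
        for batch in superbatch:
            train_step(batch)
```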
So I guess I'm the first tester. This is my dataset running at batch size 8.
Buckets:
384x768: 1
448x768: 1
512x768: 153
768x512: 12
So currently, in my quick testing, I don't see any noticeable performance drop (it/s looks the same) for this dataset. CUDA memory usage looks the same as well.
Edit: So I'm trying out a very normal dataset that was not run through your auto-crop script.
Buckets:
329x768: 1
370x768: 1
432x768: 1
496x768: 2
501x768: 2
502x768: 1
505x768: 1
506x768: 1
508x768: 1
509x768: 2
510x768: 1
512x768: 1
513x768: 1
515x768: 1
516x768: 3
517x768: 1
518x768: 3
519x768: 5
520x768: 2
521x768: 2
522x768: 6
523x768: 2
524x768: 1
525x768: 2
526x768: 14
527x768: 1
528x768: 1
529x768: 5
530x768: 6
531x768: 4
532x768: 2
533x768: 1
535x768: 7
536x768: 6
537x768: 10
538x768: 10
539x768: 14
540x768: 9
541x768: 6
542x768: 4
543x768: 7
544x768: 3
546x768: 1
554x768: 1
768x536: 3
768x538: 1
768x539: 2
768x540: 1
768x542: 2
768x544: 1
768x545: 2
And I see my GPU usage fluctuate as expected.
Everyone's playing with LoRAs, so they don't see this lol.
What's the practical difference between this and using batch size 1 with gradient accumulation?
Say you have bucket sizes a: 3, b: 4, c: 5 and batch_size=2. This does [a, a], [b, b], [b, b], [c, c], [c, c], [a, c], where [a, c] issues 2 batches. Previously, the [a, c] (super)batch would instead be chosen randomly from [a, a], [c, c]. If you just use batch_size=1, then [a, a], [b, b], etc. wouldn't be batched.
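For illustration, here is a hedged sketch that reproduces the schedule above for a: 3, b: 4, c: 5 with batch_size=2: full per-bucket batches first, with the leftovers packed into one superbatch. This is only one reading of the described behavior; make_schedule and its output layout are made up for this example, not the code in the PR, and the real leftover packing may differ.

```python
from typing import Dict, List, Tuple

def make_schedule(bucket_sizes: Dict[str, int],
                  batch_size: int) -> List[List[Tuple[str, int]]]:
    schedule: List[List[Tuple[str, int]]] = []   # each entry is a superbatch
    leftovers: List[Tuple[str, int]] = []        # remainders that don't fill a batch
    for bucket, n in bucket_sizes.items():
        full, rest = divmod(n, batch_size)
        # Each full batch is a superbatch containing a single batch.
        schedule += [[(bucket, batch_size)]] * full
        if rest:
            leftovers.append((bucket, rest))
    if leftovers:
        # Pack the remainders into one superbatch; it issues one batch per bucket.
        schedule.append(leftovers)
    return schedule

print(make_schedule({"a": 3, "b": 4, "c": 5}, batch_size=2))
# [[('a', 2)], [('b', 2)], [('b', 2)], [('c', 2)], [('c', 2)], [('a', 1), ('c', 1)]]
```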
This is a lot of new code. I really don't want to merge it in unless there is a visible benefit.
It's ~30 lines added. I'm only providing this method because the previous one was thought to be confusing. It is a speed reduction in some cases, as reported above. In terms of quality, we need more evidence. I agree that if there's no visible improvement, then the simpler but possibly more confusing method is better, as I commented before.
If the amount of code change is a concern, greedy_pack() can also be cut for simplicity.
Independent, documented, side-effect-free functions I'm completely fine with; it's changes to existing lines that scare me, because I have to go through them all and understand what they change.
Oh, that's mostly indentation. It just looks scary on GitHub; VSCode actually displays it nicely.