
Alternative batching behavior for mixed-size training

Open guaneec opened this issue 2 years ago • 8 comments

Continuing #6620

In this version, the dataset is batched almost as if there were no bucketing. Internally, each batch is replaced with a superbatch consisting of one or more batches.

I don't have the hardware to test the performance of this but I expect a slight regression. If needed I can add back the previous behavior as an option.

guaneec avatar Jan 20 '23 06:01 guaneec

So I guess I'm the first tester. This is my dataset running at batch size 8.

Buckets:
  384x768: 1
  448x768: 1
  512x768: 153
  768x512: 12

So currently, in my quick testing, I don't see any noticeable performance drop (it/s looks the same) for this dataset. CUDA memory usage looks the same as well.

Edit: So I'm trying out a very normal dataset that was not run through your auto-crop script.

Buckets:
  329x768: 1
  370x768: 1
  432x768: 1
  496x768: 2
  501x768: 2
  502x768: 1
  505x768: 1
  506x768: 1
  508x768: 1
  509x768: 2
  510x768: 1
  512x768: 1
  513x768: 1
  515x768: 1
  516x768: 3
  517x768: 1
  518x768: 3
  519x768: 5
  520x768: 2
  521x768: 2
  522x768: 6
  523x768: 2
  524x768: 1
  525x768: 2
  526x768: 14
  527x768: 1
  528x768: 1
  529x768: 5
  530x768: 6
  531x768: 4
  532x768: 2
  533x768: 1
  535x768: 7
  536x768: 6
  537x768: 10
  538x768: 10
  539x768: 14
  540x768: 9
  541x768: 6
  542x768: 4
  543x768: 7
  544x768: 3
  546x768: 1
  554x768: 1
  768x536: 3
  768x538: 1
  768x539: 2
  768x540: 1
  768x542: 2
  768x544: 1
  768x545: 2

And I see my GPU usage fluctuate as expected.

USBhost avatar Jan 20 '23 15:01 USBhost

Everyone's playing with LoRAs and they don't see this lol.

USBhost avatar Jan 23 '23 01:01 USBhost

What's the practical difference between this and using batch size 1 with gradient accumulation?

AUTOMATIC1111 avatar Jan 23 '23 12:01 AUTOMATIC1111

Say you have bucket sizes a: 3, b: 4, c: 5 and batch_size=2. This does [a, a], [b, b], [b, b], [c, c], [c, c], [a, c], where [a, c] issues 2 batches. Previously, the [a, c] (super)batch would instead have been chosen randomly from [a, a] or [c, c]. If you just use batch_size=1, then [a, a], [b, b], etc. wouldn't be batched at all.
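
A rough sketch of that scheduling, in case it helps (toy code, not the actual change in this PR; the a/b/c buckets and counts are just the made-up example above):

```python
import random

def schedule_superbatches(bucket_counts, batch_size, seed=0):
    """Toy sketch of the scheduling described above (not the PR's actual code).

    bucket_counts: dict mapping a bucket (e.g. a resolution) to its image count.
    Returns a list of superbatches; each superbatch is a list of
    (bucket, batch_length) pairs issued back to back.
    """
    rng = random.Random(seed)
    superbatches = []
    leftovers = []  # (bucket, remainder) pairs that don't fill a whole batch

    for bucket, n in bucket_counts.items():
        full, rem = divmod(n, batch_size)
        # Every full batch drawn from a single bucket is its own superbatch.
        superbatches.extend([(bucket, batch_size)] for _ in range(full))
        if rem:
            leftovers.append((bucket, rem))

    # Leftovers from different buckets are merged into one superbatch,
    # which issues one smaller batch per bucket.
    if leftovers:
        superbatches.append(leftovers)

    rng.shuffle(superbatches)
    return superbatches

for sb in schedule_superbatches({"a": 3, "b": 4, "c": 5}, batch_size=2):
    print(sb)
# Six superbatches: [('a', 2)], [('b', 2)] x2, [('c', 2)] x2, and the mixed
# leftover [('a', 1), ('c', 1)], which issues two batches of size 1.
```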

guaneec avatar Jan 23 '23 12:01 guaneec

This is a lot of new code. I really don't want to merge it in unless there is a visible benefit.

AUTOMATIC1111 avatar Jan 23 '23 14:01 AUTOMATIC1111

It's ~30 lines added. I'm only providing this method because the previous one is thought to be confusing. There is a speed reduction in some cases, as reported above. In terms of quality, we need more evidence. I agree that if there's no visible improvement, then the simpler but possibly more confusing method is better, as I commented before.

If the amount of code change is a concern, greedy_pack() can also be cut for simplicity.
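
If it helps, the kind of leftover packing I mean could look roughly like this (purely an illustration; this is not the PR's greedy_pack(), and the function name and behavior here are just a made-up first-fit example):

```python
def pack_leftovers_greedy(leftovers, batch_size):
    """Hypothetical first-fit-decreasing packing of leftover (bucket, count)
    pairs into superbatches whose total image count stays within batch_size.
    Purely illustrative; not the PR's greedy_pack()."""
    superbatches = []
    for bucket, count in sorted(leftovers, key=lambda x: x[1], reverse=True):
        for sb in superbatches:
            # Add to the first superbatch that still has room.
            if sum(c for _, c in sb) + count <= batch_size:
                sb.append((bucket, count))
                break
        else:
            superbatches.append([(bucket, count)])
    return superbatches

# e.g. leftovers a:1, c:1 with batch_size=2 -> one superbatch: [('a', 1), ('c', 1)]
print(pack_leftovers_greedy([("a", 1), ("c", 1)], batch_size=2))
```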

guaneec avatar Jan 23 '23 15:01 guaneec

Independent, documented, no-side-effects functions I'm completely fine with; it's changes to existing lines that scare me, because I have to go through them all and understand what they change.

AUTOMATIC1111 avatar Jan 23 '23 15:01 AUTOMATIC1111

Oh, that's mostly indentation. It just looks scary on GitHub; VSCode actually displays it nicely.

guaneec avatar Jan 23 '23 15:01 guaneec