
Fixed proper dataset shuffling

FlameLaw opened this issue 3 years ago • 8 comments

Continuation of #3728. Turns out I wasn't hallucinating: torch.randperm was returning a fixed list as training continued ([1, 93, ...]). This doesn't happen every time; I suspect it is triggered by full VRAM, or by conflicts with multiprocessing such as generating image previews mid-training. So I swapped the function for a NumPy one and it works smoothly, producing a proper shuffle even when torch.randperm fails to be random.

The commit message should be fixed - this fix also applies to hypernetworks.
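The swap described above can be sketched roughly as follows (an assumed shape, not the exact repo diff): a dedicated NumPy Generator keeps its own state, so the shuffle stays random even if the global torch RNG is reseeded elsewhere.

```python
import numpy as np

# Rough sketch of the fix (illustrative, not the exact repo code):
# draw the shuffle order from a dedicated NumPy Generator instead of
# torch.randperm, whose output was observed to become a fixed list.
rng = np.random.default_rng()

def shuffled_indexes(dataset_length: int) -> np.ndarray:
    # Returns a fresh random permutation of [0, dataset_length)
    return rng.permutation(dataset_length)

order = shuffled_indexes(100)
assert sorted(order.tolist()) == list(range(100))  # a valid permutation
```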

FlameLaw avatar Oct 27 '22 17:10 FlameLaw

Does your image preview cycle exactly match the dataset length? Are you also using fixed-seed generation and settings? I agree that we need an independent RNG for shuffling, since it gets fixed when we create an image preview.
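The hazard described here can be illustrated with Python's stdlib RNG as a stand-in for torch's global generator (names are illustrative): if preview generation reseeds the shared global RNG on each cycle, every subsequent "shuffle" repeats the same permutation.

```python
import random

# Stand-in illustration: a shuffle that relies on the shared global RNG.
def shuffle_with_global_rng(n: int) -> list:
    idx = list(range(n))
    random.shuffle(idx)
    return idx

random.seed(1234)          # preview with a fixed seed reseeds the global RNG
first = shuffle_with_global_rng(10)
random.seed(1234)          # the next preview reseeds it again...
second = shuffle_with_global_rng(10)
assert first == second     # ...so the "random" shuffle order repeats
```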

aria1th avatar Oct 27 '22 17:10 aria1th

I found the error in a case where the dataset length and the image generation interval were equal. But the random failure also happened when they were not equal. Weird stuff. I also used a fixed seed and settings.

ghost avatar Oct 27 '22 17:10 ghost

It seems to work, at least for me. I modified the code mid-training, and after a thousand steps it generated more usable images. Thanks, and I hope it lands in master as soon as possible.

mykeehu avatar Oct 27 '22 21:10 mykeehu

Can you use a new Random object from Python's random module, with fixed-seed generation, and then permute from it? We'll need reproducible results, i.e. fixed seeds so the same result can be reproduced.

aria1th avatar Oct 28 '22 13:10 aria1th

simply

```python
from random import Random, randrange
import numpy as np

# in __init__
self.seed = randrange(1 << 32)
self.random = Random(self.seed)

# in shuffle
self.indexes = np.array(self.random.sample(range(self.dataset_length), self.dataset_length))

# add a method so the seed can be fixed later
def fix_seed(self, seed=None):
    self.seed = seed if seed is not None else randrange(1 << 32)
    self.random = Random(self.seed)
```

aria1th avatar Oct 28 '22 13:10 aria1th

This is more of a torch problem, not an algorithm change. I have seen the problem with 100 training images while saving an image every 100 or 107 steps on an RTX 3070, logging each step to CSV. I don't think this can be reproduced with simple random objects.

ghost avatar Oct 28 '22 13:10 ghost

No, I mean it would be better to record the random state and have a unique RNG that is only used for dataset shuffling. That way we can reproduce training with the same dataset + same rate + same setup (well, not with dropout...), which might help research in the future.
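The reproducibility idea can be sketched as follows (illustrative names, not from the PR): a dedicated RNG seeded deterministically per epoch yields an identical shuffle order on every rerun, without touching the global or torch RNG.

```python
from random import Random

# Sketch: derive a distinct, deterministic seed per epoch so the whole
# shuffle sequence is reproducible from a single base seed.
def epoch_shuffle(base_seed: int, epoch: int, n: int) -> list:
    rng = Random(base_seed * 1_000_003 + epoch)  # dedicated RNG, per-epoch seed
    return rng.sample(range(n), n)               # a full permutation of [0, n)

run1 = epoch_shuffle(42, epoch=3, n=16)
run2 = epoch_shuffle(42, epoch=3, n=16)
assert run1 == run2  # same base seed and epoch reproduce the same order
```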

aria1th avatar Oct 28 '22 13:10 aria1th

I can't think of a way to make an endless random permutation with seeds, so if you do make one, you'll have to set a fixed number of training steps. Anyway, that's unrelated to what this PR is trying to solve.

ghost avatar Oct 28 '22 14:10 ghost