Proposal: reproducibility in DataLoader
Hi,
In the DataLoader class there is an argument `seed_for_shuffle` which gives reproducibility when using DataLoader with `infinite=False` and `shuffle=True` (between two different training runs, the data fed to the network will always be the same). But why not do the same when `infinite=True`? Even if batch generation is infinite, you may want reproducibility between two different training runs.
So, there are two alternatives:

1. When doing `return np.random.choice(self.indices, self.batch_size, replace=True, p=self.sampling_probabilities)` in line 118 of DataLoader, do it using `self.rs` (i.e. `return self.rs.choice(self.indices, self.batch_size, replace=True, p=self.sampling_probabilities)`).
2. Instead of just setting `self.rs` at init of the DataLoader instance, do `np.random.seed()`.
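To make the first alternative concrete, here is a minimal, self-contained sketch; `ToyInfiniteLoader` is a hypothetical stand-in for the real DataLoader (the attribute names mirror the ones quoted above), not the upstream code:

```python
import numpy as np

class ToyInfiniteLoader:
    """Toy stand-in for DataLoader with infinite=True (a sketch for this
    proposal, not the real batchgenerators class)."""

    def __init__(self, indices, batch_size, seed_for_shuffle=None,
                 sampling_probabilities=None):
        self.indices = indices
        self.batch_size = batch_size
        self.sampling_probabilities = sampling_probabilities
        # The real DataLoader already creates this RandomState in __init__
        self.rs = np.random.RandomState(seed_for_shuffle)

    def get_indices(self):
        # Before (line 118): np.random.choice(self.indices, ...), which
        # depends on the global np.random state.
        # Alternative 1: draw from the seeded per-instance state instead.
        return self.rs.choice(self.indices, self.batch_size, replace=True,
                              p=self.sampling_probabilities)

loader = ToyInfiniteLoader(np.arange(10), batch_size=4, seed_for_shuffle=1234)
print(loader.get_indices())  # identical across runs with the same seed
```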
Thanks for your time!
UPDATE:
To ensure that we have reproducibility even if `infinite=True`, the second option must be applied. With the first one alone, if other lines of code calling `np.random` are executed, this reproducibility is lost, because the global seed is consumed. For example, if the transformation we pass to the Single/MultiThreadedAugmenter is MirrorTransform, then passing `axes=(0,)` vs `axes=(0, 1)` to the latter, we will not get the same data back from the DataLoader having done only `np.random.seed(seed)`, because inside MirrorTransform we have executed methods involving `np.random` one and two times, respectively. I have observed exactly this case.
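That effect is easy to reproduce in isolation. The sketch below is a hypothetical reconstruction of it: the dummy `np.random.uniform()` calls stand in for MirrorTransform's internal draws (one per mirrored axis), and the final `np.random.choice` stands in for the loader's index sampling under `np.random.seed(seed)` alone:

```python
import numpy as np

def indices_after_transform(n_internal_draws):
    """Seed globally, simulate a transform making n_internal_draws calls
    to np.random, then sample batch indices as the loader would."""
    np.random.seed(1234)                # global seed only (option 2 alone)
    for _ in range(n_internal_draws):
        np.random.uniform()             # stand-in for MirrorTransform's draws
    return np.random.choice(10, 4)      # the loader's next index draw

print(indices_after_transform(1))  # e.g. axes=(0,): one internal draw
print(indices_after_transform(2))  # e.g. axes=(0, 1): two draws -> different indices
```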
Moreover: both options must be applied to ensure full reproducibility! This way, between different training runs where we only want to change, for example, the optimiser, we ensure that the same random transformations are always applied in both runs.
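Putting the two together, a run would be set up roughly like this (schematic, reusing the hypothetical `ToyInfiniteLoader` from the sketch above; `seed` is any value shared between the runs being compared):

```python
import numpy as np

seed = 1234

# Option 2: seed the global NumPy state, so transforms drawing from
# np.random (e.g. MirrorTransform) make the same decisions in every run.
np.random.seed(seed)

# Option 1: pass seed_for_shuffle, so index sampling comes from the seeded
# per-instance self.rs and is immune to consumption of the global stream.
loader = ToyInfiniteLoader(np.arange(10), batch_size=4, seed_for_shuffle=seed)
print(loader.get_indices())
```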
Best regards, Nácher.