Proposal: reproducibility in DataLoader
Hi,
In the DataLoader class there is an argument `seed_for_shuffle` which gives reproducibility when using DataLoader with `infinite=False` and `shuffle=True` (between two different training runs, the data fed to the network will always be the same). But why not do the same when `infinite=True`? Even if batch generation is infinite, you may want reproducibility between two different training runs.
So, there are two alternatives:

1. When doing `return np.random.choice(self.indices, self.batch_size, replace=True, p=self.sampling_probabilities)` in line 118 of DataLoader, do it using `self.rs` (i.e. `return self.rs.choice(self.indices, self.batch_size, replace=True, p=self.sampling_probabilities)`).
2. Instead of just setting `self.rs` at init of the DataLoader instance, do `np.random.seed()`.
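To make the first alternative concrete, here is a minimal, self-contained sketch; `ToyInfiniteLoader` is a hypothetical stand-in for the real DataLoader (the attribute names mirror the ones quoted above), not the upstream code:

```python
import numpy as np

class ToyInfiniteLoader:
    """Toy stand-in for DataLoader with infinite=True (a sketch for this
    proposal, not the real batchgenerators class)."""

    def __init__(self, indices, batch_size, seed_for_shuffle=None,
                 sampling_probabilities=None):
        self.indices = indices
        self.batch_size = batch_size
        self.sampling_probabilities = sampling_probabilities
        # The real DataLoader already creates this RandomState in __init__
        self.rs = np.random.RandomState(seed_for_shuffle)

    def get_indices(self):
        # Before (line 118): np.random.choice(self.indices, ...), which
        # depends on the global np.random state.
        # Alternative 1: draw from the seeded per-instance state instead.
        return self.rs.choice(self.indices, self.batch_size, replace=True,
                              p=self.sampling_probabilities)

loader = ToyInfiniteLoader(np.arange(10), batch_size=4, seed_for_shuffle=1234)
print(loader.get_indices())  # identical across runs with the same seed
```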
Thanks for your time!
UPDATE:
To ensure that we have reproducibility even if `infinite=True`, the second option must be applied. With the first one alone, if other lines of code calling `np.random` are executed, this reproducibility is lost, because the global seed is consumed. For example, if the transformation we pass to the Single/MultiThreadedAugmenter is MirrorTransform, then passing `axes=(0,)` vs `axes=(0, 1)` to the latter, we will not get the same data back from the DataLoader having done only `np.random.seed(seed)`, because inside MirrorTransform we have executed methods involving `np.random` one and two times, respectively. I have observed exactly this case.
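That effect is easy to reproduce in isolation. The sketch below is a hypothetical reconstruction of it: the dummy `np.random.uniform()` calls stand in for MirrorTransform's internal draws (one per mirrored axis), and the final `np.random.choice` stands in for the loader's index sampling under `np.random.seed(seed)` alone:

```python
import numpy as np

def indices_after_transform(n_internal_draws):
    """Seed globally, simulate a transform making n_internal_draws calls
    to np.random, then sample batch indices as the loader would."""
    np.random.seed(1234)                # global seed only (option 2 alone)
    for _ in range(n_internal_draws):
        np.random.uniform()             # stand-in for MirrorTransform's draws
    return np.random.choice(10, 4)      # the loader's next index draw

print(indices_after_transform(1))  # e.g. axes=(0,): one internal draw
print(indices_after_transform(2))  # e.g. axes=(0, 1): two draws -> different indices
```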
Moreover: both options must be applied to ensure full reproducibility! This way, between different training runs where we only want to change, for example, the optimiser, we ensure that the same random transformations are always applied in both runs.
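Putting the two together, a run would be set up roughly like this (schematic, reusing the hypothetical `ToyInfiniteLoader` from the sketch above; `seed` is any value shared between the runs being compared):

```python
import numpy as np

seed = 1234

# Option 2: seed the global NumPy state, so transforms drawing from
# np.random (e.g. MirrorTransform) make the same decisions in every run.
np.random.seed(seed)

# Option 1: pass seed_for_shuffle, so index sampling comes from the seeded
# per-instance self.rs and is immune to consumption of the global stream.
loader = ToyInfiniteLoader(np.arange(10), batch_size=4, seed_for_shuffle=seed)
print(loader.get_indices())
```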
Best regards, Nácher.