The `shuffle_after_epoch` parameter in `fn.readers.numpy` is independent of the `seed`.
Hi,
As expressed in issue #4319, which hasn't been solved yet (I'm not sure why that issue was closed), I would kindly like to know whether it is possible to make `shuffle_after_epoch` in `fn.readers.numpy` depend on the `seed` parameter. Currently the dataset is shuffled in a different way after every epoch, but if I restart the same iterator with a different seed, I still get the same shuffled datasets after every epoch, in the same order.
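For concreteness, here is a minimal sketch of the behavior (the `data` directory and the seed values are placeholders):

```python
# Minimal sketch: two pipelines with different seeds still produce identical
# epoch-to-epoch shuffle orders, because shuffle_after_epoch ignores `seed`.
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=1, num_threads=1, device_id=0)
def numpy_pipe(reader_seed):
    return fn.readers.numpy(
        file_root="data",            # placeholder directory of .npy files
        shuffle_after_epoch=True,    # reshuffles between epochs...
        seed=reader_seed,            # ...but this seed has no effect on it
    )

p1, p2 = numpy_pipe(reader_seed=11), numpy_pipe(reader_seed=22)
p1.build(); p2.build()
# Iterating both over several epochs yields samples in exactly the same order.
```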
`fn.readers.numpy` is, as far as I know, the only data reader which provides GPUDirect Storage support, and I do think that gives the NVIDIA DALI dataloader an edge. I really think this is a major flaw in the design of the numpy reader; it would be great if you could solve this issue.
Thanks :)
Hi @acecchini,
Thank you for reaching out.
> I really think this is a major flaw in the design of the numpy reader; it would be great if you could solve this issue.
This was a conscious design decision. The rationale is to keep the shuffling pattern identical across DALI instances running on different GPUs, so that the shards don't overlap. Another option we considered was asking the user to provide the same seed across the DALI pipeline instances.
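To illustrate the concern with a toy example (plain Python, not DALI code): if each instance shuffled the global index list with its own seed and then took its shard slice, the shards could overlap and miss samples.

```python
# Toy illustration of why the shuffle must agree across shards.
import random

indices = list(range(8))

def shard(seed, shard_id, num_shards):
    order = indices[:]
    random.Random(seed).shuffle(order)   # per-instance shuffle
    n = len(order) // num_shards
    return order[shard_id * n:(shard_id + 1) * n]

same = set(shard(0, 0, 2)) & set(shard(0, 1, 2))  # same seed: disjoint shards
diff = set(shard(0, 0, 2)) & set(shard(1, 1, 2))  # different seeds: can overlap
print(same)  # always empty
print(diff)  # typically non-empty -> duplicated and dropped samples
```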
> `fn.readers.numpy` is, as far as I know, the only data reader which provides GPUDirect Storage support
Can you tell us more about your use case? Why doesn't the default shuffling mode fit your needs?
Hi @JanuszL,
Thank you for your prompt response.
> Another option we considered was asking the user to provide the same seed across the DALI pipeline instances.
I do believe that is a much better design choice. In fact, I use DALI together with JAX, and DALI's JAX `data_iterator` does exactly this: it passes the same seed to all the pipelines (while of course passing a different `shard_id` and `device_id` to each).
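The pattern is simple; here is a sketch with plain DALI pipelines rather than the JAX plugin, using `random_shuffle` (which, unlike `shuffle_after_epoch`, does honor the seed). The `data` path and shard count are placeholders.

```python
# Sketch: one shared seed, per-instance shard_id/device_id.
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=8, num_threads=2)
def reader_pipe(shard_id, num_shards):
    return fn.readers.numpy(
        file_root="data",       # placeholder directory of .npy files
        random_shuffle=True,    # this shuffling mode is seed-dependent
        shard_id=shard_id,
        num_shards=num_shards,
    )

SEED = 1234
pipes = [
    reader_pipe(shard_id=i, num_shards=2, device_id=i, seed=SEED)
    for i in range(2)
]
```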
> Can you tell us more about your use case? Why doesn't the default shuffling mode fit your needs?
Well, first of all, from a theoretical standpoint, it violates some hypotheses about the sampling process. In machine learning, and in statistics more generally, we first design a theoretical model assuming that we can sample from the true distribution to which the data belongs. In practice, however, we only have access to a finite dataset, which is sufficiently large that we can assume we are in the law-of-large-numbers regime. We then associate a discrete uniform distribution with this dataset and construct an estimator of our loss by sampling from that distribution. In most use cases, because of how dataloaders are designed, instead of sampling uniformly we sample without replacement from the dataset until exhaustion (which corresponds to one epoch), and repeat this for a sufficiently large number of cycles; this yields another estimator of our loss. However, for this estimator to converge to the true loss, we assume that the sampling process is actually random; otherwise convergence is not theoretically guaranteed. Even if the shuffling differs from epoch to epoch, the associated permutations, and their order, remain the same for every new training run, which violates the randomness assumption.
From a more intuitive perspective, this means that during training you will always encounter the same datapoints in the same order, so the search space is necessarily restricted. Suppose the neural network is a function f of weights theta and data input x, and the loss L is a function of f(x, theta) and the data output y. Since the order D = ((x_i, y_i))_i is fixed (read: a data sequence with multiple epochs concatenated), the sequence of gradients involved in gradient descent is always a function of D and of the weights theta, the latter being the only source of randomness since we sample it randomly for every new training run. That means the search space, and the set of local minima into which the model's weights can fall, is mechanically restricted. Gradients applied at the beginning of a training run do not have the same influence on the training dynamics as those applied towards the end or at intermediate steps. Furthermore, we generally apply an optimizer schedule to the learning rate, which makes the training dynamics depend even more on the gradient sequence.
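To compress the argument into one formula (my notation): with a fixed data order $\sigma$ (the per-epoch permutations concatenated), SGD reads

$$
\theta_{t+1} \;=\; \theta_t \;-\; \eta_t\, \nabla_\theta\, L\big(f(x_{\sigma(t)}, \theta_t),\, y_{\sigma(t)}\big),
$$

so if $\sigma$ is identical in every run, the whole trajectory $(\theta_t)_t$ is a deterministic function of the initialization $\theta_0$ alone; re-running the training only explores what different initializations can reach, never what different data orders can.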
Hi, first of all, thanks for labelling this issue as an enhancement. I am following up to ask whether you are planning to actually implement this enhancement soon. My whole training pipeline is based on DALI, but it is problematic if I am unable to shuffle the dataset with different seeds. Thanks 😊
Thank you, @acecchini, for refreshing the topic.
Currently, we are pursuing other priorities, but if you are willing to contribute to the project, we are happy to assist and guide you through the necessary changes.
I am happy to contribute if it can be done in Python! However, if the piece of code to modify is written in C++, then the task goes beyond my skills 😅
If it happens to be a mixture of both, we could perhaps split the task, if you agree?
I'm afraid that most, if not all, of it goes to the native part.
In that case, I can only try to convince you and stress once again the importance of this issue. It really is a major flaw that prevents ML practitioners like me from carrying out experiments grounded in statistical foundations.
Anyway, thanks for taking the time to answer me 😊
Still no plans to implement this enhancement? :eyes:
I promise I will then stop asking, and will probably migrate to the Google Grain dataloader and lose the GPUDirect feature :(
Thanks anyway for taking the time to answer me!
Hi @acecchini,
Unfortunately, this enhancement is not as high on our ToDo list as you might wish. However, I encourage you to try implementing it on your own. It could be a great opportunity to see how vibe coding works for you on a project like DALI.
Maybe if you describe to me what needs to be done, I can see what I can do, but I cannot promise you anything...
@acecchini,
Here are some thoughts on adding the functionality you're requesting:
- Replace `kDaliDataloaderSeed` with a user-provided seed. To achieve this, for every operator that accepts the `shuffle_after_epoch` argument (like here), I would add a `shuffle_after_epoch_seed` argument:
  ```cpp
  .AddOptionalArg<uint64_t>("shuffle_after_epoch_seed", R"(Seed for shuffling after each epoch)", nullptr, false);
  ```
- Then, propagate this argument to `file_label_loader.h`. Once defined and available, it should be used instead of `kDaliDataloaderSeed`.
- Finally, extend the following tests:
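On the Python side, usage would then presumably look something like this (the argument is hypothetical and does not exist in current DALI; the path is a placeholder):

```python
# Hypothetical usage of the proposed `shuffle_after_epoch_seed` argument.
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=8, num_threads=2, device_id=0)
def pipe():
    return fn.readers.numpy(
        file_root="data",                # placeholder directory
        shuffle_after_epoch=True,
        shuffle_after_epoch_seed=1234,   # proposed user-provided seed
    )
```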
@mzient - Do you think this is a viable approach?
Alright, thanks for the leads, I'll try to do this when I find some time. You should probably expect this to take several months!