Make `interleave_datasets` more robust
**Is your feature request related to a problem? Please describe.**
Right now there are a few hiccups when using `interleave_datasets`. The interleaved dataset iterates only until the smallest dataset completes its iterator, so larger datasets may never complete a full epoch of iteration. This also creates problems for epoch calculation, since there is no way to track how many epochs each dataset inside `interleave_datasets` has completed.
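To make the problem concrete, here is a minimal sketch of the current behavior (the toy datasets are made up for illustration):

```python
from datasets import Dataset, interleave_datasets

big = Dataset.from_dict({"x": list(range(100))})  # toy stand-ins
small = Dataset.from_dict({"x": list(range(3))})

mixed = interleave_datasets([big, small])
print(len(mixed))  # 6 — iteration stops once `small` is exhausted,
                   # so `big` contributes only 3 of its 100 examples
```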
**Describe the solution you'd like**
For the `interleave_datasets` module:

- [ ] Add a boolean argument `--stop-iter` in `interleave_datasets` that enables the dataset to either iterate an infinite number of times or not. That means it should not raise a `StopIteration` exception when `--stop-iter=False`.
- [ ] Add an internal list variable `iter_cnt` that tracks how many times (in steps/epochs) each dataset has been iterated at a given point.
- [ ] Add an argument `--max-iter` (list type) that specifies the maximum number of times each dataset can iterate. After one dataset completes its `--max-iter`, the other datasets should continue sampling, and only when all datasets have finished their respective `--max-iter` should `StopIteration` be raised (see the sketch after this list).
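A minimal sketch of what this could look like internally — this is not part of the `datasets` API; the function name is made up, and `iter_cnt`/`max_iter` follow the proposal above with Pythonic spelling:

```python
def interleave_until_all_exhausted(datasets, max_iter):
    """Round-robin over `datasets`, restarting each one until it has
    completed max_iter[i] epochs; the stream ends only once every
    dataset is done. Assumes each dataset is non-empty."""
    iter_cnt = [0] * len(datasets)             # proposed per-dataset epoch counter
    iterators = [iter(ds) for ds in datasets]
    while any(cnt < m for cnt, m in zip(iter_cnt, max_iter)):
        for i in range(len(datasets)):
            if iter_cnt[i] >= max_iter[i]:
                continue                        # this dataset already hit its limit
            try:
                yield next(iterators[i])
            except StopIteration:
                iter_cnt[i] += 1                # one full epoch completed
                if iter_cnt[i] < max_iter[i]:
                    iterators[i] = iter(datasets[i])  # restart and keep sampling
                    yield next(iterators[i])
```

For example, `list(interleave_until_all_exhausted([[1, 2, 3], ["a"]], max_iter=[1, 2]))` would yield `[1, "a", 2, "a", 3]`: the smaller dataset runs twice while the larger one finishes its single epoch.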
Note: I'm new to the `datasets` API. Maybe these features are already there in the library.
Since multitask training is one of the latest trends, I believe this feature would make the `datasets` API more popular.
@lhoestq
Hi @lhoestq, any response on this issue?
Hi ! Sorry for the late response.
I agree `interleave_datasets` would benefit a lot from having more flexibility. If I understand correctly, it would be nice to be able to define stopping strategies like `stop="first_exhausted"` (default) or `stop="all_exhausted"`. If you'd like to contribute this feature, I'd be happy to give you some pointers :)
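Usage could look something like this — a hypothetical signature; `stop` is the argument name suggested above, not an existing parameter:

```python
from datasets import interleave_datasets

# Hypothetical: keep sampling until every dataset has been fully consumed
mixed = interleave_datasets([ds1, ds2], stop="all_exhausted")
```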
Also, one can already set the max number of iterations per dataset by calling `dataset.take(n)` on the dataset that should only contribute `n` samples.
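For example, on iterable datasets (a sketch with made-up toy data; `to_iterable_dataset` is used here only to build small streaming datasets):

```python
from datasets import Dataset, interleave_datasets

ds1 = Dataset.from_dict({"text": [f"a{i}" for i in range(100)]}).to_iterable_dataset()
ds2 = Dataset.from_dict({"text": [f"b{i}" for i in range(100)]}).to_iterable_dataset()

# Cap each source at 5 samples so no dataset contributes more than that
mixed = interleave_datasets([ds1.take(5), ds2.take(5)])
print([ex["text"] for ex in mixed])  # ['a0', 'b0', 'a1', 'b1', ..., 'a4', 'b4']
```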
Regarding the `iter_cnt` counter, I think this requires a bit more thought, since we might have to be able to backpropagate the counter if `map` or other transforms have been applied after `interleave_datasets`.
@sbmaruf I just noticed that (1) `interleave_datasets` only samples indices once and reuses them for all epochs, and (2) it is limited by the smallest dataset. Did you figure out an alternative way to achieve the same purpose?
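One possible workaround for point (1), sketched under the assumption that rebuilding the interleaved dataset each epoch is acceptable: pass a different `seed` per epoch so a new set of indices is sampled (the datasets and probabilities below are placeholders):

```python
from datasets import Dataset, interleave_datasets

ds1 = Dataset.from_dict({"x": list(range(10))})    # toy placeholder datasets
ds2 = Dataset.from_dict({"x": list(range(10, 20))})

for epoch in range(3):
    # Re-interleave each epoch with a different seed so new indices are sampled
    mixed = interleave_datasets([ds1, ds2], probabilities=[0.7, 0.3], seed=epoch)
    for example in mixed:
        ...  # training step goes here
```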