
Make `interleave_datasets` more robust

sbmaruf opened this issue on Oct 12 '21 · 3 comments

Is your feature request related to a problem? Please describe.
Right now there are a few hiccups when using `interleave_datasets`. The interleaved dataset iterates only until the smallest dataset exhausts its iterator, so the larger datasets may never complete a full epoch. This also creates problems for epoch accounting, since there is no way to track how many epochs each dataset in `interleave_datasets` has completed.
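To make the problem concrete, here is a minimal sketch (toy data, made-up contents) showing how iteration stops once the smallest dataset is exhausted:

```python
from datasets import Dataset, interleave_datasets

# Two toy datasets of very different sizes.
small = Dataset.from_dict({"text": [f"small-{i}" for i in range(3)]})
large = Dataset.from_dict({"text": [f"large-{i}" for i in range(100)]})

# With the default behavior, interleaving stops as soon as `small`
# runs out, so most of `large` is never visited.
mixed = interleave_datasets([small, large])
print(len(mixed))  # far fewer than 103 examples; `large` never finishes an epoch
```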

Describe the solution you'd like
For the `interleave_datasets` module:

  • [ ] Add a boolean argument `--stop-iter` to `interleave_datasets` that controls whether datasets iterate indefinitely; with `--stop-iter=False`, iteration should never raise the `StopIteration` exception.
  • [ ] Maintain an internal list variable `iter_cnt` that tracks how many times (in steps/epochs) each dataset has been iterated at a given point.
  • [ ] Add a list-type argument `--max-iter` that specifies the maximum number of times each dataset may iterate. Once one dataset reaches its `--max-iter`, the other datasets should continue sampling, and `StopIteration` should only be raised once all datasets have finished their respective `--max-iter`. (A rough sketch of these semantics follows this list.)
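Here is a framework-agnostic sketch of the proposed semantics. The names `max_iter` and `iter_cnt` mirror the proposal above; none of this is part of the `datasets` API:

```python
def interleave_with_limits(datasets, max_iter=None):
    """Round-robin over `datasets`. Each dataset restarts from the top
    until it has completed max_iter[i] full passes, then drops out;
    StopIteration only happens once every dataset hits its limit."""
    max_iter = max_iter or [1] * len(datasets)
    iter_cnt = [0] * len(datasets)                # full passes completed so far
    iters = [iter(d) for d in datasets]
    active = set(range(len(datasets)))
    while active:
        for i in sorted(active):                  # snapshot: safe to discard below
            try:
                yield next(iters[i])
            except StopIteration:
                iter_cnt[i] += 1                  # dataset i finished one epoch
                if iter_cnt[i] >= max_iter[i]:
                    active.discard(i)             # quota reached, stop sampling it
                else:
                    iters[i] = iter(datasets[i])  # restart for the next epoch

# e.g. list(interleave_with_limits([range(2), range(3)], max_iter=[2, 1]))
# -> [0, 0, 1, 1, 2, 0, 1]
```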

Note: I'm new to the `datasets` API. Maybe these features already exist in `datasets`.

Since multitask training is one of the latest trends, I believe this feature would make the `datasets` API more popular.

@lhoestq

sbmaruf · Oct 12 '21

Hi @lhoestq, any response on this issue?

sbmaruf · Jan 25 '22

Hi! Sorry for the late response.

I agree `interleave_datasets` would benefit a lot from having more flexibility. If I understand correctly, it would be nice to be able to define stopping strategies like `stop="first_exhausted"` (default) or `stop="all_exhausted"`. If you'd like to contribute this feature, I'd be happy to give you some pointers :)
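From the user side, such a strategy argument might look like the following. The parameter spelling here is illustrative (later `datasets` releases expose this idea as `stopping_strategy`; the current docs are the authoritative reference):

```python
from datasets import Dataset, interleave_datasets

small = Dataset.from_dict({"x": [0, 1, 2]})
large = Dataset.from_dict({"x": list(range(10))})

# "first_exhausted": stop as soon as `small` runs out (the current default).
# "all_exhausted": keep sampling, restarting smaller datasets, until every
# dataset has been fully seen at least once.
mixed = interleave_datasets([small, large], stopping_strategy="all_exhausted")
```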

Also, one can already cap the number of samples drawn per dataset by calling `dataset.take(n)` on the dataset that should contribute only `n` samples.
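For example, with streaming datasets (where `take` is available), the cap could look like this; the dataset names are placeholders:

```python
from datasets import load_dataset, interleave_datasets

# "dataset_a" and "dataset_b" are placeholder names.
d1 = load_dataset("dataset_a", split="train", streaming=True).take(1000)
d2 = load_dataset("dataset_b", split="train", streaming=True)

# d1 now contributes at most 1000 examples to the mix.
mixed = interleave_datasets([d1, d2])
```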

Regarding the `iter_cnt` counter, I think this requires a bit more thought, since we might have to backpropagate the counter if `map` or other transforms have been applied after `interleave_datasets`.

lhoestq · Jan 26 '22

@sbmaruf I just noticed that (1) `interleave_datasets` only samples indices once and reuses them for all epochs, and (2) it's limited by the smallest dataset. Did you figure out an alternative way to achieve the same purpose?
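One workaround sketch for point (1): rebuild the interleaved dataset each epoch with a different `seed`, so the sampled index order changes across epochs (sizes and probabilities below are illustrative):

```python
from datasets import Dataset, interleave_datasets

d1 = Dataset.from_dict({"x": list(range(5))})
d2 = Dataset.from_dict({"x": list(range(50))})

for epoch in range(3):
    # A fresh seed each epoch yields a different interleaving order.
    mixed = interleave_datasets([d1, d2], probabilities=[0.5, 0.5], seed=epoch)
    for example in mixed:
        ...  # training step
```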

memray · Jul 30 '22