data icon indicating copy to clipboard operation
data copied to clipboard

[DataPipe] Add RandomSplitter (without buffer)

Open NivekT opened this issue 2 years ago • 8 comments

Stack from ghstack:

  • -> #724

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

Implementation note:

  • I decided against reusing _ChildDataPipe since its features are overly complicated for this use case.
  • I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for test and the second iteration is for valid. Changing seed will be confusing and causes inconsistency.

See #712 for related discussion. See #723 for the version with buffer.

Differential Revision: D38675266

NivekT avatar Aug 09 '22 23:08 NivekT

Offline: Discussion:

  • This buffer-less version is likely better but we need more clear error message.
  • Let's support both syntax - if "target" is provided, then return only one DataPipe. Otherwise, returns a list of DataPipes. Look at the first commit.
  • We definitely want set_seed to allow changing of seed.
  • The default behavior should be same seed every epoch. We can have an argument to allow automatically changing of seed between epochs.

NivekT avatar Aug 12 '22 19:08 NivekT

@nivekt has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

NivekT avatar Aug 12 '22 22:08 NivekT

@nivekt has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

NivekT avatar Aug 12 '22 22:08 NivekT

@nivekt has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

NivekT avatar Aug 12 '22 22:08 NivekT

Can we derive total_length from source Datapipe if possible?

VitalyFedyunin avatar Aug 15 '22 15:08 VitalyFedyunin

@nivekt has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

NivekT avatar Aug 15 '22 16:08 NivekT

@nivekt has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

NivekT avatar Aug 15 '22 18:08 NivekT

Can we derive total_length from source Datapipe if possible?

Updated the implementation to do that with an exception when it cannot infer length from the source.

NivekT avatar Aug 15 '22 18:08 NivekT

@nivekt has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

NivekT avatar Aug 16 '22 22:08 NivekT

@nivekt has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

NivekT avatar Aug 25 '22 19:08 NivekT

Thanks for the helpful comments. It is simpler than before now!

NivekT avatar Aug 25 '22 20:08 NivekT

@nivekt has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

NivekT avatar Aug 25 '22 20:08 NivekT

@nivekt has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

NivekT avatar Aug 29 '22 16:08 NivekT

@nivekt has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

NivekT avatar Aug 29 '22 19:08 NivekT