[DataPipe] Add RandomSplitter (without buffer)
Stack from ghstack:
- -> #724
This PR adds RandomSplitter without a buffer. The upside is that it uses less memory (good for memory-bound cases), but the downsides are that 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).
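To make the trade-off concrete, here is a minimal pure-Python sketch of the buffer-less idea. It is not the code in this PR: `iter_split` and its parameters are hypothetical names, and it assigns each item with an independent weighted draw, which may differ from how the actual DataPipe sizes its groups.

```python
import random
from typing import Dict, Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_split(
    source: Iterable[T], weights: Dict[str, float], target: str, seed: int
) -> Iterator[T]:
    """Yield only the items assigned to `target`; nothing is buffered."""
    if target not in weights:
        raise KeyError(f"target {target!r} is not one of {list(weights)}")
    rng = random.Random(seed)  # same seed -> same assignment on every pass
    keys = list(weights)
    probs = [weights[k] for k in keys]
    for item in source:
        # A group is drawn for every item, including the ones that are
        # skipped, so every target sees the same sequence of draws.
        group = rng.choices(keys, weights=probs)[0]
        if group == target:
            yield item


weights = {"train": 0.8, "valid": 0.2}
train = list(iter_split(range(100), weights, "train", seed=0))
valid = list(iter_split(range(100), weights, "valid", seed=0))
```

Each call walks the whole source and discards the non-matching items, which is the "potentially wasteful" part; in exchange, no items are held in memory for the other groups.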
Implementation note:
- I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.
- I also decided against having an option to change the seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing the seed would be confusing and would cause inconsistency.
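The inconsistency mentioned above can be illustrated with a small standalone sketch. `assign_groups` is a hypothetical helper, not part of this PR; it only mimics drawing a group label per item from a seeded generator.

```python
import random
from typing import List, Sequence


def assign_groups(
    n: int, seed: int, groups: Sequence[str] = ("test", "valid")
) -> List[str]:
    """Draw one group label per item index from a seeded generator."""
    rng = random.Random(seed)
    return [rng.choice(groups) for _ in range(n)]


n = 1000

# Same seed on both passes: the two iterations agree on every item.
first_pass = assign_groups(n, seed=0)
second_pass = assign_groups(n, seed=0)
consistent = first_pass == second_pass

# Seed changed between passes: "test" comes from one assignment and
# "valid" from another, so items can land in both groups or in neither.
changed_pass = assign_groups(n, seed=1)
test_idx = {i for i, g in enumerate(first_pass) if g == "test"}
valid_idx = {i for i, g in enumerate(changed_pass) if g == "valid"}
overlap = test_idx & valid_idx
missing = set(range(n)) - (test_idx | valid_idx)
```

With a fixed seed the groups stay disjoint and cover the source; once the seed changes between the `test` pass and the `valid` pass, that guarantee is gone.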
See #712 for related discussion. See #723 for the version with buffer.
Differential Revision: D38675266
Offline Discussion:
- This buffer-less version is likely better, but we need a clearer error message.
- Let's support both syntaxes: if "target" is provided, return only one DataPipe; otherwise, return a list of DataPipes. Look at the first commit.
- We definitely want `set_seed` to allow changing of `seed`.
- The default behavior should be the same seed every epoch. We can have an argument to allow automatically changing the seed between epochs.
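The behavior agreed on above could look roughly like the following sketch. `SplitView` and `random_split_sketch` are hypothetical names used only for illustration, and the per-item weighted draw is an assumption; the real DataPipe may be structured quite differently.

```python
import random
from typing import Dict, Iterable, Iterator, List, Optional, Union


class SplitView:
    """Hypothetical stand-in for one split: re-iterable, no buffer."""

    def __init__(
        self, source: Iterable, weights: Dict[str, float], seed: int, target: str
    ) -> None:
        self.source = source
        self.weights = weights
        self.seed = seed
        self.target = target

    def set_seed(self, seed: int) -> "SplitView":
        # Explicit, user-driven seed change; nothing changes automatically.
        self.seed = seed
        return self

    def __iter__(self) -> Iterator:
        # Re-seeding at the start of every pass gives the same split each epoch.
        rng = random.Random(self.seed)
        keys = list(self.weights)
        probs = list(self.weights.values())
        for item in self.source:
            if rng.choices(keys, weights=probs)[0] == self.target:
                yield item


def random_split_sketch(
    source: Iterable,
    weights: Dict[str, float],
    seed: int,
    target: Optional[str] = None,
) -> Union[SplitView, List[SplitView]]:
    """Return one view when `target` is given, otherwise one view per key."""
    if target is not None:
        return SplitView(source, weights, seed, target)
    return [SplitView(source, weights, seed, key) for key in weights]


weights = {"train": 0.5, "valid": 0.5}
train, valid = random_split_sketch(range(10), weights, seed=0)
only_train = random_split_sketch(range(10), weights, seed=0, target="train")
```

In this sketch each view carries its own seed, so calling `set_seed` on one view but not the others would break the disjointness of the groups; a real implementation would presumably need to guard against that.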
@nivekt has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Can we derive `total_length` from source Datapipe if possible?
> Can we derive `total_length` from source Datapipe if possible?
Updated the implementation to do that; it raises an exception when it cannot infer the length from the source.
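As a rough illustration of that fallback (not the code in this PR), the logic might look like the sketch below. `resolve_total_length` is a hypothetical helper, and the choice of `TypeError` and the wording of the message are assumptions.

```python
from typing import Iterable, Optional


def resolve_total_length(source: Iterable, total_length: Optional[int] = None) -> int:
    """Prefer an explicit `total_length`; otherwise try `len(source)`."""
    if total_length is not None:
        return total_length
    try:
        return len(source)  # type: ignore[arg-type]
    except TypeError as exc:
        # Sources without __len__ (for example generators) end up here.
        raise TypeError(
            "Cannot infer the length of the source; "
            "please pass `total_length` explicitly."
        ) from exc


inferred = resolve_total_length([10, 20, 30])
explicit = resolve_total_length((x for x in range(5)), total_length=5)
```

Passing `total_length` explicitly remains the escape hatch for sources that do not implement `__len__`.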
Thanks for the helpful comments. It is simpler than before now!