data
data copied to clipboard
Linter for DataPipe/DataLoader2
🚀 The feature
This issue proposes the addition of a linter for DataPipes and DataLoader2. The linter can analyze the graph of DataPipes and input arguments to DataLoaderV, and inform the users if any errors may occur ahead of time. The incomplete list of issues that the linter may try to analyze and raise is below. Please feel free to edit the list directly to add more or comment below.
Essential:
- [ ] Multiple references to the same iterator/DataPipe
- This can cause issue when serialized, suggest users to
fork
- This can cause issue when serialized, suggest users to
- [ ] Duplicate usage of shuffle/batch/collate
- [ ] Shuffle/batch/collate are missing?
- [ ] Warn if shuffling is not done?
- [ ] Warn if sharding is not specificed for Distributed/Multiprocessing
- [ ] Warn about shuffling before sharding (not mandatory because inputs may be pre-shuffled)
- [ ] Multiprocess/distributed behavior related to sharding/shuffling
- [ ] Warn if filter appears between on_disk_cache and end_caching sections.
- [ ] Find unreachable children within graph and warns (because they might prevent buffers from being empty in
fork
and etc) - [ ] Warn about passing DataPipes that have already been partially read (invalid state), but are passed into DataLoader (and we might have to force
reset
the DataPipe in DataLoader) - [ ] Detect what external packages are not installed within DataPipe graph
Nice-to-have:
- [ ] Check DataPipe object size and warn if it is too big (e.g. premature initialization of large structures)
- [ ] Check if
fork
datapipe creates two or more copies ofStreamWrapper
orIOBase
Motivation, pitch
Having a linter will encourage best practices of DataPipe usages and reduces the number of unexpected bugs/behaviors in the data loading process during runtime.
Alternatives
Only raise exceptions during runtime.
Additional context
This linter is expected to work with DataPipes and DataLoaderV2. We should consider if it should work with the original DataLoader as well (and how).
cc: @VitalyFedyunin @ejguan
IIRC, we should enforce shuffling before sharding. cc @NicolasHug
If datasource is already shuffled to some extent, shard before shuffle might be valid operation. So warning IMO suffice.
We should add one more linter to check DataPipe object size and warn if it is too big (ex premature initialization of large structures).
#429 This PR introduces the linter to check if there is a shuffle
before each sharding
. This should mainly be used by domain libraries to verify their implementation. Note: users can still disable shuffle using DL2
For user-facing linter, we might provide a variant from this linter function. Raise warning when no shuffle
before sharding
and raise Error/warning if there is a shuffle
behind `sharding.
[ ] Warn if filter
appears between on_disk_cache
and end_caching
sections.