data Linter for DataPipe/DataLoader2

🚀 The feature

This issue proposes the addition of a linter for DataPipes and DataLoader2. The linter can analyze the graph of DataPipes and input arguments to DataLoaderV, and inform the users if any errors may occur ahead of time. The incomplete list of issues that the linter may try to analyze and raise is below. Please feel free to edit the list directly to add more or comment below.

Essential:

[ ] Multiple references to the same iterator/DataPipe
- This can cause issue when serialized, suggest users to fork
[ ] Duplicate usage of shuffle/batch/collate
[ ] Shuffle/batch/collate are missing?
[ ] Warn if shuffling is not done?
[ ] Warn if sharding is not specificed for Distributed/Multiprocessing
[ ] Warn about shuffling before sharding (not mandatory because inputs may be pre-shuffled)
[ ] Multiprocess/distributed behavior related to sharding/shuffling
[ ] Warn if filter appears between on_disk_cache and end_caching sections.
[ ] Find unreachable children within graph and warns (because they might prevent buffers from being empty in fork and etc)
[ ] Warn about passing DataPipes that have already been partially read (invalid state), but are passed into DataLoader (and we might have to force reset the DataPipe in DataLoader)
[ ] Detect what external packages are not installed within DataPipe graph

Nice-to-have:

[ ] Check DataPipe object size and warn if it is too big (e.g. premature initialization of large structures)
[ ] Check if fork datapipe creates two or more copies of StreamWrapper or IOBase

Motivation, pitch

Having a linter will encourage best practices of DataPipe usages and reduces the number of unexpected bugs/behaviors in the data loading process during runtime.

Alternatives

Only raise exceptions during runtime.

Additional context

This linter is expected to work with DataPipes and DataLoaderV2. We should consider if it should work with the original DataLoader as well (and how).

cc: @VitalyFedyunin @ejguan

Apr 19 '22 21:04 NivekT

IIRC, we should enforce shuffling before sharding. cc @NicolasHug

Apr 20 '22 15:04 pmeier

If datasource is already shuffled to some extent, shard before shuffle might be valid operation. So warning IMO suffice.

Apr 20 '22 18:04 VitalyFedyunin

We should add one more linter to check DataPipe object size and warn if it is too big (ex premature initialization of large structures).

Apr 20 '22 18:04 VitalyFedyunin

#429 This PR introduces the linter to check if there is a shuffle before each sharding. This should mainly be used by domain libraries to verify their implementation. Note: users can still disable shuffle using DL2

For user-facing linter, we might provide a variant from this linter function. Raise warning when no shuffle before sharding and raise Error/warning if there is a shuffle behind `sharding.

May 23 '22 15:05 ejguan

[ ] Warn if filter appears between on_disk_cache and end_caching sections.

Jun 01 '22 21:06 VitalyFedyunin

data data copied to clipboard

Linter for DataPipe/DataLoader2

🚀 The feature

Motivation, pitch

Alternatives

Additional context

data
data copied to clipboard