[DataPipe] Ensure all DataPipes Meet Testing Requirements
🚀 Feature
We have many tests for existing DataPipes (both in PyTorch Core and TorchData). However, over time, they have become less organized. Moreover, as the testing requirements expand, older DataPipes may not have tests to cover the newly added requirements.
This issue aims to track the status of tests for all DataPipes.
Motivation
We want to ensure test coverage for all DataPipes is complete, to reduce bugs and unexpected behavior.
Alternative
We should also create testing templates for IterDataPipe and MapDataPipe that can be widely applied.
IterDataPipe Tracker
X - Done, NA - Not Applicable, Blank - Not Done/Unclear
Test definitions (see the sketch after this list):
- Functional - unit test to ensure that the DataPipe works properly with various input arguments
- Reset - the DataPipe can be reset/restarted after being read
- `__len__` - the `__len__` method is implemented whenever possible (or explicitly not implemented)
- Serializable - the DataPipe is serializable
- Graph (future) - the DataPipe can be traversed as part of a DataPipe graph
- Snapshot (future) - the DataPipe can be saved/loaded as a checkpoint/snapshot
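As a rough illustration of what the first four requirements could look like as reusable checks (the helper names and the `IterableWrapper` -> `Mapper` example pipeline here are ours, not an existing template):

```python
import pickle

from torch.utils.data.datapipes.iter import IterableWrapper

def double(x):  # top-level function so the pipeline stays picklable
    return 2 * x

def check_functional_and_reset(dp, expected):
    # Functional: the DataPipe yields the expected elements.
    assert list(dp) == expected
    # Reset: reading it a second time restarts from the beginning.
    assert list(dp) == expected

def check_len(dp, expected_len):
    # __len__ is implemented whenever the length is knowable.
    assert len(dp) == expected_len

def check_serializable(dp, expected):
    # Serializable: a pickle round-trip yields an equivalent DataPipe.
    restored = pickle.loads(pickle.dumps(dp))
    assert list(restored) == expected

dp = IterableWrapper([1, 2, 3]).map(double)
check_functional_and_reset(dp, [2, 4, 6])
check_len(dp, 3)
check_serializable(dp, [2, 4, 6])
```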
Name | Module | Functional Test | Reset | `__len__` | Serializable (Picklable) | Graph | Snapshot
---|---|---|---|---|---|---|---
Batcher | Core | X | X | X | X | ||
Collator | Core | X | X | X | X | ||
Concater | Core | X | X | X | X | ||
Demultiplexer | Core | X | X | X | X | ||
FileLister | Core | X | X | X | X | ||
FileOpener | Core | X | X | X | X | ||
Filter | Core | X | X | X | X | ||
Forker | Core | X | X | X | X | ||
Grouper | Core | X | X | X | |||
IterableWrapper | Core | X | X | X | X | ||
Mapper | Core | X | X | X | X | ||
Multiplexer | Core | X | X | X | X | ||
RoutedDecoder | Core | X | X | X | X | ||
Sampler | Core | X | X | X | X | ||
Shuffler | Core | X | X | X | X | ||
StreamReader | Core | X | X | X | X | ||
UnBatcher | Core | X | X | X | |||
Zipper | Core | X | X | X | X | ||
BucketBatcher | Data | X | X | X | X | ||
CSVDictParser | Data | X | X | X | X | ||
CSVParser | Data | X | X | X | X | ||
Cycler | Data | X | X | X | X | ||
DataFrameMaker | Data | X | X | X | X | ||
Decompressor | Data | X | X | X | X | ||
Enumerator | Data | X | X | X | X | ||
FlatMapper | Data | X | X | X | X | ||
FSSpecFileLister | Data | X | X | X | X | ||
FSSpecFileOpener | Data | X | X | X | X | ||
FSSpecSaver | Data | X | X | X | X | ||
GDriveReader | Data | X | X | X | X | ||
HashChecker | Data | X | X | X | X | ||
Header | Data | X | X | X | X | ||
HttpReader | Data | X | X | X | X | ||
InMemoryCacheHolder | Data | X | X | X | X | ||
IndexAdder | Data | X | X | X | X | ||
IoPathFileLister | Data | X | X | X | X | ||
IoPathFileOpener | Data | X | X | X | X | ||
IoPathSaver | Data | X | X | X | X | ||
IterKeyZipper | Data | X | X | X | X | ||
JsonParser | Data | X | X | X | X | ||
LineReader | Data | X | X | X | X | ||
MapKeyZipper | Data | X | X | X | X | ||
OnDiskCacheHolder | Data | X | X | X | X | ||
OnlineReader | Data | X | X | X | X | ||
ParagraphAggregator | Data | X | X | X | X | ||
ParquetDataFrameLoader | Data | X | X | X | X | ||
RarArchiveLoader | Data | X | X | X | X | ||
Rows2Columnar | Data | X | X | X | X | ||
SampleMultiplexer | Data | X | X | X | X | ||
Saver | Data | X | X | X | X | ||
TarArchiveLoader | Data | X | X | X | X | ||
UnZipper | Data | X | X | X | X | ||
XzFileLoader | Data | X | X | X | X | ||
ZipArchiveLoader | Data | X | X | X | X |
MapDataPipe Tracker
X - Done, NA - Not Applicable, Blank - Not Done/Unclear
Name | Module | Functional Test | `__len__` | Serializable (Picklable) | Graph | Snapshot
---|---|---|---|---|---|---
Batcher | Core | X | X | |||
Concater | Core | X | X | |||
Mapper | Core | X | X | X | ||
SequenceWrapper | Core | X | X | X | ||
Shuffler | Core | X | X | |||
Zipper | Core | X | X |
cc: @ejguan @VitalyFedyunin @NivekT
This is awesome. One nit: serializable should be the same as picklable, IMO.
@NivekT I am concerned about when and how we want to do graph testing. Graph testing makes no sense for a single DataPipe instance, so we may want to construct a DataPipe graph ourselves. The problem is how we can guarantee test coverage for all use cases.
We can require each DataPipe to introduce a simple usage-example graph for this purpose (a sketch follows below).
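For instance, a minimal sketch of such a usage-example graph plus a traversal assertion, using the `traverse` utility from PyTorch Core and assuming its nested dict-of-dicts return format; the two-node graph here is an arbitrary example:

```python
from torch.utils.data.datapipes.iter import IterableWrapper
from torch.utils.data.graph import traverse

def add_one(x):  # named function, so traversal/pickling does not choke on a lambda
    return x + 1

# A simple usage-example graph for Mapper: IterableWrapper (source) -> Mapper.
source = IterableWrapper(range(10))
dp = source.map(add_one)

# traverse returns a nested dict mapping each DataPipe to its inputs;
# check that both nodes of the example graph are discovered.
graph = traverse(dp)
assert dp in graph
assert source in graph[dp]
```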
When we have time, we might need to go over our DataPipes again to identify any missing tests, since a few DataPipes were implemented recently.
Besides, for future reference, we might want to improve our testing framework into something similar to OpInfo in PyTorch Core, so that it checks testing coverage automatically without us going over each test ourselves.
Agreed that the OpInfo-like approach is probably the best; a rough sketch is below. I think our inputs and the necessary setup for each test are a bit all over the place. Having tests split between two repos doesn't help either.
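To make that concrete, here is a hedged sketch of what an OpInfo-style database could look like for DataPipes; `DataPipeInfo`, its fields, and the sample entries are all hypothetical, not an existing API:

```python
import pickle
from dataclasses import dataclass
from typing import Any, Callable, List

from torch.utils.data.datapipes.iter import IterableWrapper

def double(x):
    return 2 * x

@dataclass
class DataPipeInfo:
    # Hypothetical OpInfo-style entry: how to construct the DataPipe
    # under test and what it is expected to produce.
    name: str
    factory: Callable[[], Any]
    expected: List[Any]
    has_len: bool = True

# One entry per DataPipe; the common tests below run against every entry,
# so coverage of new requirements is automatic rather than per-test.
DATAPIPE_DB = [
    DataPipeInfo("IterableWrapper", lambda: IterableWrapper([1, 2, 3]), [1, 2, 3]),
    DataPipeInfo("Mapper", lambda: IterableWrapper([1, 2, 3]).map(double), [2, 4, 6]),
]

def run_common_tests(info: DataPipeInfo):
    dp = info.factory()
    assert list(dp) == info.expected           # Functional
    assert list(dp) == info.expected           # Reset
    if info.has_len:
        assert len(dp) == len(info.expected)   # __len__
    restored = pickle.loads(pickle.dumps(info.factory()))
    assert list(restored) == info.expected     # Serializable

for info in DATAPIPE_DB:
    run_common_tests(info)
```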