
Native Loader Integration IR

Open mkuchnik opened this issue 1 year ago • 2 comments

For peak performance, each loader may have its own way of achieving certain operations. It would be useful to offer an intermediate representation that can be "lowered" to each respective dataloader variety. For example, I can imagine something like (loosely):

READ_FILES(["a.txt", "b.csv"]),
MAP(UNPACK_FILES),
SORT_VALUES("column_1"),
...

Croissant also has the entire graph of computations, so that is suitable as well. I wonder if it is worth exporting such an internal representation so that dataloaders can implement their own visitor pattern to "compile" down to the native code without having to use intermediate data representations.
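A minimal sketch of what such an IR plus a per-backend visitor could look like (all names here are illustrative, not an actual Croissant API):

```python
from dataclasses import dataclass, field
from typing import Any, List

# Hypothetical IR node: an operation name plus its arguments.
@dataclass
class Op:
    name: str                      # e.g. "READ_FILES", "MAP", "SORT_VALUES"
    args: List[Any] = field(default_factory=list)

# A pipeline is just an ordered list of ops.
pipeline = [
    Op("READ_FILES", [["a.txt", "b.csv"]]),
    Op("MAP", ["UNPACK_FILES"]),
    Op("SORT_VALUES", ["column_1"]),
]

class Visitor:
    """Each dataloader backend subclasses this to 'compile' the IR
    down to its own native operators."""
    def compile(self, ops):
        return [getattr(self, f"visit_{op.name.lower()}")(op) for op in ops]

class DebugBackend(Visitor):
    # A toy backend that just renders each op as a string.
    def visit_read_files(self, op):
        return f"read({op.args[0]})"
    def visit_map(self, op):
        return f"map({op.args[0]})"
    def visit_sort_values(self, op):
        return f"sort_by({op.args[0]})"

print(DebugBackend().compile(pipeline))
```

A real backend would emit, say, tf.data or PyArrow operators in its `visit_*` methods instead of strings, so no intermediate data representation is materialized.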

mkuchnik avatar Jan 17 '24 19:01 mkuchnik

@mkuchnik Hey Michael, that sounds very nice! Another way to look at it is to care about performance only in a few isolated cases:

  1. If the graph of computations is sequential (e.g., no join)
  2. If the underlying data is "ML-optimized" (Parquet, ArrayRecord, TFRecord)

Indeed, this means that the data is already prepared for data-intensive ML workflows. In other cases:

  • If the graph of computations is NOT sequential, probably the data needs some pre-processing
  • If the underlying data is NOT ML-optimized, probably reading will be a bottleneck anyway (either for deserialization or for random access)

So when both 1. and 2. are true, we could adapt tfds.data_source and datasets.Dataset to work smoothly with torch.utils.data.DataLoader and Croissant.
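Condition 1 is cheap to check mechanically. A minimal sketch (illustrative, not Croissant's actual graph model): a computation graph is "sequential" when every node has at most one predecessor and one successor, i.e. the graph is a simple chain with no joins.

```python
def is_sequential(edges):
    """edges: list of (src, dst) pairs describing the computation graph.

    Returns True when the graph is a simple chain: no node has more
    than one incoming or outgoing edge (i.e. no joins, no fan-out).
    """
    out_degree, in_degree = {}, {}
    for src, dst in edges:
        out_degree[src] = out_degree.get(src, 0) + 1
        in_degree[dst] = in_degree.get(dst, 0) + 1
    return (all(d <= 1 for d in out_degree.values())
            and all(d <= 1 for d in in_degree.values()))

# A linear pipeline qualifies for the fast path...
print(is_sequential([("read", "map"), ("map", "sort")]))  # True
# ...but a join (two inputs feeding one node) does not.
print(is_sequential([("a", "join"), ("b", "join")]))      # False
```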

What do you think? Also, what do you mean when you say "each respective dataloader variety"?

marcenacp avatar Feb 06 '24 14:02 marcenacp

@marcenacp This is roughly what I had in mind, and indeed those criteria seem like an appropriate fast path. It would be great to avoid the intermediate serialization/copies in such cases.

By "each respective dataloader variety", I mean that you likely want to emit the closure that is most compatible with each backend. For example, if the backend supports native operators, using those may be more efficient than plain Python.
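A toy sketch of that dispatch, with entirely hypothetical names: lowering an IR op to a backend-native implementation when one is registered, and falling back to a plain-Python closure otherwise.

```python
# Registry of backend-native implementations (in a real backend these
# would be C++-backed or vectorized kernels, e.g. tf.data ops).
NATIVE_OPS = {
    "SORT_VALUES": lambda rows, col: sorted(rows, key=lambda r: r[col]),
}

def lower(op_name, *args):
    """Return a closure implementing op_name over a list of rows."""
    if op_name in NATIVE_OPS:
        native = NATIVE_OPS[op_name]
        return lambda rows: native(rows, *args)      # native fast path
    # Generic per-row Python fallback: args[0] is a row -> row function.
    fn = args[0]
    return lambda rows: [fn(r) for r in rows]        # slow path

rows = [{"column_1": 2}, {"column_1": 1}]
sort = lower("SORT_VALUES", "column_1")
print(sort(rows))  # [{'column_1': 1}, {'column_1': 2}]
```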

mkuchnik avatar Feb 13 '24 22:02 mkuchnik