[RFC] Make the TorchData Library Standalone from PyTorch Core Library
🚀 The feature
Note that this is a request for comment; currently, there is no plan to make TorchData a standalone library. We would like to solicit feedback from the community.
Proposal: Make the TorchData library standalone with little to no dependency on the PyTorch Core library (i.e. torch).
Motivation, pitch
An argument for a standalone library is that it will allow users to use all the data loading functionalities in this library without installing/using PyTorch. Datasets implemented using TorchData may become usable by other frameworks.
An argument against this change is that, in order to keep certain DataLoader functionalities backward compatible with DataPipes, the torch library may need to depend on TorchData instead.
The list of arguments here is not comprehensive; feel free to leave a comment about potential use cases and how they will be impacted.
Alternatives
Leave the library as it is with dependency on torch.
Additional context
Please feel free to leave any comment/reaction to this proposal whether you are for or against this change. We'd like to hear from you!
cc: @VitalyFedyunin @ejguan @NivekT
I would really like this on my end. I'm writing an RL framework and part of the goal is to pickle a model and deploy it on a robotic system. I may or may not want torch actually installed there (maybe I'm using ONNX?), but I still want a pipeline for generically executing code.
Another piece of feedback from a downstream library (Ray):
They previously provided an interface for users to transform ray.Dataset into IterableDataset. As we plan to collaborate on switching it to IterDataPipe, they would have a hard requirement on torch=1.11.0. In order to provide flexible requirements, they need to figure out how to fall back to IterableDataset if torch doesn't meet the condition and torchdata doesn't exist in the environment.
If we make torchdata out-of-tree, with DataLoader2 and the related utilities all going into torchdata, we could potentially set a loose requirement on the version of torch.
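The fallback described in this comment could be sketched roughly as follows. This is only an illustration, not Ray's actual code: `version_tuple`, `resolve_iterable_base`, and `MIN_TORCH` are hypothetical names, and the `torchdata.datapipes.iter` import path is an assumption that may not hold for every torchdata release.

```python
import re

# Minimum torch version mentioned in the thread (assumed to be a lower bound).
MIN_TORCH = (1, 11, 0)


def version_tuple(version: str) -> tuple:
    """Parse the leading 'major.minor[.patch]' of a string like '1.11.0+cu113'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        return (0, 0, 0)
    return tuple(int(part or 0) for part in match.groups())


def resolve_iterable_base():
    """Pick the best available base class for an iterable-style dataset.

    Returns a (name, class) pair. Falls back step by step:
    IterDataPipe -> IterableDataset -> plain object.
    """
    try:
        import torch
    except ImportError:
        # torch is not installed at all: nothing torch-specific to inherit from.
        return "object", object

    if version_tuple(torch.__version__) >= MIN_TORCH:
        try:
            # Assumed import path; guarded because torchdata may be missing
            # or may not ship this module.
            from torchdata.datapipes.iter import IterDataPipe
            return "IterDataPipe", IterDataPipe
        except ImportError:
            pass

    # torch is too old or torchdata is unavailable: use the classic base class.
    from torch.utils.data import IterableDataset
    return "IterableDataset", IterableDataset
```

A downstream library would call `resolve_iterable_base()` once at import time and subclass whatever it returns. With a standalone torchdata, the version check on torch could presumably be dropped.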
I'm working on a downstream library (Composer) that has the exact same issue -- we would love to allow users to build datasets with torchdata without hard requirements on the torch version.
I almost forgot one of the most important benefits of making TorchData standalone. Currently, when a change lands in PyTorch Core, any dependent change in TorchData has to wait until the PyTorch Core nightly release is updated. Conversely, when a BC-breaking change lands in PyTorch Core, we have to open another PR in TorchData to accommodate it and then make sure the nightly releases ship for both repos. Otherwise, there is a risk that our downstream libraries like TorchVision/TorchText have red CI for a while.
I want to track the list of PyTorch Core features we currently depend on:
- Dataset/IterableDataset
- Profiler
- default_collate_fn
- Sampler
The features that can be moved to TorchData easily:
- typing -> it is basically disabled for now
- MapDataPipe/IterDataPipe -> functional_datapipe
- graph utils
In order to make TD standalone, we may need to reverse a few dependencies from TD -> PyTorch to PyTorch -> TD.
- For BC, let DataLoader be able to handle DataPipe
- Question would be: do we want to add support for DataLoader2 to handle Dataset and IterableDataset?
Just wanted to chime in and express my support for this RFC. My current workflow is based around the JAX ecosystem, which doesn't provide data loading functionalities and instead redirects you to existing ones ("not reinventing the wheel"). Of all the data loading libraries I have tried, this is hands down the best and the most intuitive. I feel like people with workflows different from PyTorch can benefit from TorchData, but the dependency on torch can be a dealbreaker for some.
Just for the record, when decoupled, expecttest can be removed from our test dependencies.
100% support this. I primarily use torch but dang is the data loading library useful
I would love it if TorchData became standalone. It could be the go-to place for PyTorch Ecosystem libraries, since it essentially follows the same concept as torch.Dataset. All in all, it would be genuinely amazing if TorchData became standalone.
Rooting for this :)
Hey there, any plans to make it as a separate python package w/o torch deps?
I like the concept of the data pipes and could imagine using it more generally. PyTorch is a big enough project that, with its support, a stand-alone library could get a lot of community usage in other projects as well. As it is, with the heavy PyTorch dependency, it is impossible to use it in other projects, meaning the whole datapipes concept will be useful only for PyTorch.
So yes - I think making it stand-alone would be a great idea. And the few functions that do require pytorch could be optional dependencies.
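One way to read the "optional dependencies" suggestion is a lazy import: the core utilities stay pure Python, and torch is imported only inside the functions that need it. The sketch below is a generic illustration of that pattern, not TorchData's implementation; `require_torch`, `batched`, and `to_tensor_batch` are hypothetical names.

```python
import importlib


def require_torch(feature: str):
    """Import torch on demand, with a readable error if it is not installed."""
    try:
        return importlib.import_module("torch")
    except ImportError as exc:
        raise ImportError(
            f"{feature} needs the optional 'torch' dependency, which is not installed"
        ) from exc


def batched(iterable, size):
    """Pure-Python batching: yields lists of up to `size` items, no torch needed."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        # Emit the final, possibly shorter, batch.
        yield chunk


def to_tensor_batch(samples):
    """Torch-specific step: only this function triggers the torch import."""
    torch = require_torch("to_tensor_batch")
    return torch.tensor(samples)
```

With this layout, `batched` works in an environment without torch, while `to_tensor_batch` fails with a clear message only when it is actually called.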
I support removing the library's dependency on Torch. It would also be better if there were a shared-memory method for passing NumPy arrays, because spawned processes consume a lot of memory when using PyTorch with multiprocessing.
The abstraction of the library is good, and it would be better if it was generic enough.
I support this endeavor.
Instead of completely removing torch as a dependency, I would already welcome it if all parts of torchdata were contained in this package instead of being split between torchdata and torch.utils.data.
Hi, I would also be in favour of this
As @VIVelev above says, this would be exceptionally useful for the JAX ecosystem which has made an active choice not to support dataloading, e.g. their getting started tutorials use either:
- PT dataloaders: https://jax.readthedocs.io/en/latest/notebooks/Neural_Network_and_Data_Loading.html
- TF datasets: https://jax.readthedocs.io/en/latest/notebooks/neural_network_with_tfds_data.html
I really need torchdata as a standalone library! Can you please consider doing this?
Just throwing my two cents here - I'm also using TorchData for projects that do not use Torch so not having to import a huge package like Torch would be very helpful.