[RFC] Make the TorchData Library Standalone from PyTorch Core Library

Open NivekT opened this issue 2 years ago • 16 comments

🚀 The feature

Note that this is a request for comment; currently, there is no plan to make TorchData a standalone library. We would like to solicit feedback from the community.

Proposal: Make the TorchData library standalone with little to no dependency on the PyTorch Core library (i.e. torch).

Motivation, pitch

An argument for a standalone library is that it will allow users to use all the data loading functionality in this library without installing or using PyTorch. Datasets implemented using TorchData may become usable by other frameworks.

An argument against this change is that, in order to keep certain DataLoader functionalities backward compatible with DataPipes, the torch library may need to become dependent on TorchData instead.

The list of arguments here is not comprehensive. Feel free to leave a comment about potential use cases and how they will be impacted.

Alternatives

Leave the library as it is with dependency on torch.

Additional context

Please feel free to leave any comment/reaction to this proposal whether you are for or against this change. We'd like to hear from you!

cc: @VitalyFedyunin @ejguan @NivekT

NivekT avatar Mar 11 '22 20:03 NivekT

I would really like this on my end. I'm writing an RL framework, and part of the goal is to pickle a model and deploy it on a robotic system. I may or may not want torch actually installed there (maybe I'm using ONNX?), but I still want a pipeline for generically executing code.

josiahls avatar Mar 13 '22 13:03 josiahls

More feedback from a downstream library (Ray): they previously provided an interface for users to transform ray.Dataset into IterableDataset. As we plan to collaborate on switching it to IterDataPipe, they would have a hard requirement on torch=1.11.0. In order to offer flexible requirements, they need to figure out how to fall back to IterableDataset if torch doesn't meet the condition and torchdata doesn't exist in the environment.

If we move torchdata out-of-tree, with DataLoader2 and the related utilities all going into torchdata, we could potentially set a looser requirement on the torch version.
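A minimal sketch of the fallback decision described above, for illustration only. It inspects installed package versions without importing anything heavy and picks which base class a downstream library could build on. The helper names (`installed_version`, `major_minor`, `choose_backend`) and the "at least 1.11" comparison are assumptions made for this sketch, not code from Ray or TorchData.

```python
from importlib import metadata
from typing import Optional, Tuple

# Assumption for this sketch: IterDataPipe support is treated as needing
# torch 1.11 or newer, based on the version mentioned in the thread.
MIN_TORCH = (1, 11)


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def major_minor(version: str) -> Tuple[int, int]:
    """Parse '1.11.0+cu113' or '1.12.0.dev20220405' into (major, minor)."""
    parts = version.split("+")[0].split(".")
    return int(parts[0]), int(parts[1])


def choose_backend(torch_version: Optional[str],
                   torchdata_version: Optional[str]) -> str:
    """Pick a base class name given the versions found in the environment."""
    if torch_version is None:
        return "none"  # neither class is available without torch today
    if torchdata_version is not None and major_minor(torch_version) >= MIN_TORCH:
        return "IterDataPipe"
    return "IterableDataset"  # fall back to the torch.utils.data class


if __name__ == "__main__":
    backend = choose_backend(installed_version("torch"),
                             installed_version("torchdata"))
    print(f"selected backend: {backend}")
```

With a standalone torchdata, the first branch of `choose_backend` would no longer need to depend on torch being present at all.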

ejguan avatar Mar 16 '22 18:03 ejguan

I'm working on a downstream library (Composer) that has the exact same issue -- we would love to allow users to build datasets with torchdata without hard requirements on the torch version.

abhi-mosaic avatar Mar 16 '22 18:03 abhi-mosaic

I almost forgot one of the most important benefits of making TorchData standalone. Today, if a change lands in PyTorch Core, any dependent change in TorchData has to wait until the PyTorch Core nightly release is updated. Conversely, if there is a BC-breaking change in PyTorch Core, we have to open another PR in TorchData to accommodate it and then make sure the nightly releases ship for both repos. Otherwise, there is a risk that our downstream libraries like TorchVision/TorchText have red CI for a while.

ejguan avatar Apr 05 '22 22:04 ejguan

Tracking the list of features for which we currently depend on PyTorch Core:

  • Dataset/IterableDataset
  • Profiler
  • default_collate_fn
  • Sampler

The features that can be moved to TorchData easily:

  • typing -> it is basically disabled for now
  • MapDataPipe/IterDataPipe -> functional_datapipe
  • graph utils

In order to make TD standalone, we may need to reverse a few dependencies from TD -> PyTorch to PyTorch -> TD.

  • For BC, make DataLoader able to handle DataPipe
  • The question would be: do we want DataLoader2 to support Dataset and IterableDataset?

ejguan avatar Apr 12 '22 18:04 ejguan

Just wanted to chime in and express my support for this RFC. My current workflow is based around the JAX ecosystem, which doesn't provide data loading functionality and instead redirects you to existing libraries ("not reinventing the wheel"). Of all the data loading libraries I have tried, this is hands down the best and the most intuitive. I feel like people whose workflows are not based on PyTorch can benefit from TorchData, but the dependency on torch can be a dealbreaker for some.

VIVelev avatar Apr 25 '22 14:04 VIVelev

Just for the record, once decoupled, expecttest can be removed from our test dependencies.

ejguan avatar May 25 '22 16:05 ejguan

100% support this. I primarily use torch but dang is the data loading library useful

bushshrub avatar May 27 '22 11:05 bushshrub

I would love it if TorchData became standalone. It could be the go-to place for PyTorch Ecosystem libraries, since it is essentially along the same concept as torch.Dataset. All in all, it would be genuinely amazing if TorchData became standalone.

Rooting for this :)

Atharva-Phatak avatar Jun 14 '22 23:06 Atharva-Phatak

Hey there, any plans to make it a separate Python package w/o torch deps?

Red-Eyed avatar Nov 11 '22 11:11 Red-Eyed

I like the concept of data pipes and could imagine using it more generally. PyTorch is a big enough project that, with its support, a standalone library could get a lot of community usage in other projects as well. As it is, the heavy PyTorch dependency makes it impractical to use in other projects, meaning the whole datapipes concept will be useful only for PyTorch.

So yes, I think making it standalone would be a great idea. And the few functions that do require PyTorch could be optional dependencies.
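To illustrate the optional-dependency idea, here is a rough sketch, not TorchData's actual design: the core pipeline step is pure Python, and torch is imported lazily only by the one function that needs it. The names `optional_import`, `batched`, and `to_tensor` are hypothetical helpers invented for this example.

```python
import importlib
from types import ModuleType
from typing import Iterable, Iterator, List, Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Import a module if it is installed; return None otherwise."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def batched(items: Iterable, size: int) -> Iterator[List]:
    """Group items into lists of at most `size` elements (no torch needed)."""
    batch: List = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def to_tensor(batch: List):
    """Convert a batch to a tensor; this is the only step that needs torch."""
    torch = optional_import("torch")
    if torch is None:
        raise RuntimeError("to_tensor requires the optional 'torch' dependency")
    return torch.as_tensor(batch)
```

Code that only calls `batched` runs in an environment without torch, and the error from `to_tensor` tells the user which optional dependency to install.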

hhoeflin avatar Apr 25 '23 09:04 hhoeflin

I support the library removing its dependency on Torch. It would also be better if there were a shared-memory method for passing numpy arrays, because spawned processes consume a lot of memory when using PyTorch with multiprocessing.

The abstraction of the library is good, and it would be better if it was generic enough.

One-sixth avatar Apr 27 '23 06:04 One-sixth

I support this endeavor.

Instead of completely removing torch as a dependency, I would already welcome it if all parts of torchdata were contained in this package instead of being split between torchdata and torch.utils.data.

sehoffmann avatar May 30 '23 16:05 sehoffmann

Hi, I would also be in favour of this

As @VIVelev above says, this would be exceptionally useful for the JAX ecosystem which has made an active choice not to support dataloading, e.g. their getting started tutorials use either:

  • PT dataloaders: https://jax.readthedocs.io/en/latest/notebooks/Neural_Network_and_Data_Loading.html
  • TF datasets: https://jax.readthedocs.io/en/latest/notebooks/neural_network_with_tfds_data.html

julianmack avatar Jun 09 '23 08:06 julianmack

I really need torchdata as a standalone library! Can you please consider doing this?

taeefnajib avatar Jun 14 '23 02:06 taeefnajib

Just throwing in my two cents here - I'm also using TorchData for projects that do not use Torch, so not having to import a huge package like Torch would be very helpful.

seunggs avatar Jun 14 '23 15:06 seunggs