data icon indicating copy to clipboard operation
data copied to clipboard

[Experiment] Make S3Handler.s3_read return a stream rather than bytes

Open ejguan opened this issue 1 year ago • 2 comments

Originally I was expecting the returned stream from S3handler is non-seekable stream. But, it turns out that the whole archive/files will be dumped into memory based on the implementation (I might be wrong about it then I need someone to validate it)

  • C++ side: https://github.com/pytorch/data/blob/a435c7f5f7543e3614130eb2ecee520396b8efd2/torchdata/csrc/pybind/pybind.cpp#L28
  • Python side: https://github.com/pytorch/data/blob/a435c7f5f7543e3614130eb2ecee520396b8efd2/torchdata/datapipes/iter/load/s3io.py#L135 And, that is the reason that the performance seems on parity with or without BytesIO in this issue.

In order to make it streaming, we need to have a way to pybind C++ stream IO to python, which is non-trivial. See a code example: https://github.com/CadQuery/OCP/blob/master/pystreambuf.h

Potentially this change would accelerate data preprocessing. But, it needs to be extensively benchmarked.

ejguan avatar Sep 29 '22 18:09 ejguan

And, there is a use case that might affect the performance on S3FileLoader. If I do tarfile.open(fileobj=s3_stream_returned_from_s3fileloader, mode=m, bufsize=20000000240), the speed with mode r: is way faster than the mode r|

ejguan avatar Sep 29 '22 18:09 ejguan

cc: @ydaiming for confirmation about the files are dumped into memory rather than streaming from S3Handler.s3_read. And, do you want to see if there will be benefit to revamp it to an iostream?

ejguan avatar Sep 29 '22 18:09 ejguan