data
data copied to clipboard
[Experiment] Make S3Handler.s3_read return a stream rather than bytes
Originally I was expecting the returned stream from S3handler
is non-seekable stream. But, it turns out that the whole archive/files will be dumped into memory based on the implementation (I might be wrong about it then I need someone to validate it)
- C++ side: https://github.com/pytorch/data/blob/a435c7f5f7543e3614130eb2ecee520396b8efd2/torchdata/csrc/pybind/pybind.cpp#L28
- Python side: https://github.com/pytorch/data/blob/a435c7f5f7543e3614130eb2ecee520396b8efd2/torchdata/datapipes/iter/load/s3io.py#L135
And, that is the reason that the performance seems on parity with or without
BytesIO
in this issue.
In order to make it streaming, we need to have a way to pybind C++ stream IO to python, which is non-trivial. See a code example: https://github.com/CadQuery/OCP/blob/master/pystreambuf.h
Potentially this change would accelerate data preprocessing. But, it needs to be extensively benchmarked.
And, there is a use case that might affect the performance on S3FileLoader.
If I do tarfile.open(fileobj=s3_stream_returned_from_s3fileloader, mode=m, bufsize=20000000240)
, the speed with mode r:
is way faster than the mode r|
cc: @ydaiming for confirmation about the files are dumped into memory rather than streaming from S3Handler.s3_read
. And, do you want to see if there will be benefit to revamp it to an iostream?