Does torchdata already work with GCP and Azure blob storage?
🚀 The feature
We already have an S3 integration, and it seems the S3 API already works with both:
- Azure: https://devblogs.microsoft.com/cse/2016/05/22/access-azure-blob-storage-from-your-apps-using-s3-api/
- GCP: https://vamsiramakrishnan.medium.com/a-study-on-using-google-cloud-storage-with-the-s3-compatibility-api-324d31b8dfeb
Motivation, pitch
So ideally we can already support Azure and GCP without doing much extra work
Alternatives
Build a new integration for each of Azure and GCP using their native APIs
h/t: @chauhang for the idea
Technically speaking, torchdata already works with these cloud vendors via the `fsspec` DataPipes (see the sketch after this list):
- AWS: https://github.com/fsspec/s3fs
- Azure: https://github.com/fsspec/adlfs
- GCP: https://github.com/fsspec/gcsfs
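A minimal sketch of what that looks like, assuming the matching `fsspec` backend (`s3fs`, `adlfs`, or `gcsfs`) is installed and credentials are configured; the bucket/container names are placeholders:

```python
from torchdata.datapipes.iter import FSSpecFileLister, FSSpecFileOpener

# The same two DataPipes cover all three vendors; only the URL scheme changes.
# Bucket/container names below are placeholders.
roots = [
    "s3://my-bucket/data",      # AWS via s3fs
    "az://my-container/data",   # Azure Blob Storage via adlfs
    "gs://my-bucket/data",      # Google Cloud Storage via gcsfs
]

for root in roots:
    listing_dp = FSSpecFileLister(root)                 # lists objects under the prefix
    file_dp = FSSpecFileOpener(listing_dp, mode="rb")   # opens each object as a byte stream
    for path, stream in file_dp:
        print(path)
        break  # just peek at the first object per backend
```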
Have you by any chance observed any perf impact from using fsspec vs. the S3 integration? If not, then agreed, fsspec is a good option and we just need to spend some time authoring a tutorial.
After the observation on the performance regression last time, I didn't get a chance to take a deeper look at the culprit. But I discussed with @ydaiming earlier, and he claimed that the S3 integration works better on archive files, but not on small files, compared to `boto3` (`boto3` is the internal implementation of `fsspec`).
Overall, in some cases `fsspec` does provide a benefit to our users. So, adding more detailed instructions for `fsspec` and discussing the perf impact by file type might be a good step for now.
I am going to take a quick look into `fsspec` vs `s3` performance in my benchmark.
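For reference, a rough sketch of how such a comparison could be set up; this is not the actual benchmark, the shard URLs are placeholders, and `S3FileLoader` assumes a torchdata build with the native S3 extension enabled:

```python
import time

from torchdata.datapipes.iter import FSSpecFileOpener, IterableWrapper, S3FileLoader

# Placeholder shard URLs; in practice this would point at the real dataset.
URLS = [
    "s3://my-bucket/shards/shard-000.tar",
    "s3://my-bucket/shards/shard-001.tar",
]


def drain(dp):
    """Read every stream in the datapipe to completion; return (seconds, bytes read)."""
    start = time.perf_counter()
    total = 0
    for _, stream in dp:
        while True:
            chunk = stream.read(1 << 20)  # 1 MiB reads
            if not chunk:
                break
            total += len(chunk)
    return time.perf_counter() - start, total


# fsspec-backed pipeline (needs s3fs installed)
fsspec_dp = FSSpecFileOpener(IterableWrapper(URLS), mode="rb")
# native S3 pipeline (needs torchdata built with the S3 extension)
native_dp = S3FileLoader(IterableWrapper(URLS))

print("fsspec :", drain(fsspec_dp))
print("native :", drain(native_dp))
```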
My benchmark shows that using `FSSpecFileOpener` is faster, and it also provides the ability to stream (rather than downloading a whole archive into memory before reading).
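To illustrate the streaming path, a hedged sketch (the archive URL is a placeholder): `FSSpecFileOpener` hands a remote, seekable file object to the tar loader, so bytes are fetched in ranges on demand instead of the whole archive being downloaded into memory first.

```python
from torchdata.datapipes.iter import FSSpecFileOpener, IterableWrapper

# Placeholder archive URL on S3; az:// or gs:// would work the same way,
# as long as the corresponding fsspec backend is installed.
dp = IterableWrapper(["s3://my-bucket/shards/shard-000.tar"])
dp = FSSpecFileOpener(dp, mode="rb")   # remote byte stream, no full download up front
dp = dp.load_from_tar()                # functional form of TarArchiveLoader
for member_name, member_stream in dp:
    payload = member_stream.read()     # bytes are fetched from the store as needed
    print(member_name, len(payload))
```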
Our benchmarking results show that even for archives (large files), `fsspec` performs better than the current implementation of `S3Handler`. I suspect this is caused by the downloading behavior. See: https://github.com/pytorch/data/issues/800
cc: @ydaiming
Since #812 and #836 have landed, I believe users should be able to use GCP and Azure Blob storage. Please feel free to re-open this issue or open a new issue if additional features are required. Thanks!