Does torchdata already work with GCP and Azure blob storage?
🚀 The feature
We already have an S3 integration, and it seems the S3 API already works with both:
- Azure: https://devblogs.microsoft.com/cse/2016/05/22/access-azure-blob-storage-from-your-apps-using-s3-api/
- GCP: https://vamsiramakrishnan.medium.com/a-study-on-using-google-cloud-storage-with-the-s3-compatibility-api-324d31b8dfeb
Motivation, pitch
So ideally we can already support Azure and GCP without doing much extra work
Alternatives
Build a new integration for each of Azure and GCP using their native APIs
h/t: @chauhang for the idea
Technically speaking, torchdata already works with these cloud vendors via the `fsspec` DataPipes (see the sketch after this list):
- AWS: https://github.com/fsspec/s3fs
- Azure: https://github.com/fsspec/adlfs
- GCP: https://github.com/fsspec/gcsfs
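A minimal sketch of what that looks like, assuming the matching `fsspec` backend (`s3fs`, `adlfs`, or `gcsfs`) is installed and credentials are configured; the bucket/container names are placeholders:

```python
from torchdata.datapipes.iter import FSSpecFileLister, FSSpecFileOpener

# The same two DataPipes cover all three vendors; only the URL scheme changes.
# Bucket/container names below are placeholders.
roots = [
    "s3://my-bucket/data",      # AWS via s3fs
    "az://my-container/data",   # Azure Blob Storage via adlfs
    "gs://my-bucket/data",      # Google Cloud Storage via gcsfs
]

for root in roots:
    listing_dp = FSSpecFileLister(root)                 # lists objects under the prefix
    file_dp = FSSpecFileOpener(listing_dp, mode="rb")   # opens each object as a byte stream
    for path, stream in file_dp:
        print(path)
        break  # just peek at the first object per backend
```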
Have you by any chance observed any perf impact from using fsspec vs. the S3 integration? If not, then agreed, fsspec is a good option and we just need to spend some time authoring a tutorial.
After the observation on the performance regression last time, I didn't get a chance to take a deeper look at the culprit. But I discussed with @ydaiming earlier, and he claimed that the S3 integration works better on archive files, but not on small files, compared to `boto3` (`boto3` is the internal implementation of `fsspec`).
Overall, in some cases `fsspec` does provide a benefit to our users. So, adding more detailed instructions for `fsspec` and discussing the perf impact by file type might be a good step for now.
I am going to take a quick look into `fsspec` vs `s3` performance in my benchmark.
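For reference, a rough sketch of how such a comparison could be set up; this is not the actual benchmark, the shard URLs are placeholders, and `S3FileLoader` assumes a torchdata build with the native S3 extension enabled:

```python
import time

from torchdata.datapipes.iter import FSSpecFileOpener, IterableWrapper, S3FileLoader

# Placeholder shard URLs; in practice this would point at the real dataset.
URLS = [
    "s3://my-bucket/shards/shard-000.tar",
    "s3://my-bucket/shards/shard-001.tar",
]


def drain(dp):
    """Read every stream in the datapipe to completion; return (seconds, bytes read)."""
    start = time.perf_counter()
    total = 0
    for _, stream in dp:
        while True:
            chunk = stream.read(1 << 20)  # 1 MiB reads
            if not chunk:
                break
            total += len(chunk)
    return time.perf_counter() - start, total


# fsspec-backed pipeline (needs s3fs installed)
fsspec_dp = FSSpecFileOpener(IterableWrapper(URLS), mode="rb")
# native S3 pipeline (needs torchdata built with the S3 extension)
native_dp = S3FileLoader(IterableWrapper(URLS))

print("fsspec :", drain(fsspec_dp))
print("native :", drain(native_dp))
```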
My benchmark shows that using `FSSpecFileOpener` is faster, and it also provides the ability to stream (rather than downloading a whole archive into memory before reading).
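To illustrate the streaming path, a hedged sketch (the archive URL is a placeholder): `FSSpecFileOpener` hands a remote, seekable file object to the tar loader, so bytes are fetched in ranges on demand instead of the whole archive being downloaded into memory first.

```python
from torchdata.datapipes.iter import FSSpecFileOpener, IterableWrapper

# Placeholder archive URL on S3; az:// or gs:// would work the same way,
# as long as the corresponding fsspec backend is installed.
dp = IterableWrapper(["s3://my-bucket/shards/shard-000.tar"])
dp = FSSpecFileOpener(dp, mode="rb")   # remote byte stream, no full download up front
dp = dp.load_from_tar()                # functional form of TarArchiveLoader
for member_name, member_stream in dp:
    payload = member_stream.read()     # bytes are fetched from the store as needed
    print(member_name, len(payload))
```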
Our benchmarking results show that even for archives (large files), `fsspec` performs better than the current implementation of `S3Handler`. I suspect this is caused by the downloading behavior. See: https://github.com/pytorch/data/issues/800
cc: @ydaiming
Since #812 and #836 have landed, I believe users should be able to use GCP and Azure Blob storage. Please feel free to re-open this issue or open a new issue if additional features are required. Thanks!