Does torchdata already work with GCP and Azure blob storage

Open msaroufim opened this issue 2 years ago • 6 comments

🚀 The feature

We already have an S3 integration, and it seems the S3 API already works with both:

  • Azure: https://devblogs.microsoft.com/cse/2016/05/22/access-azure-blob-storage-from-your-apps-using-s3-api/
  • GCP: https://vamsiramakrishnan.medium.com/a-study-on-using-google-cloud-storage-with-the-s3-compatibility-api-324d31b8dfeb

Motivation, pitch

So ideally we can already support Azure and GCP without much extra work.

Alternatives

Build a new integration for each of Azure and GCP using their native APIs

h/t: @chauhang for the idea

msaroufim avatar Sep 27 '22 21:09 msaroufim

Technically speaking, with the fsspec DataPipes, torchdata already works with these cloud vendors:

  • AWS: https://github.com/fsspec/s3fs
  • Azure: https://github.com/fsspec/adlfs
  • GCP: https://github.com/fsspec/gcsfs
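
As a rough sketch of what this could look like in practice: the helper names `cloud_url` and `open_cloud_files` below are hypothetical, and the code assumes torchdata 0.5 or later (where the fsspec DataPipes forward keyword arguments to the filesystem) plus the matching backend package (s3fs, adlfs, or gcsfs) being installed. The protocol prefixes are the ones those backends register with fsspec.

```python
# Protocol prefixes registered with fsspec by s3fs, adlfs, and gcsfs respectively.
FSSPEC_PROTOCOLS = {"aws": "s3", "azure": "abfs", "gcp": "gcs"}


def cloud_url(vendor, container, path=""):
    """Build an fsspec URL such as 'gcs://my-bucket/data/train' (hypothetical helper)."""
    protocol = FSSPEC_PROTOCOLS[vendor]
    return f"{protocol}://{container}/{path.lstrip('/')}"


def open_cloud_files(url, **storage_options):
    """Return a DataPipe yielding (path, binary stream) pairs for files under `url`.

    Assumption: torchdata >= 0.5, so that `storage_options` (credentials etc.)
    are passed through to the underlying fsspec filesystem.
    """
    # Imported lazily so the URL helper above works without torchdata installed.
    from torchdata.datapipes.iter import IterableWrapper

    dp = IterableWrapper([url]).list_files_by_fsspec(**storage_options)
    return dp.open_files_by_fsspec(mode="rb", **storage_options)
```

For Azure, a call might look like `open_cloud_files(cloud_url("azure", "my-container", "images"), account_name="...", account_key="...")`, where the container name and credentials are placeholders.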

ejguan avatar Sep 28 '22 13:09 ejguan

Have you by any chance observed any perf impact from using fsspec vs. the S3 integration? If not, then agreed: fsspec is a good option and we just need to spend some time authoring a tutorial.

msaroufim avatar Sep 28 '22 15:09 msaroufim

After observing the performance regression last time, I didn't get a chance to take a deeper look at the culprit. But I discussed it with @ydaiming earlier, and he claimed that, compared to boto3, the S3 integration works better on archive files but not on small files (boto3 is the internal implementation of fsspec).

Overall, in some cases, fsspec does provide a benefit to our users. So adding more detailed instructions for fsspec and discussing how the perf impact depends on the type of files might be a good step for now.

ejguan avatar Sep 28 '22 18:09 ejguan

I am going to take a quick look at fsspec vs. S3 performance in my benchmark.

NivekT avatar Sep 28 '22 22:09 NivekT

My benchmark shows that using FSSpecFileOpener is faster and it also provides the ability to stream (rather than downloading a whole archive into memory before reading).
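
To illustrate the streaming point in isolation, here is a minimal standard-library sketch, not torchdata's or fsspec's implementation. It reads a tar archive in `tarfile`'s stream mode (`"r|"`) from an object that only supports sequential `read()`, the way a remote stream would. `ForwardOnlyReader`, `make_tar_bytes`, and `stream_members` are hypothetical names, and an in-memory buffer stands in for the remote file.

```python
import io
import tarfile


def make_tar_bytes(files):
    """Build an in-memory tar archive from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


class ForwardOnlyReader:
    """File-like object exposing only sequential read(), like a network stream."""

    def __init__(self, data):
        self._raw = io.BytesIO(data)

    def read(self, size=-1):
        return self._raw.read(size)


def stream_members(fileobj):
    """Yield (name, content) for each regular file, reading strictly front to back."""
    # Mode "r|" makes tarfile treat fileobj as a non-seekable stream of tar blocks.
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if member.isfile():
                # In stream mode a member must be read before advancing to the next.
                yield member.name, tar.extractfile(member).read()
```

Because members are consumed one at a time, only the current member's bytes need to be held in memory, rather than the whole archive.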

NivekT avatar Sep 30 '22 15:09 NivekT

> My benchmark shows that using FSSpecFileOpener is faster and it also provides the ability to stream (rather than downloading a whole archive into memory before reading).

Our benchmarking results show that even for archives (large files), fsspec performs better than the current implementation of S3Handler. I suspect this is caused by the downloading behavior. See: https://github.com/pytorch/data/issues/800 cc: @ydaiming

ejguan avatar Sep 30 '22 15:09 ejguan

Since #812 and #836 have landed, I believe users should be able to use GCP and Azure Blob storage. Please feel free to re-open this issue or open a new issue if additional features are required. Thanks!

NivekT avatar Oct 20 '22 17:10 NivekT