Add support for Distributed loading on GCS buckets
Context
The company I work for relies heavily on pandas and Google Cloud Storage (GCS) buckets. We were interested in using Modin to speed up loading large datasets and data processing tasks.
Problem
However, calling Modin's read_csv() with GCS bucket URLs (which use the gs:// scheme) seems to be quite slow. Loading from local storage or S3 buckets works quite well, but because gs:// is not supported, Modin defaults to pandas, which ends up being slower than using the vanilla pandas library directly!
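For anyone who wants to reproduce the comparison, here is a rough timing sketch. It is not part of Modin: `time_read` and `compare_readers` are hypothetical helper names, and it assumes pandas, Modin, and gcsfs are installed so that gs:// URLs can be opened at all.

```python
import time
from typing import Any, Callable, Dict, Tuple


def time_read(reader: Callable[[str], Any], path: str) -> Tuple[Any, float]:
    """Call reader(path) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = reader(path)
    return result, time.perf_counter() - start


def compare_readers(path: str) -> Dict[str, float]:
    """Time pandas.read_csv against modin.pandas.read_csv on the same path."""
    # Imported lazily so the timing helper above works without these packages.
    import pandas
    import modin.pandas as mpd

    _, pandas_seconds = time_read(pandas.read_csv, path)
    _, modin_seconds = time_read(mpd.read_csv, path)
    return {"pandas": pandas_seconds, "modin": modin_seconds}
```

Calling `compare_readers("gs://my-bucket/large.csv")` (the bucket and file name are placeholders) and then repeating the call with a local or s3:// copy of the same file should show whether the gs:// path is the slow one. A single run is noisy, so repeating it a few times is advisable.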
Feature Request
We are requesting support for distributed loading from GCS buckets so that we aren't limited to S3 or local file storage. This matters to us because the company I work for uses Google products quite a lot and does not use Amazon offerings.
Connected with #4742
@Anando304 thanks for opening the issue! This is a good point and we haven't properly addressed it yet (for reasons I don't know exactly). I'll work on a PR to add support for gs://.