fugue
[FEATURE] Enable loading of list of files into dask backend
Is your feature request related to a problem? Please describe. Dask's read_parquet (as well as other loaders, like read_csv) can accept a list of files. This allows multiple Dask distributed workers to each load a partition of the data, without the data ever being loaded into client memory in its entirety. This is important for datasets that cannot fit into a single machine's memory.
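For reference, a minimal sketch of what Dask already supports today (bucket and file names are placeholders):

import dask.dataframe as dd

# Dask accepts a list of paths; each worker reads its own partition(s),
# so the full dataset never has to fit in the client's memory.
df = dd.read_parquet([
    "s3://bucketname/partition1.parq",
    "s3://bucketname/partition2.parq",
])
print(len(df))  # row count, computed across the workers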
Fugue SQL supports a LOAD statement that passes this through to the Dask backend. Unfortunately, this statement does not accept a list of files.
Describe the solution you'd like Allow loading of multiple Parquet files by passing a list. For example:
df = LOAD ["s3://bucketname/partition1.parq", "s3://bucketname/partition2.parq"]
SELECT count(*) FROM df;
This would tell dask-distributed to load the two files, one per worker, and perform the SELECT on them in parallel.
Describe alternatives you've considered The current workaround is passing a glob pattern to Dask. This works for some use cases, but not all of them.
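A sketch of that workaround at the Dask level (the path pattern is a placeholder); the same glob can be passed through LOAD, but it only helps when the target files share a common naming pattern:

import dask.dataframe as dd

# Glob workaround: covers files that match a pattern, but cannot express
# an arbitrary, explicit list of paths.
df = dd.read_parquet("s3://bucketname/partition*.parq")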
Additional context Slack thread