
[FEATURE] Enable loading of list of files into dask backend

Open · inc0 opened this issue 2 years ago · 0 comments

Is your feature request related to a problem? Please describe.
Dask's read_parquet (as well as other loaders, like read_csv) can accept a list of files. This lets each Dask distributed worker load a partition of the data directly, without the data ever being loaded into the client's memory in its entirety. This is important for datasets that can't fit into a single machine's memory.
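
For illustration, a minimal sketch of reading an explicit list of parquet files with Dask directly; the bucket and file names are the placeholders from this issue, not real paths:

import dask.dataframe as dd

# Reading an explicit list of files: each file maps to one or more
# partitions, and distributed workers read their partitions straight
# from storage, so the full dataset never passes through the client.
df = dd.read_parquet(
    ["s3://bucketname/partition1.parq", "s3://bucketname/partition2.parq"]
)
print(df.npartitions)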

Fugue SQL supports a LOAD statement that passes this through to the Dask backend. Unfortunately, this statement doesn't accept a list of files.
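
For reference, a single-file LOAD run on the Dask engine looks roughly like this. This is a sketch, assuming the fsql entry point from the fugue_sql module and the registered "dask" execution engine; only a single path can be given today:

from fugue_sql import fsql

# A single path works today; the request is to also accept a
# list of paths in the same position.
fsql("""
df = LOAD "s3://bucketname/partition1.parq"
SELECT count(*) FROM df
PRINT
""").run("dask")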

Describe the solution you'd like
Allow loading of multiple parquet files by passing a list. For example:

df = LOAD ["s3://bucketname/partition1.parq", "s3://bucketname/partition2.parq"]
SELECT count(*) FROM df;

This would tell dask-distributed to load the two files, one per worker, and to run the SELECT on them in parallel.

Describe alternatives you've considered
The current workaround is passing a glob pattern to Dask. This works for some use cases, but not all of them; see the sketch below.
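
For example, a glob covers files that share a naming pattern, but it cannot express an arbitrary hand-picked subset of files, which is what the explicit-list form would allow (again a sketch with placeholder paths):

import dask.dataframe as dd

# Glob workaround: fine when the wanted files share a pattern.
df_all = dd.read_parquet("s3://bucketname/*.parq")

# But a specific subset (e.g. only two particular partitions) has no
# glob equivalent; an explicit list of paths would cover that case.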

Additional context
Slack thread

inc0 · Oct 25 '22 21:10