dask-sql
dask-sql copied to clipboard
[DF] select * limit 5 seems does a full scan
I'm struggling to find a programmatic reproducer for this, but on the datafusion-sql-planner branch:
c.sql("SELECT * FROM large_table limit 5")
results in reading the entire dataset before filtering at the end, instead of reading from a single partition.
Less reproducible, but from the daily weather data:
res = c.sql("select * from weather limit 5")
io_layer = list(res.dask.layers.keys())[0]
partitions = len(list(res.dask.layers[io_layer].keys()))
partitions
445
If I understand layers correctly, my select w/ limit statement is reading all partitions in the dataset.