dask-sql icon indicating copy to clipboard operation
dask-sql copied to clipboard

[DF] Any query containing ORDER BY immediately executes

Open randerzander opened this issue 3 years ago • 1 comments

import pandas as pd
from dask_sql import Context

c = Context()
df = pd.DataFrame({"id": [0, 1, 2]})
c.create_table("df", df)

# returns a DataFrame
c.sql("select * from df")

# returns a DataFrame _and_ immediately executes the DAG
c.sql("select * from df order by id")

randerzander avatar Jul 13 '22 18:07 randerzander

This is by design where we persist the dataframe before sorting and execute part of the dag before this op. https://github.com/dask-contrib/dask-sql/blob/main/dask_sql/physical/rel/logical/sort.py#L37

The rationale behind this is that order by is often one of the last operations processed in a query, and not persisting the data usually results in dask executing the whole query leading up to the sort twice. Once during dag generation to compute divisions, and once after when actually persisting/computing. To prevent the duplicate compute, dask-sql persists before sorting.

If you prefer being able to configure this behavior regardless of performance implications, one option could be to add a config option to customize this behavior.

@charlesbluca Curious if you have thoughts here

ayushdg avatar Jul 13 '22 18:07 ayushdg