[BUG] `dask-sql` fails to import with `dask-expr` enabled
What happened:
dask-sql fails to import when trying to use it with dask-expr because dask-expr does not implement dd.Aggregation, which is subclassed in multiple places. We plan to use dask-expr as the default for the dask.dataframe API soon (see discussion issue here: https://github.com/dask/dask/issues/10934), so this will cause problems in the future.
Minimal Complete Verifiable Example:
$ export DASK_DATAFRAME__QUERY_PLANNING=True
$ python
>>> import dask_sql
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_sql/__init__.py", line 8, in <module>
from .cmd import cmd_loop
File "/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_sql/cmd.py", line 26, in <module>
from dask_sql.context import Context
File "/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_sql/context.py", line 43, in <module>
from dask_sql.physical.rel import RelConverter, custom, logical
File "/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_sql/physical/rel/logical/__init__.py", line 1, in <module>
from .aggregate import DaskAggregatePlugin
File "/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_sql/physical/rel/logical/aggregate.py", line 24, in <module>
class ReduceAggregation(dd.Aggregation):
TypeError: function() argument 'code' must be code, not str
Environment:
- dask-sql version:
main(https://github.com/dask-contrib/dask-sql/commit/93bb1e5bfc611a5987ed26af23cc80c5b0a23324)
Update: The problem mentioned above has been resolved via https://github.com/dask/dask/pull/10947 and https://github.com/dask-contrib/dask-expr/pull/893.
There is now a new error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_sql/__init__.py", line 8, in <module>
from .cmd import cmd_loop
File "/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_sql/cmd.py", line 26, in <module>
from dask_sql.context import Context
File "/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_sql/context.py", line 43, in <module>
from dask_sql.physical.rel import RelConverter, custom, logical
File "/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_sql/physical/rel/logical/__init__.py", line 5, in <module>
from .filter import DaskFilterPlugin
File "/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_sql/physical/rel/logical/filter.py", line 11, in <module>
from dask_sql.physical.utils.filter import attempt_predicate_pushdown
File "/opt/homebrew/Caskroom/mambaforge/base/envs/tpch/lib/python3.11/site-packages/dask_sql/physical/utils/filter.py", line 309, in <module>
M.fillna: dd._Frame.fillna,
^^^^^^^^^^^^^^^^
AttributeError: 'function' object has no attribute 'fillna'
@charlesbluca , @rjzamora: Fixing the usage of dd._Frame shouldn't be much work. Are you aware of any other issues (e.g., other internals dask-sql relies upon) that will get in the way of dask-sql supporting dask-expr? I'm trying to scope the required effort here.
Thanks for scoping this out @hendrikmakait !
My understanding is that dask-sql uses a lot of custom HLG-optimization code to implement predicate pushdown. If dask-sql can use simpler dask-expr logic for this, then most of that code can probably go away (need to take a careful look soon to say for sure). I know @phofl found that the TPCh benchmarks were a bit faster without predicate pushdown, but it would be nice if users/dask-sql could still "opt in".
@charlesbluca @ayushdg - How high of a priority is it to migrate dask-sql's predicate-pushdown logic to dask-expr? It shouldn't be too much work, but it would also be good to know if simply disabling predicate pushdown is a reasonable temporary workaround?
FYI, we have now flipped the switch and dask-expr has become the default dataframe engine (https://github.com/dask/dask/pull/10967).
Closed with #1319