[QST] Should dask-sql maintain compatibility on independent clusters?
What is your question?
One advertised feature of dask-sql is its ability to compute queries on relatively bare Dask clusters - to back this claim up, we run a majority of our tests on an independent Dask cluster started in a Docker container https://github.com/dask-contrib/dask-sql/blob/main/.github/docker-compose.yaml, with little added to the base environment outside of a bump to the pandas version.
However, through work being done by @ayushdg on #398, we now know that:
- The independent cluster tests quietly end up falling back on a local Dask cluster with access to dask-sql's CI environment, meaning we haven't been covering independent cluster usage for some time
- When tests do actually run on an independent cluster, some fail with `ModuleNotFoundError`s; for example, `test_fsql` fails because it requires `triad` to be installed
This has led me to question whether we should continue running these tests in the first place: is it a common usage pattern within Dask to compute tasks on a cluster with a minimal environment? If not, perhaps this isn't really worth the dev effort to maintain as a feature.
If we do opt to continue maintaining this feature, we would probably want to:
- make changes to the cluster CI so that tests can verify that they're running on the independent cluster, so that breakage here won't happen silently
- scope out the changes that need to be made so that tests like `test_fsql` don't require `triad` on their computing cluster
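As a rough illustration of the first point, a fixture-level guard might look something like the sketch below. It rests on several assumptions: that the CI signals the independent cluster through an environment variable (the name `DASK_SQL_TEST_SCHEDULER` is assumed here), and that a `distributed.Client` connected by address has its `cluster` attribute set to `None`, whereas a client that spun up its own local cluster does not. The helper names are hypothetical.

```python
import os

# Assumed name of the environment variable the CI uses to point the tests
# at the independent scheduler; adjust to whatever the CI actually sets.
ENV_VAR = "DASK_SQL_TEST_SCHEDULER"


def independent_cluster_requested(env=None):
    """Return True if the environment asks for an independent cluster."""
    env = os.environ if env is None else env
    return bool(env.get(ENV_VAR))


def assert_uses_independent_cluster(client, env=None):
    """Fail loudly instead of silently falling back to a local cluster.

    Assumption: a client attached to a remote scheduler by address has
    `client.cluster` set to None, while a client that created its own
    local cluster holds a reference to that cluster object.
    """
    if not independent_cluster_requested(env):
        return  # plain local run, nothing to verify
    if getattr(client, "cluster", None) is not None:
        raise RuntimeError(
            f"{ENV_VAR} is set, but the client is backed by a locally "
            "created cluster rather than the independent one"
        )
```

Calling `assert_uses_independent_cluster(client)` once from the session fixture that creates the client would turn a silent fallback into an immediate test failure.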
On the other hand, if we wanted to remove it, some important things to do would be:
- make sure any references to this in the docs are removed
- potentially revert changes made to enable this feature (for example, the use of `make_pickable_without_dask_sql`)
- remove CI running on independent clusters (though it might be useful to have CI running on a cluster in general)
Interested to hear others' thoughts on this.
cc @quasiben @randerzander
It feels worthwhile to try to allow Dask-SQL to work as a client-only library, in particular since it presently involves a fairly large JDK component.
If, on further investigation, you find that it's a big effort to support that, I could be convinced otherwise, especially if Dask-SQL's packaging can be simplified such that only the client node needs the JDK
I would suggest we keep it very minimal: we do want to validate that it works as a client-only library, but changing / testing the existence of third-party libraries seems out of scope. Dask clusters typically assume a uniform environment, and we should be safe making the same assumptions here.
I think maintaining dask-sql as a client-only library makes sense and shouldn't be too difficult to continue (right now, that is still the case; there are just some optional dependencies that may be required in the cluster to run certain queries). My question is more about whether it's worth going a step further and also verifying that queries run successfully on a cluster with almost none of the dependencies of dask-sql, which seems like a more niche case, though @quasiben's response seems to bring some clarity to that:
> changing / testing the existence of third party libraries seems out of scope
In that case, perhaps the solution we have right now is ideal, i.e. skipping / xfailing tests on the independent cluster if we know they will require some additional modules, and generally just making sure that a relevant subset of the tests are able to pass on a cluster without dask-sql installed. I'm also interested in whether we could reflect this somehow in the docs ("dask-sql queries can typically run on a minimal cluster, but if you want to use ABC feature you'll also need to have XYZ package").
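For the skipping side of this, one option might be to probe the workers for the modules a test needs and skip when any are absent, rather than hard-coding the list of tests. The sketch below is only an illustration under assumptions: the helper names are hypothetical, it only handles top-level module names, and it relies on `Client.run` executing a function on every worker and returning a dict keyed by worker address.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this process."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_on_workers(client, names):
    """Run the probe on every worker.

    Returns a dict mapping each worker address to the list of modules
    that worker is missing (empty list if it has everything).
    """
    return client.run(missing_modules, list(names))


def any_worker_missing(client, names):
    """Return True if at least one worker lacks at least one of the modules."""
    return any(missing_on_workers(client, names).values())
```

A test such as `test_fsql` could then call `pytest.skip(...)` when `any_worker_missing(client, ["triad"])` is true. One caveat: the probe function itself has to be serializable to the workers, so it may need to live somewhere that gets pickled by value rather than by reference to a module the workers don't have.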
If we maintain independent cluster testing, we'll also want to add in some basic sanity checks to verify that we're actually using the cluster, since that's what led to this situation in the first place (cc @ayushdg as I recall you having some ideas on how we could do this).
> "dask-sql queries can typically run on a minimal cluster, but if you want to use ABC feature you'll also need to have XYZ package"
Do we presently know which features that includes? Or are you suggesting we need to do testing work to build such a list?
Closing this as we've generally followed suit with Dask in pushing for a homogeneous client/cluster environment.