
[QST] To what extent should Dask-SQL's output match PostgreSQL?

Open · charlesbluca opened this issue on Nov 02 '21 · 4 comments

What is your question? Dask-SQL uses many workarounds in Dask/Pandas to return results that generally mirror PostgreSQL; these workarounds often have significant performance impacts. Some that I've worked on:

  • The custom sorting functions handle multiple ascending/descending booleans and null positions in a multi-column sort_values (a rough sketch of the limitation this works around follows this list)
  • get_groupby_with_null_cols is used to push null values to the front of a groupby agg result, where they would normally be last or omitted
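
For context, here is a minimal pandas-level sketch, not dask-sql's actual code and with made-up column names, of why per-column null placement needs a workaround: sort_values accepts a list of ascending flags but only a single na_position for all columns, so something like "ORDER BY a ASC NULLS FIRST, b DESC NULLS LAST" has to be emulated, e.g. with helper key columns:

```python
import pandas as pd

# Hypothetical illustration, not dask-sql's actual implementation: emulate
# "ORDER BY a ASC NULLS FIRST, b DESC NULLS LAST" even though sort_values
# only takes one na_position that applies to every column.
df = pd.DataFrame({"a": [2, None, 1, 2], "b": [None, 5, 3, 4]})

# Boolean "is null" helper columns let null placement differ per column:
# sorting the indicator descending puts nulls first, ascending puts them last.
keys = df.assign(a_null=df["a"].isna(), b_null=df["b"].isna())
result = keys.sort_values(
    by=["a_null", "a", "b_null", "b"],
    ascending=[False, True, True, False],  # a ASC NULLS FIRST, b DESC NULLS LAST
).drop(columns=["a_null", "b_null"])
print(result)
```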

As seen in https://github.com/dask-contrib/dask-sql/pull/273, removing workarounds where possible can greatly improve performance, though it comes at the cost of moving Dask-SQL's behavior further away from what users expect from PostgreSQL. With this in mind, I think it would be worthwhile to have a discussion about how much we should prioritize expected SQL behavior relative to performance.

As I see it, there are a few ways we could handle this question:

  1. Value performance over behavior, always: this would entail removing workarounds altogether, and relaxing/removing tests that compare Dask-SQL's output to another SQL engine.
  2. Value performance/behavior on a case-by-case basis: where possible, we can use default-value kwargs to decide whether we want to use a more performant approach at the cost of SQL behavior (see the hypothetical sketch after this list); this works as a "best of both worlds" solution, but is limited in that we might not be able to pass kwargs through certain SQL queries. https://github.com/dask-contrib/dask-sql/pull/286 is exploring passing kwargs through the Context, but in many situations we would want kwargs at the Context.sql level.
  3. Value behavior over performance, always: this would entail seeking out the most performant way to do the required workarounds, leaving tests intact.
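
To make option 2 a bit more concrete, here is a minimal pandas-level sketch of the default-value-kwarg idea, using the GROUP BY null-group difference as the example; the function and keyword names are invented for illustration and are not dask-sql's API:

```python
import pandas as pd

# Hypothetical sketch of option 2 (names are made up, not dask-sql's API):
# PostgreSQL's GROUP BY keeps a group for NULL keys, while pandas drops it
# by default, so the PostgreSQL-faithful path is the default and callers
# who care about speed can opt out per call.
def group_sum(df, key, value, postgres_compatible=True):
    # dropna=False keeps the NULL group, mirroring PostgreSQL;
    # dropna=True is pandas' default, which silently drops it.
    return df.groupby(key, dropna=not postgres_compatible)[value].sum()

df = pd.DataFrame({"k": ["x", None, "x"], "v": [1, 2, 3]})
print(group_sum(df, "k", "v"))                             # includes the NULL group
print(group_sum(df, "k", "v", postgres_compatible=False))  # drops it
```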

I'm generally in favor of 2 in the long term, especially if we establish a way to pass kwargs to individual SQL queries. In the short term, however, I'm less certain: since my background (along with that of the various RAPIDS folks participating in this project) is in performant data science, I feel a little biased in my preference to value performance over behavior.

What do others think?

charlesbluca · Nov 02 '21

I think option 2 is the realistic long-term solution that pretty much all engines take, and it's a matter of deciding how to draw the lines. As someone from the outside looking in, my question would be whether we think of Dask-SQL as an implementation of PostgreSQL or as an implementation of SQL. If the latter, it might be helpful to look at how compliant other SQL engines are. This isn't a problem unique to us, and every dialect of SQL has its own quirks. Most engines also have a section somewhere in their documentation where they discuss their (non-)conformance, so that might also be an approach that we should take. That would probably be a lot less restrictive than trying to match PostgreSQL exactly.

vyasr · Nov 02 '21

Generally, I think people see a SQL project and assume initially that it implements "all of SQL".

For example, when we first started integrating GPU support, devs here began by trying to do inserts, updates, and deletes, even though Dask (and Dask-SQL) aren't data storage applications but rather data processing engines.

Users learn, either by testing a given SQL implementation or by reading its documentation, that there are differences both from ANSI SQL and from other SQL implementations they've used in the past. So I think differences are expected and accepted, especially if documented.

That said, if a SQL implementation is missing an important semantic (NULLS FIRST ordering, for example), users will raise issues and feature requests, where performance tradeoffs and configuration options can be discussed on a case-by-case basis.

All that buildup, to me, supports the idea that differing from "standard SQL" is fine. The existence of projects like Apache Hive, Apache Spark (SQL), etc. implies to me that the existing data systems whose SQL implementations they mimic are missing something: in my view, performance and flexibility.

Thus, I also favor option 2, though I lean a bit towards option 1, especially if a desired behavior is user-controllable (see #288) with a low maintenance burden for the Dask-SQL project.

I am curious about the reasons @nils-braun focused on PostgreSQL specifically. I wonder if PostgreSQL semantics were particularly important for early use cases?

randerzander · Nov 03 '21

> Most engines also have a section somewhere in their documentation where they discuss their (non-)conformance, so that might also be an approach that we should take.

Agreed, I definitely think that documenting our non-conformance would make it clearer upfront to SQL users what this project offers (and what it doesn't).

> I am curious about the reasons nils-braun focused on PostgreSQL specifically.

Same - I suspect that it may have just been a means of bootstrapping this project in the way that was most convenient at the time (in which case branching away from PostgreSQL shouldn't be much of an issue), but it would be nice to get some feedback from @nils-braun on this.

charlesbluca · Nov 03 '21

cc @goodwanghan in case you have any thoughts on this

charlesbluca · Nov 09 '21