prql
prql copied to clipboard
Should `append` sometimes or always create `UNION ALL BY NAME`?
What's up?
Currently a PRQL append transform will result in UNION ALL:
from tbl_a
append tbl_b
gives
SELECT * FROM tbl_a UNION ALL SELECT * FROM tbl_b -- Generated by PRQL compiler version:0.13.3-39-ge393ab4d (https://prql-lang.org)
However, this results in issues like #4724, #2680, and #3184, where the underlying cause is that UNION ALL is interpreted by the database as "unify by column position" rather than "unify by column name". See the DuckDB docs:
Traditional set operations unify queries by column position, and require the to-be-combined queries to have the same number of input columns. If the columns are not of the same type, casts may be added. The result will use the column names from the first query. DuckDB also supports UNION [ALL] BY NAME, which joins columns by name instead of by position. UNION BY NAME does not require the inputs to have the same number of columns. NULL values will be added in case of missing columns.
Questions:
- Should
appendalways behave asUNION ALL BY NAMEto simplify semantics from the user perspective, and also make the compiler's job easier? This would resolve all of the linked issues above without needing to dive into compiler details, but would be a breaking change for users expecting traditionalUNION ALLbehavior. - If "no" to the above question, can we add a
by:nameorby:positionargument toappendto allow users to useUNION ALL BY NAMEwhen that makes sense for their use case?
Disclaimer: https://github.com/PRQL/prql/pull/5317 is my first contribution and I'm no SQL expert.
Here are my two cents, @kgutwin.
EDIT: I simplified my answer after digging all day 😄
It would indeed make sense to have by:name argument to let know PRQL that it can use the same column names inside the append, and therefore avoiding two select statement to manually order those columns.
That said, we probably don't need UNION ALL BY NAME for this. It can select those columns and work by position like always. For instance, in https://github.com/PRQL/prql/issues/4724, PRQL knows both columns. It mismatches by design, like UNION.
AFAIK DuckDB's UNION ALL BY NAME is an outlier here. I took a quick look and did not find this for most RDB/dialects. It seems to me we can assume most databases uses position paradigm as a foundation for their architecture and optimisations and won't implement union by name. So it's best not to use this syntax.
I don't know well how operator parameters enough to know how much works this is, though.
I agree a bit of documentation in the meantime would go a long way!
Hope that helps!
Yeah, I agree that we don't want to assume that the SQL dialect supports UNION ALL BY NAME. Actually, my question is whether or not we can establish PRQL's semantics to always assume "by name" behavior when using append. And, if the answer is "no" (because there are cases where "by position" is either assumed or required), then giving the user the option to use "by name" semantics in the append call would be useful.
For clarity, my use case is to provide users with a general interface to working with tabular data, and one requirement is to be able to append data sets. I'd like for the append operator that we offer to default to a "by name" approach (because I think it's easier to rationalize) and while we could build that on our end, it would be nice for PRQL to be able to handle it.
I understand. Thanks for the details.
My take on this is:
- no for the default
- yes for the opt-in parameter, I agree.
As far as I know, we can sum up the situation like this:
| input | mode | database / dialect | feasibility |
|---|---|---|---|
| implicit append (without select) | by:name | duckdb | ✅ doable |
| implicit append (without select) | by:name | other dbs | 🚫 impossible by design |
| implicit append (without select) | by:position | all | ✅ doable |
| append + select | by:name | all | ✅ doable |
| append + select | by:position | all | ✅ doable |
With an optional by:name parameter, the user may use it as long as :
- EITHER they use a DB which supports
UNION ALL BY NAMElike duckdb - OR they explicit the columns with
selecton both side
yes, great table
My take on this is:
1. no for the default 2. yes for the opt-in parameter, I agree.
sounds very reasonable!