prql icon indicating copy to clipboard operation
prql copied to clipboard

Should `append` sometimes or always create `UNION ALL BY NAME`?

Open kgutwin opened this issue 9 months ago • 4 comments

What's up?

Currently a PRQL append transform will result in UNION ALL:

from tbl_a
append tbl_b

gives

SELECT * FROM tbl_a UNION ALL SELECT * FROM tbl_b -- Generated by PRQL compiler version:0.13.3-39-ge393ab4d (https://prql-lang.org)

However, this results in issues like #4724, #2680, and #3184, where the underlying cause is that UNION ALL is interpreted by the database as "unify by column position" rather than "unify by column name". See the DuckDB docs:

Traditional set operations unify queries by column position, and require the to-be-combined queries to have the same number of input columns. If the columns are not of the same type, casts may be added. The result will use the column names from the first query. DuckDB also supports UNION [ALL] BY NAME, which joins columns by name instead of by position. UNION BY NAME does not require the inputs to have the same number of columns. NULL values will be added in case of missing columns.

Questions:

  1. Should append always behave as UNION ALL BY NAME to simplify semantics from the user perspective, and also make the compiler's job easier? This would resolve all of the linked issues above without needing to dive into compiler details, but would be a breaking change for users expecting traditional UNION ALL behavior.
  2. If "no" to the above question, can we add a by:name or by:position argument to append to allow users to use UNION ALL BY NAME when that makes sense for their use case?

kgutwin avatar Feb 27 '25 21:02 kgutwin

Disclaimer: https://github.com/PRQL/prql/pull/5317 is my first contribution and I'm no SQL expert.
Here are my two cents, @kgutwin.

EDIT: I simplified my answer after digging all day 😄

It would indeed make sense to have by:name argument to let know PRQL that it can use the same column names inside the append, and therefore avoiding two select statement to manually order those columns.

That said, we probably don't need UNION ALL BY NAME for this. It can select those columns and work by position like always. For instance, in https://github.com/PRQL/prql/issues/4724, PRQL knows both columns. It mismatches by design, like UNION.

AFAIK DuckDB's UNION ALL BY NAME is an outlier here. I took a quick look and did not find this for most RDB/dialects. It seems to me we can assume most databases uses position paradigm as a foundation for their architecture and optimisations and won't implement union by name. So it's best not to use this syntax.

I don't know well how operator parameters enough to know how much works this is, though.

I agree a bit of documentation in the meantime would go a long way!

Hope that helps!

Fanaen avatar Jun 06 '25 08:06 Fanaen

Yeah, I agree that we don't want to assume that the SQL dialect supports UNION ALL BY NAME. Actually, my question is whether or not we can establish PRQL's semantics to always assume "by name" behavior when using append. And, if the answer is "no" (because there are cases where "by position" is either assumed or required), then giving the user the option to use "by name" semantics in the append call would be useful.

For clarity, my use case is to provide users with a general interface to working with tabular data, and one requirement is to be able to append data sets. I'd like for the append operator that we offer to default to a "by name" approach (because I think it's easier to rationalize) and while we could build that on our end, it would be nice for PRQL to be able to handle it.

kgutwin avatar Jun 09 '25 20:06 kgutwin

I understand. Thanks for the details.

My take on this is:

  1. no for the default
  2. yes for the opt-in parameter, I agree.

As far as I know, we can sum up the situation like this:

input mode database / dialect feasibility
implicit append (without select) by:name duckdb ✅ doable
implicit append (without select) by:name other dbs 🚫 impossible by design
implicit append (without select) by:position all ✅ doable
append + select by:name all ✅ doable
append + select by:position all ✅ doable

With an optional by:name parameter, the user may use it as long as :

  • EITHER they use a DB which supports UNION ALL BY NAME like duckdb
  • OR they explicit the columns with select on both side

Fanaen avatar Jun 10 '25 14:06 Fanaen

yes, great table

My take on this is:

1. no for the default

2. yes for the opt-in parameter, I agree.

sounds very reasonable!

max-sixty avatar Jun 10 '25 19:06 max-sixty