prql
prql copied to clipboard
Union operator
This is currently lacking. Possibly it could be as simple as:
from employees
union
from managers
What about other set operators? It seems like difference and intersect aren't discussed either.
for the "union", we may need to consider both "union" and "union all".
Hi,
I have been thinking about this have a proposal which may be a bit of a departure from SQL but which I think could get us some benefits.
SQL allows table expressions/subqueries in a number of places, like on either side of a JOIN, UNION and in (correlated) subqueries. Leaving the correlated subqueries aside for the moment, I think that part of the reason may be that these were available before the introduction of CTEs. Coming at it from a more modern perspective I question whether we still need that flexibility.
My understanding (mostly from @aljazerzen 's responses to issues rather any look a the compiler code) is that the PRQL compiler works by applying successive transformation functions to frames and producing new frames. By the sounds of it this follows the Fluent Interface data api pattern akin to how you would construct a pipeline in Pandas or Ibis.
I think this is quite a good choice and propose that we stick to this fluent interface API because I because it keeps the flow simple and linear. I also think that there is no loss of generality because what we're giving up can be achieved with CTEs.
So rather than
from employees
union
from managers
we have
table t2 = (
from managers
)
from employees
union t2
or simply
from employees
union managers
The way I think of it is that you essentially have one main pipeline and any other sub-pipelines that you require you need to factor out into table expressions/CTEs.
I think this is also a place where we could innovate a bit over SQL and introduce some additional ergonomics. For example one nuisance with UNIONs is that you have to have compatible columns. With some knowledge of the table and column types we could generate unions or intersections of column names automatically.
table t1 = (
from X
select a, b
)
table t2 = (
from X
select b, c
)
from t1
union all:true columns:intersection t2
could be translated to
(SELECT b FROM t1)
UNION ALL
(SELECT b from t2)
and
table t1 = (
from X
select a, b
)
table t2 = (
from X
select b, c
)
from t1
union all:false columns:union t2
could be translated to
(SELECT a, b, NULL as c FROM t1)
UNION
(SELECT NULL as a, b, c from t2)
Lastly, perhaps the columns parameter could also take an explicit list of column names, like
from employees_america
union all:false columns:[uuid] employees_europe employees_asia employees_africa
gets transpiled to
(SELECT uuid FROM employees_america)
UNION
(SELECT uuid FROM employees_europe)
UNION
(SELECT uuid FROM employees_asia)
UNION
(SELECT uuid FROM employees_africa)
This might be useful for unioning together partitioned tables, especially if the table identifiers could be generated with some sort of glob or regex expression.
Perhaps we should rather apply the YAGNI principle and wait until someone actually requests these extensions. I don't have a use case for them personally myself. We do need to have a union operator though with some priority, as it's needed for recursive CTEs (see #407). The recursive CTE requirement might need some refinement of this union proposal but I'll address that separately.
@zopieux I think intersect and difference can be achieved with sem joins and anti joins. I'm not sure how widely supported those are but there are correlated subquery patterns with EXISTS that can work around this. This will require backend adaptor specific code though so I think we should rather address this in a separate issue in order to keep the union implementation simple. I will open an issue for those.
https://github.com/prql/prql/issues/656#issuecomment-1174713630 has some good suggestions.
The UNION operation is fairly different from most operations in SQL, so it makes sense that it feels a bit foreign.
I don't have strong views between
from employees union from managers
and
from employees union managers
...probably with a mild preference for the second.
Ideally we can still have an inline pipeline with
from employees
union (
from managers
derive x = foo
)
I think we can have a single positional parameter for all / distinct / intersect (which means a bare union won't work).
Does that seem like a reasonable first implementation? (Without having strong views on the more advanced suggestions in the later comments)
Just realizing that union would be a function that can take two tables as arguments, which means that it can also be called as:
union (
from employees
) (
from managers
derive x = foo
)
or more concise:
union (from employees) (from managers | derive x = foo)
Do you think it should be more a function that takes two parameters[^1] — or like a transform like join is, connecting the pipeline above?
[^1]: or a list
I'm saying that these two things are the same :smile: !
from employees
union (from managers | derive x = foo)
... is equivalent to:
union (from managers | derive x = foo) (from employees)
Yes, great point!
Technically, we can't then have a list; we'd need exactly two parameters (or to change how we do parameters).
That's true and its a shame. I'm guessing that there would be many uses-cases for accepting a list...
Yeah, I think we can iterate on that along with https://github.com/prql/prql/issues/648 around generalizing over single items & lists.
I can imagine some syntheses of these — e.g. the final argument of a function can take multiple items as arguments and they're converted to a list (though not sure if that's a good idea tbc)
Closed by #894
Related work continues on #761