prql icon indicating copy to clipboard operation
prql copied to clipboard

Union operator

Open max-sixty opened this issue 3 years ago • 12 comments
trafficstars

This is currently lacking. Possibly it could be as simple as:

from employees
union
from managers

max-sixty avatar Jun 26 '22 04:06 max-sixty

What about other set operators? It seems like difference and intersect aren't discussed either.

zopieux avatar Jun 28 '22 23:06 zopieux

for the "union", we may need to consider both "union" and "union all".

Baoqi avatar Jul 01 '22 05:07 Baoqi

Hi,

I have been thinking about this have a proposal which may be a bit of a departure from SQL but which I think could get us some benefits.

SQL allows table expressions/subqueries in a number of places, like on either side of a JOIN, UNION and in (correlated) subqueries. Leaving the correlated subqueries aside for the moment, I think that part of the reason may be that these were available before the introduction of CTEs. Coming at it from a more modern perspective I question whether we still need that flexibility.

My understanding (mostly from @aljazerzen 's responses to issues rather any look a the compiler code) is that the PRQL compiler works by applying successive transformation functions to frames and producing new frames. By the sounds of it this follows the Fluent Interface data api pattern akin to how you would construct a pipeline in Pandas or Ibis.

I think this is quite a good choice and propose that we stick to this fluent interface API because I because it keeps the flow simple and linear. I also think that there is no loss of generality because what we're giving up can be achieved with CTEs.

So rather than

from employees
union
from managers

we have

table t2 = (
    from managers
)

from employees
union t2

or simply

from employees
union managers

snth avatar Jul 05 '22 07:07 snth

The way I think of it is that you essentially have one main pipeline and any other sub-pipelines that you require you need to factor out into table expressions/CTEs.

I think this is also a place where we could innovate a bit over SQL and introduce some additional ergonomics. For example one nuisance with UNIONs is that you have to have compatible columns. With some knowledge of the table and column types we could generate unions or intersections of column names automatically.

table t1 = (
   from X
   select a, b
)

table t2 = (
   from X
   select b, c
)

from t1
union all:true columns:intersection t2

could be translated to

(SELECT b FROM t1)
UNION ALL
(SELECT b from t2)

and

table t1 = (
   from X
   select a, b
)

table t2 = (
   from X
   select b, c
)

from t1
union all:false columns:union t2

could be translated to

(SELECT a, b, NULL as c FROM t1)
UNION
(SELECT NULL as a, b, c from t2)

Lastly, perhaps the columns parameter could also take an explicit list of column names, like

from employees_america
union all:false columns:[uuid] employees_europe employees_asia employees_africa

gets transpiled to

(SELECT uuid FROM employees_america)
UNION
(SELECT uuid FROM employees_europe)
UNION
(SELECT uuid FROM employees_asia)
UNION
(SELECT uuid FROM employees_africa)

This might be useful for unioning together partitioned tables, especially if the table identifiers could be generated with some sort of glob or regex expression.

Perhaps we should rather apply the YAGNI principle and wait until someone actually requests these extensions. I don't have a use case for them personally myself. We do need to have a union operator though with some priority, as it's needed for recursive CTEs (see #407). The recursive CTE requirement might need some refinement of this union proposal but I'll address that separately.

snth avatar Jul 05 '22 09:07 snth

@zopieux I think intersect and difference can be achieved with sem joins and anti joins. I'm not sure how widely supported those are but there are correlated subquery patterns with EXISTS that can work around this. This will require backend adaptor specific code though so I think we should rather address this in a separate issue in order to keep the union implementation simple. I will open an issue for those.

snth avatar Jul 05 '22 09:07 snth

https://github.com/prql/prql/issues/656#issuecomment-1174713630 has some good suggestions.

The UNION operation is fairly different from most operations in SQL, so it makes sense that it feels a bit foreign.

I don't have strong views between

from employees
union
from managers

and

from employees
union managers

...probably with a mild preference for the second.

Ideally we can still have an inline pipeline with

from employees
union (
  from managers
  derive x = foo
)

I think we can have a single positional parameter for all / distinct / intersect (which means a bare union won't work).

Does that seem like a reasonable first implementation? (Without having strong views on the more advanced suggestions in the later comments)

max-sixty avatar Jul 07 '22 03:07 max-sixty

Just realizing that union would be a function that can take two tables as arguments, which means that it can also be called as:

union (
  from employees
) (
  from managers 
  derive x = foo
)

or more concise:

union (from employees) (from managers | derive x = foo)

aljazerzen avatar Jul 07 '22 19:07 aljazerzen

Do you think it should be more a function that takes two parameters[^1] — or like a transform like join is, connecting the pipeline above?

[^1]: or a list

max-sixty avatar Jul 07 '22 21:07 max-sixty

I'm saying that these two things are the same :smile: !

from employees
union (from managers | derive x = foo)

... is equivalent to:

union (from managers | derive x = foo) (from employees)

aljazerzen avatar Jul 08 '22 06:07 aljazerzen

Yes, great point!

Technically, we can't then have a list; we'd need exactly two parameters (or to change how we do parameters).

max-sixty avatar Jul 08 '22 06:07 max-sixty

That's true and its a shame. I'm guessing that there would be many uses-cases for accepting a list...

aljazerzen avatar Jul 08 '22 08:07 aljazerzen

Yeah, I think we can iterate on that along with https://github.com/prql/prql/issues/648 around generalizing over single items & lists.

I can imagine some syntheses of these — e.g. the final argument of a function can take multiple items as arguments and they're converted to a list (though not sure if that's a good idea tbc)

max-sixty avatar Jul 08 '22 17:07 max-sixty

Closed by #894

Related work continues on #761

aljazerzen avatar Jan 09 '23 09:01 aljazerzen