prql
prql copied to clipboard
`if` or `case` expressions
I'm thinking from the language POV what might be the biggest gaps, since we're so close to 0.2.
Something we don't yet have in PRQL is if
or case when then else end
expressions.
This might be the biggest gap in the language atm. Should we do anything on this for 0.2 or leave them? (adding a 0.2 tag for the sake of the question).
A few options:
- A tertiary expression, like
x ? y : z
(probably not with a colon though) - An
if
function, likex | if y z
— though written this way it would need to have a signaturef y z x
, maybe not conventional - A case pipeline like (without having thought about it sufficiently):
case (
x -> y
_ -> z
)
- A match pipeline like (without having thought about it sufficiently):
match a (
b -> c
d -> e
_ -> f
)
I'll close https://github.com/prql/prql/issues/50 in favor of this.
Malloy uses pick
for these:
size ?
pick 'small' when < 10
pick 'medium' when < 20
else 'large'
Tangentially, something nice in Malloy is "partial comparison"; the pick
becomes an expression rather than a statement; or another way of saying this in that it curries size ?
. So the above is equivalent to:
pick 'small' when size < 10
pick 'medium' when size < 20
else 'large'
ref #71
@max-sixty Is there any workaround for this currently? Have you had any further thoughts on what direction you'd like to go?
It seems like an if
expression might be nice:
select [
property = if x { if y > 200 { 'foo' } else { 'bar' } } else { 'baz' }
]
Seems like you can use S-strings to at least get something working and have the benefit of reusing definitions:
# in the employees sample
derive [
tenure_or_zero = tenure ?? 0,
tenure_type = s"""IF {tenure_or_zero} > 4 THEN 'senior' ELSE IF {tenure_or_zero} > 10 'legend' ELSE 'normal' END"""
]
generates:
IF COALESCE(newest_employees.tenure, 0) > 4 THEN 'senior'
ELSE IF COALESCE(newest_employees.tenure, 0) > 10 'legend'
ELSE 'normal'
END AS tenure_type
Yes, the s-string is the classic escape hatch. But I think this is one of the few areas that's really missing from the language for it to be broadly useful without hacks, and it's worth us adding it next.
What do you think about the options above? Some of them are quite easy to do — e.g. a tertiary. The match
& case
are cooler, but would require more thought.
(we don't have a precise notion of blocks — e.g. if y > 200 { 'foo' }
— but we can use functions and parentheses to do something similar)
I agree that this is one of the major missing features at the moment. I hit this as soon as I tried to implement the first example from the dbt documentation in PRQL. Big vote in favour of making this part of the 0.3 milestone.
I also think it's important enough to consider well. I'm not well versed enough in the benefits of match
vs case
vs if expressions
but I am strongly against tertiary operator\expressions
. If we had a Quadratic Voting procedure I'd use a lot of my budget on this. I find tertiary expressions very hard to read. I believe one of the largest bikeshedding discussions for Prettier was around the best format for nested tertiaries and when I saw how these get used in Javascript/Typescript, I nearly fell off my chair.
In my opinion, one of the key reason behind Python's success and popularity is its readability, and for me using and \ or
instead of && \ ||
and expr_a if cond else expr_b
instead of cond ? expr_a : expr_b
are a large part of that. We also strive for terseness but I would put readability first. I think similarly SQL has been relatively easy to pick up for non-programmers due to its readability but it steered too far away from terseness (as well as composability and orthogonality).
Edit: Not saying I'm necessarily in favour of the Python ordering of A if cond else B
and would be happy with if cond then A else B
but I strongly prefer keywords over symbols in this case.
@aljazerzen WDYT about the case
& match
expression / pipelines?
I think they could be good, but they would extend our syntax a bit, and so are worth considering carefully.
For match, would this work:
match foo (
in 2..5 -> "low"
in 6..8 -> "medium"
== 10 -> "high"
_ -> "invalid"
)
It's kinda a pipeline, kinda a match expression using guards.
We could also use a list rather than a pipeline.
I hear the point about tertiary expressions being difficult to read, especially when compounded, but it does fit with the language.
Basically everything in prql is an expression rather than a statement, and so an if foo then bar else baz
doesn't really fit. We can extend the language to fit it — like we would for match
, but there's some downside.
Sorry, I should have been careful in what I was saying and it's really the tertiary operator ? :
rather than a tertiary expression
that I am against.
>>> (lambda x: 1 if x>0 else -1 if x<0 else 0)(-5)
-1
works in Python so you could well have a tertiary expression just using if \ then \ else
keywords rather than the ? :
symbols.
I agree with all said:
-
tertiary operator is indeed hard to read and (in my opinion) archaic,
-
match
would require us to extend the syntax, but I like syntax max proposed (let's call itoption 1
), -
if
could be implemented as a simple function:(option 2) select cost = (if is_promo 0 total)
... but that may not be verbose enough. If we add
then
andelse
:(option 3) select cost = (if is_promo then 0 else total)
This may still be parsable as a normal function call. I'd like to add that I think we should avoid Python-like
A if cond else B
in favor ofif cond then A else B
, because in my opinion, having condition in the middle is confusing, since all otherif
statements and tertiary operators I know have the condition at the start.
To be a complete language, I think we need both if
and match
, because if
is convenient for quick null checks, while match
is needed for mapping enums and such. Maybe we could do without match
if we would go with option 2
.
We could add an if
function quickly — my one concern is that if test when_true when_false
doesn't pipe well, since piping puts the item being piped last. So this looks reasonable:
if (x>5) "high" "low"
but piping doesn't work
(x>5) | if "high" "low" # equivalent to: if "high" "low" (x>5)
The alternative is probably a tertiary expression / operator!
(x>5) ? "high" "low" # equivalent to: if (x>5) "high" "low"
Then match
/ case
is important but a bigger decision — e.g. do we need both; are the conditions a pipeline or a list; do we need both; can we offer exhaustive matching.
If there's consensus on the if
func, we can go ahead with that?
So no (:
else operator ) in the above proposal ?
Proposed
(x>5) ? "high" "low
I like piping to ? , but think it feels odd without the :
If we have a full featured match I think this would only be used for inline if/elses like most modern use cases.
So no (
:
else operator ) in the above proposal ?
Depends whether this is a standard function or new syntax. Probably we don't want a :
because we use that already, I'm completely fine with something there though...
Interesting. @max-sixty I didn't really get your if
function before but I think I got it now.
I quite like the idea of ?
as an infix indexing operator, like the APL ⊃ Pick
function (ignoring the nesting), which in a sense generalises the if
function beyond the [0, 1]
range. You have the slight complication that we usually write the then
expression (corresponding to index 1
) before the else
expression (corresponding to index 0
). Perhaps that could be solved with a thenelse
function which takes two arguments and turns that into an array / list with the order reversed?
I did a small Python prototype to illustrate how I see this working. There's no ?
operator in Python so I used ^
instead for the example.
class Choose:
"Implements ^ Choose operator for lists"
def __init__(self, choices):
self.choices = choices
def __rxor__(self, choice):
return dict(enumerate(self.choices)).get(choice)
def __repr__(self):
return f"{self.__class__.__name__}({self.choices})"
>>> print(f"{0 ^ Choose(['a', 'b', 'c'])=}")
0 ^ Choose(['a', 'b', 'c'])='a'
>>>print(f"{1 ^ Choose(['a', 'b', 'c'])=}")
1 ^ Choose(['a', 'b', 'c'])='b'
>>>print(f"{2 ^ Choose(['a', 'b', 'c'])=}")
2 ^ Choose(['a', 'b', 'c'])='c'
The thenelse
function wraps values in a list and reverses the order so they work with the ^
operator
def thenelse(then_, else_):
return Choose([else_, then_])
>>> print(f"{thenelse('a', 'b')=}")
thenelse('a', 'b')=Choose(['b', 'a'])
>>> print(f"{True ^ thenelse('a', 'b')=}")
True ^ thenelse('a', 'b')='a'
>>> print(f"{False ^ thenelse('a', 'b')=}")
False ^ thenelse('a', 'b')='b'
Finally an example of how you might want to use this:
import pandas as pd
df = pd.DataFrame({'a':range(4)})
df['b'] = df.a.apply(lambda a: (a > 1) ^ thenelse('high', 'low'))
df['c'] = df.a.apply(lambda a: a ^ Choose(['one', 'two', 'three']))
df
a b c
0 0 low one
1 1 low two
2 2 high three
3 3 high None
So the proposed PRQL syntax could be
(x>5) ? thenelse "high" "low
and what you would gain from complicating things with the thenelse
is that you could also do things like
2 ? ['one' 'two' 'three']
The Python example generalises quite nicely to a more general Pick
function / operator:
class Pick:
def __init__(self, *args, **kwds):
self.choices = {**dict(enumerate(args)), **kwds}
def __rxor__(self, choice):
return self.choices.get(choice)
def __repr__(self):
return f"{self.__class__.__name__}({self.choices})"
>>> print(f"{True ^ Pick('falsy', 'truthy', 'two', a='A', b='B')=}")
True ^ Pick('falsy', 'truthy', 'two', a='A', b='B')='truthy'
>>> print(f"{False ^ Pick('falsy', 'truthy', 'two', a='A', b='B')=}")
False ^ Pick('falsy', 'truthy', 'two', a='A', b='B')='falsy'
>>> print(f"{0 ^ Pick('falsy', 'truthy', 'two', a='A', b='B')=}")
0 ^ Pick('falsy', 'truthy', 'two', a='A', b='B')='falsy'
>>> print(f"{1 ^ Pick('falsy', 'truthy', 'two', a='A', b='B')=}")
1 ^ Pick('falsy', 'truthy', 'two', a='A', b='B')='truthy'
>>> print(f"{2 ^ Pick('falsy', 'truthy', 'two', a='A', b='B')=}")
2 ^ Pick('falsy', 'truthy', 'two', a='A', b='B')='two'
>>> print(f"{'a' ^ Pick('falsy', 'truthy', 'two', a='A', b='B')=}")
'a' ^ Pick('falsy', 'truthy', 'two', a='A', b='B')='A'
I don't know how any of this would be implemented on the transpiled SQL side though.
Thanks a lot for the excellent examples and APL reference @snth ! Very happy to have more inspiration from APL, please keep supplying it...
(also I didn't know about the f-string f"{x=}"
functionality, that's really useful, kinda like dbg!
in rust)
I personally quite like the generalization from the tertiary operator to the "pick" / "index" function. A couple of tradeoffs:
- It's less intuitive to have
test <op> false true
thantest <op> true false
, and wrapping the list in athenelse
is burdensome. Though we could always implementif
in terms of this. - While I like APL a lot, sometimes it's easier to write than read, and this may be one of those places. Possibly we want something like
if
for very short conditions, or a full explicitmatch
for longer ones. But very open-minded! - (very minor) SQL int / bool often require an explicit cast
Yeah I agree with you and I'm not really in favour of my pick / index function idea myself as I don't think it's very readable and doesn't naturally map onto any underlying SQL concepts. It was mostly a case of me having the realisation that if then else
was a special case of array indexing with 0 and 1 and letting the idea run to see where it would lead. It makes sense for APL where arrays are front and center but that's not the case for SQL so it's not a good fit here.
Coming back to case / match
, which I think is really the first prize here that we need to aim for, I prefer sticking closer to the SQL terms as I think PRQL's primary appeal will be for people looking for a more concise and consistent SQL, at least for a time to come. Therefore I would favour
case (
a -> b
c -> d
_ -> e
)
over the equivalent match
.
From what I can tell, the semantic difference between case
and match
(and I missed this the first time around because my eye skipped over the single letter), is that case
allows arbitrary expressions in each clause while match
matches a particular expression against conditions.
So with case
you can do
select status = case [
age > 65 -> "retired",
is_suspended -> "suspended",
_ -> "working"
]
which I don't think you could do with a match
.
Also, case
maps onto CASE
in SQL in a straightforward manner, so apart from the language syntax extension it should be simple to implement. I imagine pattern matching around match
might be more complicated (unless you're just doing simple substitutions of the match_expr
as in the Malloy pick
example).
In terms of how this would be implemented, in SQL there is no if
, so a if cond then A else B
would really just be syntactic sugar for
case (
cond -> A
_ -> B
)
. So I just want to raise the question of whether we need it?
I guess an answer to that would be that for inline expressions, the if then else
syntax might be more convenient and readable. So taking @aljazerzen 's examples
(option 3)
select cost = (if is_promo then 0 else total)
might read better than
(option 1, or should it be 4?)
select cost = case [is_promo -> 0, _ -> total]
]
Or does it? I think I'm actually finding that case
example easier to read.
Somewhat off topic but related, what are our syntax rules for ()
vs []
? @max-sixty your initial case
example used ()
and it's something that I've noticed in the PRQL book as well that I don't really understand properly.
In the following example from the PRQL book, why does the group
take ()
while the aggregate takes []
?
from employees
group [title, country] (
aggregate [
average salary,
ct = count
]
)
I think that case
is the way to go. You are quite convincing :)
I think that in your example, if then else
is more readable, but that can be added later as syntactic sugar if there will be enough votes to do so.
Regarding ()
vs []
:
-
[]
are used for arrays (many elements of the same type, order matters), -
()
are for pipelines (many functions, applied one onto another). In special case of only one element, this is equivalent to just that element (just as normal parenthesized expression). Zero elements would be equivalent to null?
So group needs parentheses because it needs a pipeline - a function that is applied to each group.
Aggregate needs a list of expressions (order matters), so thats an array.
With case order also matters and I dont think we could shape this to be a pipeline of functions, so I'd say it should be an array of "conditions".
Just pinging this issue since it was mentioned in https://github.com/prql/prql/discussions/1145#discussioncomment-4186499.
I see this as quite a priority as CASE
statements are needed for many things in SQL. I think we need this ready at least by the time of our conference talks in March next year. For example translating the first example from the dbt docs is still blocked by this feature. Perhaps something we can look at once the semantic branch is done?
I think we can close this in favor of #1286 & #1332!