prql `if` or `case` expressions

`if` or `case` expressions

Open max-sixty opened this issue 2 years ago • 17 comments

I'm thinking from the language POV what might be the biggest gaps, since we're so close to 0.2.

Something we don't yet have in PRQL is if or case when then else end expressions.

This might be the biggest gap in the language atm. Should we do anything on this for 0.2 or leave them? (adding a 0.2 tag for the sake of the question).

A few options:

A tertiary expression, like x ? y : z (probably not with a colon though)
An if function, like x | if y z — though written this way it would need to have a signature f y z x, maybe not conventional
A case pipeline like (without having thought about it sufficiently):

case (
  x -> y
  _ -> z
)

A match pipeline like (without having thought about it sufficiently):

match a (
  b -> c
  d -> e
  _ -> f
)

I'll close https://github.com/prql/prql/issues/50 in favor of this.

May 16 '22 22:05 max-sixty

Malloy uses pick for these:

size ?
  pick 'small' when < 10
  pick 'medium' when < 20
  else 'large'

Tangentially, something nice in Malloy is "partial comparison"; the pick becomes an expression rather than a statement; or another way of saying this in that it curries size ?. So the above is equivalent to:

  pick 'small' when size < 10
  pick 'medium' when size < 20
  else 'large'

ref #71

May 16 '22 22:05 max-sixty

@max-sixty Is there any workaround for this currently? Have you had any further thoughts on what direction you'd like to go?

It seems like an if expression might be nice:

select [ 
    property = if x { if y > 200 { 'foo' } else { 'bar' } } else { 'baz' } 
]

Seems like you can use S-strings to at least get something working and have the benefit of reusing definitions:

# in the employees sample
derive [
    tenure_or_zero = tenure ?? 0,
    tenure_type = s"""IF {tenure_or_zero} > 4 THEN 'senior' ELSE IF {tenure_or_zero} > 10 'legend' ELSE 'normal' END"""
  ]

generates:

IF COALESCE(newest_employees.tenure, 0) > 4 THEN 'senior'
  ELSE IF COALESCE(newest_employees.tenure, 0) > 10 'legend'
  ELSE 'normal'
END AS tenure_type

Aug 12 '22 15:08 Ryman

Yes, the s-string is the classic escape hatch. But I think this is one of the few areas that's really missing from the language for it to be broadly useful without hacks, and it's worth us adding it next.

What do you think about the options above? Some of them are quite easy to do — e.g. a tertiary. The match & case are cooler, but would require more thought.

(we don't have a precise notion of blocks — e.g. if y > 200 { 'foo' } — but we can use functions and parentheses to do something similar)

Aug 12 '22 20:08 max-sixty

I agree that this is one of the major missing features at the moment. I hit this as soon as I tried to implement the first example from the dbt documentation in PRQL. Big vote in favour of making this part of the 0.3 milestone.

I also think it's important enough to consider well. I'm not well versed enough in the benefits of match vs case vs if expressions but I am strongly against tertiary operator\expressions. If we had a Quadratic Voting procedure I'd use a lot of my budget on this. I find tertiary expressions very hard to read. I believe one of the largest bikeshedding discussions for Prettier was around the best format for nested tertiaries and when I saw how these get used in Javascript/Typescript, I nearly fell off my chair.

In my opinion, one of the key reason behind Python's success and popularity is its readability, and for me using and \ or instead of && \ || and expr_a if cond else expr_b instead of cond ? expr_a : expr_b are a large part of that. We also strive for terseness but I would put readability first. I think similarly SQL has been relatively easy to pick up for non-programmers due to its readability but it steered too far away from terseness (as well as composability and orthogonality).

Edit: Not saying I'm necessarily in favour of the Python ordering of A if cond else B and would be happy with if cond then A else B but I strongly prefer keywords over symbols in this case.

Aug 13 '22 10:08 snth

@aljazerzen WDYT about the case & match expression / pipelines?

I think they could be good, but they would extend our syntax a bit, and so are worth considering carefully.

For match, would this work:

match foo (
  in 2..5 -> "low"
  in 6..8 -> "medium"
  == 10 -> "high"
  _ -> "invalid"
)

It's kinda a pipeline, kinda a match expression using guards.

We could also use a list rather than a pipeline.

Aug 14 '22 17:08 max-sixty

I hear the point about tertiary expressions being difficult to read, especially when compounded, but it does fit with the language.

Basically everything in prql is an expression rather than a statement, and so an if foo then bar else baz doesn't really fit. We can extend the language to fit it — like we would for match, but there's some downside.

Aug 14 '22 18:08 max-sixty

Sorry, I should have been careful in what I was saying and it's really the tertiary operator ? : rather than a tertiary expression that I am against.

>>> (lambda x: 1 if x>0 else -1 if x<0 else 0)(-5)
-1

works in Python so you could well have a tertiary expression just using if \ then \ else keywords rather than the ? : symbols.

Aug 15 '22 07:08 snth

I agree with all said:

tertiary operator is indeed hard to read and (in my opinion) archaic,
match would require us to extend the syntax, but I like syntax max proposed (let's call it option 1),
if could be implemented as a simple function:
```
(option 2)
select cost = (if is_promo 0 total)
```
... but that may not be verbose enough. If we add then and else:
```
(option 3)
select cost = (if is_promo then 0 else total)
```
This may still be parsable as a normal function call. I'd like to add that I think we should avoid Python-like A if cond else B in favor of if cond then A else B, because in my opinion, having condition in the middle is confusing, since all other if statements and tertiary operators I know have the condition at the start.

To be a complete language, I think we need both if and match, because if is convenient for quick null checks, while match is needed for mapping enums and such. Maybe we could do without match if we would go with option 2.

Aug 16 '22 14:08 aljazerzen

We could add an if function quickly — my one concern is that if test when_true when_false doesn't pipe well, since piping puts the item being piped last. So this looks reasonable:

if (x>5) "high" "low"

but piping doesn't work

(x>5) | if "high" "low"  # equivalent to: if "high" "low" (x>5)

The alternative is probably a tertiary expression / operator!

(x>5) ? "high" "low" # equivalent to: if (x>5) "high" "low"

Then match / case is important but a bigger decision — e.g. do we need both; are the conditions a pipeline or a list; do we need both; can we offer exhaustive matching.

If there's consensus on the if func, we can go ahead with that?

Aug 17 '22 17:08 max-sixty

So no (: else operator ) in the above proposal ?

Proposed

(x>5) ? "high" "low

I like piping to ? , but think it feels odd without the :

If we have a full featured match I think this would only be used for inline if/elses like most modern use cases.

Aug 17 '22 18:08 qharlie

So no (: else operator ) in the above proposal ?

Depends whether this is a standard function or new syntax. Probably we don't want a : because we use that already, I'm completely fine with something there though...

Aug 17 '22 19:08 max-sixty

Interesting. @max-sixty I didn't really get your if function before but I think I got it now.

I quite like the idea of ? as an infix indexing operator, like the APL ⊃ Pick function (ignoring the nesting), which in a sense generalises the if function beyond the [0, 1] range. You have the slight complication that we usually write the then expression (corresponding to index 1) before the else expression (corresponding to index 0). Perhaps that could be solved with a thenelse function which takes two arguments and turns that into an array / list with the order reversed?

I did a small Python prototype to illustrate how I see this working. There's no ? operator in Python so I used ^ instead for the example.

class Choose:
    "Implements ^ Choose operator for lists"
    def __init__(self, choices):
        self.choices = choices
    def __rxor__(self, choice):
        return dict(enumerate(self.choices)).get(choice)
    def __repr__(self):
        return f"{self.__class__.__name__}({self.choices})"

>>> print(f"{0 ^ Choose(['a', 'b', 'c'])=}")
0 ^ Choose(['a', 'b', 'c'])='a'

>>>print(f"{1 ^ Choose(['a', 'b', 'c'])=}")
1 ^ Choose(['a', 'b', 'c'])='b'

>>>print(f"{2 ^ Choose(['a', 'b', 'c'])=}")
2 ^ Choose(['a', 'b', 'c'])='c'

The thenelse function wraps values in a list and reverses the order so they work with the ^ operator

def thenelse(then_, else_):
    return Choose([else_, then_])

>>> print(f"{thenelse('a', 'b')=}")
thenelse('a', 'b')=Choose(['b', 'a'])

>>> print(f"{True ^ thenelse('a', 'b')=}")
True ^ thenelse('a', 'b')='a'

>>> print(f"{False ^ thenelse('a', 'b')=}")
False ^ thenelse('a', 'b')='b'

Finally an example of how you might want to use this:

import pandas as pd

df = pd.DataFrame({'a':range(4)})
df['b'] = df.a.apply(lambda a: (a > 1) ^ thenelse('high', 'low'))
df['c'] = df.a.apply(lambda a: a ^ Choose(['one', 'two', 'three']))
df

	a	b	c
0	0	low	one
1	1	low	two
2	2	high	three
3	3	high	None

So the proposed PRQL syntax could be

(x>5) ? thenelse "high" "low

and what you would gain from complicating things with the thenelse is that you could also do things like

2 ? ['one' 'two' 'three']

Aug 18 '22 10:08 snth

The Python example generalises quite nicely to a more general Pick function / operator:

class Pick:
    def __init__(self, *args, **kwds):
        self.choices = {**dict(enumerate(args)), **kwds}
    def __rxor__(self, choice):
        return self.choices.get(choice)
    def __repr__(self):
        return f"{self.__class__.__name__}({self.choices})"

>>> print(f"{True ^ Pick('falsy', 'truthy', 'two', a='A', b='B')=}")
True ^ Pick('falsy', 'truthy', 'two', a='A', b='B')='truthy'

>>> print(f"{False ^ Pick('falsy', 'truthy', 'two', a='A', b='B')=}")
False ^ Pick('falsy', 'truthy', 'two', a='A', b='B')='falsy'

>>> print(f"{0 ^ Pick('falsy', 'truthy', 'two', a='A', b='B')=}")
0 ^ Pick('falsy', 'truthy', 'two', a='A', b='B')='falsy'

>>> print(f"{1 ^ Pick('falsy', 'truthy', 'two', a='A', b='B')=}")
1 ^ Pick('falsy', 'truthy', 'two', a='A', b='B')='truthy'

>>> print(f"{2 ^ Pick('falsy', 'truthy', 'two', a='A', b='B')=}")
2 ^ Pick('falsy', 'truthy', 'two', a='A', b='B')='two'

>>> print(f"{'a' ^ Pick('falsy', 'truthy', 'two', a='A', b='B')=}")
'a' ^ Pick('falsy', 'truthy', 'two', a='A', b='B')='A'

I don't know how any of this would be implemented on the transpiled SQL side though.

Aug 18 '22 10:08 snth

Thanks a lot for the excellent examples and APL reference @snth ! Very happy to have more inspiration from APL, please keep supplying it...

(also I didn't know about the f-string f"{x=}" functionality, that's really useful, kinda like dbg! in rust)

I personally quite like the generalization from the tertiary operator to the "pick" / "index" function. A couple of tradeoffs:

It's less intuitive to have test <op> false true than test <op> true false, and wrapping the list in a thenelse is burdensome. Though we could always implement if in terms of this.
While I like APL a lot, sometimes it's easier to write than read, and this may be one of those places. Possibly we want something like if for very short conditions, or a full explicit match for longer ones. But very open-minded!
(very minor) SQL int / bool often require an explicit cast

Aug 18 '22 18:08 max-sixty

Yeah I agree with you and I'm not really in favour of my pick / index function idea myself as I don't think it's very readable and doesn't naturally map onto any underlying SQL concepts. It was mostly a case of me having the realisation that if then else was a special case of array indexing with 0 and 1 and letting the idea run to see where it would lead. It makes sense for APL where arrays are front and center but that's not the case for SQL so it's not a good fit here.

Coming back to case / match, which I think is really the first prize here that we need to aim for, I prefer sticking closer to the SQL terms as I think PRQL's primary appeal will be for people looking for a more concise and consistent SQL, at least for a time to come. Therefore I would favour

case (
  a -> b
  c -> d
  _ -> e
)

over the equivalent match.

From what I can tell, the semantic difference between case and match (and I missed this the first time around because my eye skipped over the single letter), is that case allows arbitrary expressions in each clause while match matches a particular expression against conditions.

So with case you can do

select status = case [
    age > 65 -> "retired",
    is_suspended -> "suspended",
    _ -> "working"
]

which I don't think you could do with a match.

Also, case maps onto CASE in SQL in a straightforward manner, so apart from the language syntax extension it should be simple to implement. I imagine pattern matching around match might be more complicated (unless you're just doing simple substitutions of the match_expr as in the Malloy pick example).

In terms of how this would be implemented, in SQL there is no if, so a if cond then A else B would really just be syntactic sugar for

case (
  cond -> A
  _ -> B
)

. So I just want to raise the question of whether we need it?

I guess an answer to that would be that for inline expressions, the if then else syntax might be more convenient and readable. So taking @aljazerzen 's examples

(option 3)
select cost = (if is_promo then 0 else total)

might read better than

(option 1, or should it be 4?)
select cost = case [is_promo -> 0,  _ -> total]
]

Or does it? I think I'm actually finding that case example easier to read.

Somewhat off topic but related, what are our syntax rules for () vs []? @max-sixty your initial case example used () and it's something that I've noticed in the PRQL book as well that I don't really understand properly.

In the following example from the PRQL book, why does the group take () while the aggregate takes []?

from employees
group [title, country] (
  aggregate [
    average salary,
    ct = count
  ]
)

Aug 19 '22 09:08 snth

I think that case is the way to go. You are quite convincing :)

I think that in your example, if then else is more readable, but that can be added later as syntactic sugar if there will be enough votes to do so.

Regarding () vs []:

[] are used for arrays (many elements of the same type, order matters),
() are for pipelines (many functions, applied one onto another). In special case of only one element, this is equivalent to just that element (just as normal parenthesized expression). Zero elements would be equivalent to null?

So group needs parentheses because it needs a pipeline - a function that is applied to each group.

Aggregate needs a list of expressions (order matters), so thats an array.

With case order also matters and I dont think we could shape this to be a pipeline of functions, so I'd say it should be an array of "conditions".

Aug 29 '22 11:08 aljazerzen

Just pinging this issue since it was mentioned in https://github.com/prql/prql/discussions/1145#discussioncomment-4186499.

I see this as quite a priority as CASE statements are needed for many things in SQL. I think we need this ready at least by the time of our conference talks in March next year. For example translating the first example from the dbt docs is still blocked by this feature. Perhaps something we can look at once the semantic branch is done?

Nov 23 '22 08:11 snth

I think we can close this in favor of #1286 & #1332!

Jan 02 '23 02:01 max-sixty

prql prql copied to clipboard

`if` or `case` expressions

prql
prql copied to clipboard