prql icon indicating copy to clipboard operation
prql copied to clipboard

Syntax to `filter` a list of boolean expressions to be ANDed

Open ajfriend opened this issue 2 years ago • 8 comments

I often have queries with a long list of filters. Here's a nonsense made up example:

from tbl_a
filter (
    foo > 100 and foo < 200 and bar != null and (average bar > 3) and
)

I understand that filter takes in a single boolean expression, but I'd also say that the most common use case of filter is to AND a list of boolean expressions. Would a syntax something like this make sense?

from tbl1
filter [
    foo > 100,
    foo < 200,
    bar != null,
    average bar > 3,
]

This has the advantage of being able to break up the long list over multiple lines, and removes the need to type and all over the place. (And note theres a "trailing AND" issue that's synonymous with the "trailing comma" issue that PRQL already addresses.)

There's also a chance that there's a way to do this with the current syntax that I'm missing. Please let me know if that's the case! I just didn't see many examples in the docs that involved several ANDed filters. I suppose one way to achieve this would be something like

from tbl_a
filter foo > 100
filter foo < 200
filter bar != null
filter (average bar > 3)

which is great, but maybe not as clean as the list proposal above?

ajfriend avatar Jul 09 '22 22:07 ajfriend

Also maybe interesting, the two (working) examples above seem to compile differently:

WITH table_0 AS (
  SELECT
    tbl_a.*
  FROM
    tbl_a
)
SELECT
  table_0.*
FROM
  table_0
WHERE
  foo > 100
  AND foo < 200
  AND bar IS NOT NULL
  AND AVG(bar > 3) OVER ()
WITH table_0 AS (
  SELECT
    tbl_a.*
  FROM
    tbl_a
  WHERE
    foo > 100
    AND foo < 200
    AND bar IS NOT NULL
)
SELECT
  table_0.*
FROM
  table_0
WHERE
  AVG(bar > 3) OVER ()

And the first one seems to create an unnecessary CTE? (Not that I'm really concerned or complaining about it, I just noticed it and figured it was worth mentioning.)

ajfriend avatar Jul 09 '22 22:07 ajfriend

But maybe its the AVG(bar > 3) that's kind of weird/nonsenical here...

ajfriend avatar Jul 09 '22 22:07 ajfriend

Yeah, if I drop the average, they compile the same (without the extra CTE):

SELECT
  tbl_a.*
FROM
  tbl_a
WHERE
  foo > 100
  AND foo < 200
  AND bar IS NOT NULL

ajfriend avatar Jul 09 '22 22:07 ajfriend

Thanks for the issue @ajfriend .

We touched on this here: https://github.com/prql/prql/issues/469.

I'm open-minded; I tend to agree that a list is quite a logical conclusion from a list that it would be and, but if it's confusing for some, then having multiple filter expressions seems reasonable.

max-sixty avatar Jul 09 '22 22:07 max-sixty

Well, consider this a vote for this syntax :)

ajfriend avatar Jul 09 '22 23:07 ajfriend

Although I don't personally like the idea of default filtering implicitly being AND-ed together (I prefer explicit things), it's slightly weird that select, filter and derive all function differently when it comes to square brackets:

  • with derive, multiple operations can be combined into one using derive [a1 = b1, a2 = b2]
  • with select, you can use select [a, b] but it's not equivalent to select a | select b
  • with filter, [ ... ] isn't allowed at all

mklopets avatar Jul 10 '22 16:07 mklopets

Average in filter

Woah, I didn't expect average to be used in filter... I not sure what you wanted to do with that, but we should have some kind of error for what you've done :smile:

When used outside of aggregate, the function average (and many others) are considered to be window functions (mapping one column into another column). Because average produces only one value, this value is broadcast to the whole column. This is useful for example here:

from costs
derive normalized_cost = cost / (sum cost)

But broadcasting does not make sense in filter, since all values would have the same value (at least without group/PARTITION BY). So maybe we should throw an error here.

#420 #435 #478

Why two CTEs?

Window functions are actually not allowed anywhere but in SELECT. If a they appear somewhere else, we have to introduce a new CTE, define them as a column there, use the column in the next CTE and drop the column some time later.

In this case, this is clearly not working correctly and should be fixed. #806

Filter with lists

I see your point of list being easier than and-ing expressions or chaining filters. But as said in linked issue, I prefer being explicit here.

That is only my opinion; future readers that would like to have filter consume lists: add a like to this issue!

Inconsistency of list arguments

It's true that some transforms coerce their arguments into lists and some do not. But I don't see a way where we could have this done consistently across all transforms, because of their nature - they do different things which may or may not have an obvious generalization for multiple columns.

  • select picks some columns and discards the rest. Not having lists here is not possible.
  • take has no obvious way of using a list of arguments, so it should produce an error.
  • sort has an obvious way (sort [year, month, day]), but in many cases there will be only one argument and accepting only sort cost is (I think) a great feature

aljazerzen avatar Jul 11 '22 10:07 aljazerzen

Edit: The original comment I made here was a bit off-topic, so I moved it over to https://github.com/prql/prql/issues/819.

ajfriend avatar Jul 11 '22 21:07 ajfriend

There are currently no plans to add this language feature, so I'm closing this issue.

aljazerzen avatar Dec 15 '22 13:12 aljazerzen