prql
prql copied to clipboard
Syntax to `filter` a list of boolean expressions to be ANDed
I often have queries with a long list of filters. Here's a nonsense made up example:
from tbl_a
filter (
foo > 100 and foo < 200 and bar != null and (average bar > 3) and
)
I understand that filter
takes in a single boolean expression, but I'd also say that the most common use case of filter
is to AND
a list of boolean expressions. Would a syntax something like this make sense?
from tbl1
filter [
foo > 100,
foo < 200,
bar != null,
average bar > 3,
]
This has the advantage of being able to break up the long list over multiple lines, and removes the need to type and
all over the place. (And note theres a "trailing AND" issue that's synonymous with the "trailing comma" issue that PRQL already addresses.)
There's also a chance that there's a way to do this with the current syntax that I'm missing. Please let me know if that's the case! I just didn't see many examples in the docs that involved several ANDed filters. I suppose one way to achieve this would be something like
from tbl_a
filter foo > 100
filter foo < 200
filter bar != null
filter (average bar > 3)
which is great, but maybe not as clean as the list
proposal above?
Also maybe interesting, the two (working) examples above seem to compile differently:
WITH table_0 AS (
SELECT
tbl_a.*
FROM
tbl_a
)
SELECT
table_0.*
FROM
table_0
WHERE
foo > 100
AND foo < 200
AND bar IS NOT NULL
AND AVG(bar > 3) OVER ()
WITH table_0 AS (
SELECT
tbl_a.*
FROM
tbl_a
WHERE
foo > 100
AND foo < 200
AND bar IS NOT NULL
)
SELECT
table_0.*
FROM
table_0
WHERE
AVG(bar > 3) OVER ()
And the first one seems to create an unnecessary CTE? (Not that I'm really concerned or complaining about it, I just noticed it and figured it was worth mentioning.)
But maybe its the AVG(bar > 3)
that's kind of weird/nonsenical here...
Yeah, if I drop the average
, they compile the same (without the extra CTE):
SELECT
tbl_a.*
FROM
tbl_a
WHERE
foo > 100
AND foo < 200
AND bar IS NOT NULL
Thanks for the issue @ajfriend .
We touched on this here: https://github.com/prql/prql/issues/469.
I'm open-minded; I tend to agree that a list is quite a logical conclusion from a list that it would be and
, but if it's confusing for some, then having multiple filter
expressions seems reasonable.
Well, consider this a vote for this syntax :)
Although I don't personally like the idea of default filtering implicitly being AND-ed together (I prefer explicit things), it's slightly weird that select
, filter
and derive
all function differently when it comes to square brackets:
- with
derive
, multiple operations can be combined into one usingderive [a1 = b1, a2 = b2]
- with
select
, you can useselect [a, b]
but it's not equivalent toselect a | select b
- with
filter
,[ ... ]
isn't allowed at all
Average in filter
Woah, I didn't expect average
to be used in filter
... I not sure what you wanted to do with that, but we should have some kind of error for what you've done :smile:
When used outside of aggregate
, the function average
(and many others) are considered to be window functions (mapping one column into another column). Because average produces only one value, this value is broadcast to the whole column. This is useful for example here:
from costs
derive normalized_cost = cost / (sum cost)
But broadcasting does not make sense in filter, since all values would have the same value (at least without group/PARTITION BY). So maybe we should throw an error here.
#420 #435 #478
Why two CTEs?
Window functions are actually not allowed anywhere but in SELECT. If a they appear somewhere else, we have to introduce a new CTE, define them as a column there, use the column in the next CTE and drop the column some time later.
In this case, this is clearly not working correctly and should be fixed. #806
Filter with lists
I see your point of list being easier than and
-ing expressions or chaining filters
. But as said in linked issue, I prefer being explicit here.
That is only my opinion; future readers that would like to have filter consume lists: add a like to this issue!
Inconsistency of list arguments
It's true that some transforms coerce their arguments into lists and some do not. But I don't see a way where we could have this done consistently across all transforms, because of their nature - they do different things which may or may not have an obvious generalization for multiple columns.
-
select
picks some columns and discards the rest. Not having lists here is not possible. -
take
has no obvious way of using a list of arguments, so it should produce an error. -
sort
has an obvious way (sort [year, month, day]
), but in many cases there will be only one argument and accepting onlysort cost
is (I think) a great feature
Edit: The original comment I made here was a bit off-topic, so I moved it over to https://github.com/prql/prql/issues/819.
There are currently no plans to add this language feature, so I'm closing this issue.