prql
prql copied to clipboard
count this vs count column
What's up?
It seems that using
from tbl | select {count colA}
translates to COUNT(*)
This is rather counter intuitive and feels like lack of consistency. Moreover it is as well incorrect when colA contains NULL values, which are counted by COUNT(*) but not by COUNT(colA).
IMO it is reasonable to translate this query into COUNT(colA) and update documentation to suggest to use
from tbl | select {count this}
to get COUNT(*).
When support for count distinct ~~will be added~~ (just noticed it is already there as count_distinct), then inconsistency reported here will be even more clearly visible because (working example in #3632):
count colAwill give same answer ascount this:
COUNT(colA) == COUNT(*)
count distinct colAwill give different answer thancount distinct this(even if there are no NULLs):
COUNT(DISTINCT colA) != COUNT(DISTINCT t.*)
I'm trying to recall whether counting nulls was a deliberate decision or not. It's not specifically discussed in Null handling.
@aljazerzen do you recall / have a view?
That's a part of an unfinished feature. Here is the long-term plan: https://github.com/PRQL/prql/issues/3139#issuecomment-1655383639
I've been working with python arrays a lot in the past few weeks. If we want a smaller change, one option would be to have a len / length function that calculates the length of arrays, while still skipping nulls in count. (No strong view on doing this — just raising as an option)
count * will cover what length array would do. It includes nulls always, on the other hand count col never includes nulls. Not sure if it make sense to include new function when this behavior can be simply inherited from SQL.
count * will cover what length array would do. It includes nulls always, on the other hand count col never includes nulls. Not sure if it make sense to include new function when this behavior can be simply inherited from SQL.
I don't think these are great semantics though — that a table with a single column bar.foo has different results for count(bar.*) and count(bar.foo)...
It's not perfect, but that's how SQL settled don't this behavior and there is no inconsistency there, it is just not completely intuitive. If we don't follow SQL it will ended up being even less intuitive.
That's been a common dilemma in PRQL lang design: should we design for a user that knows SQL and just wants pipelines, or do we assume that the user does not know SQL at all and has only basic programming language knowledge.
Initially, we were mostly working with the former assumption, but in many recent cases I feel that it's a shame that we'd inherit suboptimal design from a language, that we aim to replace.
This does increase the transition cost of migration to PRQL, but I hope we can mitigate this with useful error messages that guide incorrect patterns that SQL people would expect toward our new design.
Agree, if diverting from SQL then error would be nice, so there is no way for unexpected results to sneak in. Another way that could be incorporated is to create a poll/blog post/twitter or whatever asking for users opinion. This also has advantage of building up community by engaging users to vote.
+1 @jangorecki
One thing we could start with is "How are PRQL's language semantics different from SQL" — i.e. less focused on the pipeline approach, and more on things like count and null. That will also help garner feedback on whether these are long-awaited clearing of cobwebs vs. annoying small differences.