prql count this vs count column

trafficstars

What's up?

It seems that using

from tbl | select {count colA}

translates to COUNT(*)

This is rather counter intuitive and feels like lack of consistency. Moreover it is as well incorrect when colA contains NULL values, which are counted by COUNT(*) but not by COUNT(colA). IMO it is reasonable to translate this query into COUNT(colA) and update documentation to suggest to use

from tbl | select {count this}

to get COUNT(*).

When support for count distinct ~~will be added~~ (just noticed it is already there as count_distinct), then inconsistency reported here will be even more clearly visible because (working example in #3632):

count colA will give same answer as count this:

COUNT(colA) == COUNT(*)

count distinct colA will give different answer than count distinct this (even if there are no NULLs):

COUNT(DISTINCT colA) != COUNT(DISTINCT t.*)

Oct 07 '23 10:10 jangorecki

I'm trying to recall whether counting nulls was a deliberate decision or not. It's not specifically discussed in Null handling.

@aljazerzen do you recall / have a view?

Oct 15 '23 09:10 max-sixty

That's a part of an unfinished feature. Here is the long-term plan: https://github.com/PRQL/prql/issues/3139#issuecomment-1655383639

Oct 15 '23 10:10 aljazerzen

I've been working with python arrays a lot in the past few weeks. If we want a smaller change, one option would be to have a len / length function that calculates the length of arrays, while still skipping nulls in count. (No strong view on doing this — just raising as an option)

Oct 16 '23 18:10 max-sixty

count * will cover what length array would do. It includes nulls always, on the other hand count col never includes nulls. Not sure if it make sense to include new function when this behavior can be simply inherited from SQL.

Oct 16 '23 19:10 jangorecki

count * will cover what length array would do. It includes nulls always, on the other hand count col never includes nulls. Not sure if it make sense to include new function when this behavior can be simply inherited from SQL.

I don't think these are great semantics though — that a table with a single column bar.foo has different results for count(bar.*) and count(bar.foo)...

Oct 16 '23 19:10 max-sixty

It's not perfect, but that's how SQL settled don't this behavior and there is no inconsistency there, it is just not completely intuitive. If we don't follow SQL it will ended up being even less intuitive.

Oct 17 '23 06:10 jangorecki

That's been a common dilemma in PRQL lang design: should we design for a user that knows SQL and just wants pipelines, or do we assume that the user does not know SQL at all and has only basic programming language knowledge.

Initially, we were mostly working with the former assumption, but in many recent cases I feel that it's a shame that we'd inherit suboptimal design from a language, that we aim to replace.

This does increase the transition cost of migration to PRQL, but I hope we can mitigate this with useful error messages that guide incorrect patterns that SQL people would expect toward our new design.

Oct 17 '23 07:10 aljazerzen

Agree, if diverting from SQL then error would be nice, so there is no way for unexpected results to sneak in. Another way that could be incorporated is to create a poll/blog post/twitter or whatever asking for users opinion. This also has advantage of building up community by engaging users to vote.

Oct 17 '23 08:10 jangorecki

+1 @jangorecki

One thing we could start with is "How are PRQL's language semantics different from SQL" — i.e. less focused on the pipeline approach, and more on things like count and null. That will also help garner feedback on whether these are long-awaited clearing of cobwebs vs. annoying small differences.

Oct 19 '23 20:10 max-sixty

prql prql copied to clipboard

count this vs count column

What's up?

prql
prql copied to clipboard