Pagination CountWalker should run a COUNT(*) query instead of COUNT(tbl.id) when HINT_DISTINCT is false
Feature Request
| Q | A |
|---|---|
| New Feature | no |
| RFC | no |
| BC Break | no |
Summary
I'd like to propose that the Doctrine\ORM\Tools\Pagination\CountWalker should create a count query that selects COUNT(*) instead of COUNT(tbl.id) when a query's HINT_DISTINCT is set/declared false. Both "counts" result in the same number being produced, however COUNT(*) allows some databases (e.g. Postgres) to finish the query faster. Please see the following query plans.
// COUNT(tbl.id) case
EXPLAIN SELECT count(o0_.id) AS sclr_0 FROM "order" o0_;
QUERY PLAN
Finalize Aggregate (cost=40598.68..40598.69 rows=1 width=8)
-> Gather (cost=40598.46..40598.67 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=39598.46..39598.47 rows=1 width=8)
-> Parallel Seq Scan on ""order"" o0_ (cost=0.00..38594.17 rows=redacted width=4)
---
// COUNT(*) case
EXPLAIN SELECT count(*) AS sclr_0 FROM "order" o0_;
QUERY PLAN
Finalize Aggregate (cost=36422.71..36422.72 rows=1 width=8)
-> Gather (cost=36422.49..36422.70 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=35422.49..35422.50 rows=1 width=8)
-> Parallel Index Only Scan using idx_f5299398f543e763 on ""order"" o0_ (cost=0.42..34418.20 rows=redacted width=0)
Notice how in the case of COUNT(*), Postgres is counting using an "Index Only Scan". The speed difference isn't substantial (at least for the number of rows I tested with), but it's also quite cheap to obtain (by just making the code use COUNT(*)).
Thank you.
Do you have an explanation as to why the query plan differs from one query to the other, and are you sure that explanation will apply regardless of the database in use? I'm not sure we should introduce more complexity, especially if it is not possible to measure a substantial speed difference. But I agree that "Seq Scan" does not sound good.
Do you have an explanation as to why the query plan differs from one query to the other
I've done some research just now and the short answer appears to be that COUNT(*) gives more opportunities to the database to seek alternative ways to fulfil the query than when COUNT(tbl.id) is used.
My other observations:
- A lot of sources state that
COUNT(column_name)tells the database to count not-null values of the column_name, whileCOUNT(*)tells the database to just count the rows, regardless of their contents. Postgres in particular is sometimes willing to use an index-only scan to fulfil theCOUNT(*)query. The apparent non-obvious thing here is thatCOUNT(*)is an equivalent ofCOUNT(), but it's notCOUNT()because that's a syntax error in SQL. - Despite the above, I found that both COUNT() queries allow Postgres to attempt to use an index-only scan, however
COUNT(*)allows to use any index available on the table in question, whileCOUNT(tbl.id)allows to use only the (usually primary) index on theidcolumn. This for some reason is sometimes less eagerly used if there's been a period of time since the lastvacuum analyzewas carried out on the table in question, because when Ivacuum analyze'd my test table, then Postgres started using an index-only scan also for theCOUNT(tbl.id)query. - The time difference between a "parallel seq scan" and a "parallel index-only scan" seems to be under 20ms for my test tables (in favour of the index-only scan). It makes me believe that Postgres is just accurately concluding that there will be no difference between the two, and just chooses to sometimes use a sequential scan in case of the
COUNT(tbl.id).
are you sure that explanation will apply regardless of the database in use?
I'm not sure but it's very plausible. For what it's worth, chatgpt is under impression (and bullet-points some examples) that most of the databases benefit from being told to do COUNT(*) instead of COUNT(column_name), when the column_name is not nullable (and primary keys aren't supposed to be nullable).
I'm not sure we should introduce more complexity, especially if it is not possible to measure a substantial speed difference.
After going down the rabbit hole here, I'm also less inclined now to make this change. The only standing argument for that change is what I said at the top: COUNT(*) gives more opportunities for optimised querying than COUNT(tbl.id), however, as I later said, I wasn't able to confirm spectacular speed difference in my test data.
For this reason I'm fine to close this ticket now without any code change, unless you'd like (me) to make that change anyway due to the reason I explained above.
Thanks for the very detailed explanation. From what you are telling, I'm starting to think that to observe a significant performance difference, we would have to add a where clause involving columns other than the primary key (I'm leaving aside the case of composite primary keys for the sake of simplicity). Is that what you observe ?
I'm starting to think that to observe a significant performance difference, we would have to add a where clause involving columns other than the primary key (...) Is that what you observe?
Interesting. You're absolutely right. When I add a simple where-condition on a column that is indexed, the index-only scan is never used in the case of COUNT(tbl.id) (regardless of whether the table is vacuum'd or not), but is always used when COUNT(*) is used. In this particular case, I can see 50-75ms speed difference in favour of COUNT(*). Whether that's significant is open to debate.
I'm leaving aside the case of composite primary keys for the sake of simplicity
The CountWalker already throws an exception when the table being queried uses a composite primary key. So we don't need to discuss that case here.
So yeah, still down to you if you wish to carry on with this code change. I'm somewhat more inclined to it now.
Edit: I just tried different where-conditions, and as one could imagine, if the where-condition filters out significant amount of the table, the speed difference is significant. E.g. 0.1s vs 0.001s.
Great! Using Git, can you find out why we are not using (*) in the first place? Likewise, if you switch the code to (*)< do any tests break?
why we are not using (*) in the first place?
Git history shows that the code does what it does since the very beginning (2012) and it wasn't attempted to change it to conditionally doing COUNT(*).
if you switch the code to (*)< do any tests break?
I made that change and no tests failed. In particular, CountWalkerTest didn't fail because it never checks for the case when the HINT_DISTINCT is set to false.
I checked this code change in my webapp and it's effective (i.e. COUNT(*) is run conditionally), and I didn't experience any regressions.
Then please send a PR against the next minor branch, I think I'm OK with this change 👍
Could you tell me whether you mean the 3.3.x or the 3.2.x branch, please?
It that's an option, I would backport it to 2.20.x and then "replay" the code change from 2.20.x to every branch above it.
2.x is only open for bugfixes. This does not look like a bug at all. The next minor branch is 3.3.x (since the last tag is 3.2.1).