dolt
dolt copied to clipboard
Incorrect result from aggregating on table joined in both subquery and CTE
Hi, earlier I tried to do a quick COUNT(*) to spot check some join logic in a CTE. The output appears to be a query planning bug? Seems to be skipping my join condition entirely.
My db is here:
https://www.dolthub.com/repositories/snipesjr/npi_registry
Subquery example (bad):
SELECT p.taxonomy_code AS taxonomy_classification
, COUNT(*) AS provider_count # gives incorrect count
# , COUNT(DISTINCT p.npi) AS provider_count # gives query plan error
FROM (
SELECT pro.npi
, IFNULL(ext.taxonomy_code, 'UNKNOWN') AS taxonomy_code
FROM npi_provider pro
LEFT JOIN npi_provider_extra ext
ON (pro.npi = ext.npi
AND ext.primary_taxonomy_switch IN ('Y', 'X'))
) AS p
GROUP BY p.taxonomy_code
ORDER BY COUNT(*) DESC;
CTE example (bad):
WITH providers AS (
SELECT pro.npi
, IFNULL(ext.taxonomy_code, 'UNKNOWN') AS taxonomy_code
FROM npi_provider pro
LEFT JOIN npi_provider_extra ext
ON (pro.npi = ext.npi
AND ext.primary_taxonomy_switch IN ('Y', 'X'))
)
SELECT p.taxonomy_code
, COUNT(*) AS provider_count
FROM providers AS p
GROUP BY p.taxonomy_code
ORDER BY COUNT(*) DESC;
Non-subquery/CTE (good):
SELECT IFNULL(ext.taxonomy_code, 'UNKNOWN') AS taxonomy_code
, COUNT(*) AS provider_count
FROM npi_provider pro
LEFT JOIN npi_provider_extra ext
ON (pro.npi = ext.npi
AND ext.primary_taxonomy_switch IN ('Y', 'X'))
GROUP BY IFNULL(ext.taxonomy_code, 'UNKNOWN')
ORDER BY COUNT(*) DESC
My dolt version:
$ dolt version
dolt version 0.41.3
database storage format: OLD ( __LD_1__ )
I tried to test the query in the web ui but my query timed out.
Update. The subquery works in dolt sql-client
, but returns wrong result in Pycharm. Could be driver related?
DBMS: MySQL (ver. 5.7.9-Vitess)
Case sensitivity: plain=exact, delimited=exact
Driver: MySQL Connector/J (ver. mysql-connector-java-8.0.25 (Revision: 08be9e9b4cba6aa115f9b27b215887af40b159e0), JDBC4.2)
Effective version: Vitess (ver. 0.0)
Nice bug. @max-hoffman has been grinding on CTE issues so he'll grab it once he frees up a bit.
@snipesjr thank you for the report! I would appreciate any additional debugging context. For example, if the queries are returning incorrect results, sharing how the results are different than expected, or how the EXPLAIN WITH cte ...
output plans are different with your connector would make it easier for us to repro on our side.
Note, Max ran the queries in withdolt sql
and they all seem to work/return the same results:
![image](https://user-images.githubusercontent.com/5376236/191629540-5320cfb1-141a-49ba-b5d6-86f7fca4d987.png)
So, we also suspect some sort of client issue. We need more information to keep debugging.
Yes, for sure. I will send the output that I am getting from within pycharm tomorrow (new laptop). Dolt’s cli client returned the correct results, so it might be an issue elsewhere. Nevertheless, wanted to report it since others connecting using jetbrains products might bump into it.
We love the report! Just want to get it fixed for you :-)
@snipesjr Just to clarify, query plans or the modified query strings might be easier to share than full result sets. If you set --trace debug
, the server should log query strings, which might be different than what you thought you ran (ex: they added a filter or limit clause). You can get query plans by running EXPLAIN SELECT ...
. Either of these should let us reproduce the different result sets you see.
Sharing first 10 records from CTE example (will share the rest):
taxonomy_code | provider_count |
---|---|
UNKNOWN | 18 |
207R00000X | 17 |
207Q00000X | 17 |
1223G0001X | 15 |
225100000X | 14 |
1041C0700X | 13 |
367500000X | 12 |
111N00000X | 12 |
183500000X | 11 |
101YP2500X | 10 |
Explain CTE (PyCharm):
plan |
---|
Limit(10) |
└─ Project |
├─ columns: [p.taxonomy_code, COUNT(*) as provider_count] |
└─ Sort(COUNT(*) DESC) |
└─ GroupBy |
├─ SelectedExprs(p.taxonomy_code, COUNT(*)) |
├─ Grouping(p.taxonomy_code) |
└─ SubqueryAlias(p) |
└─ Project |
├─ columns: [pro.npi, ifnull(ext.taxonomy_code, 'UNKNOWN') as taxonomy_code] |
└─ LeftIndexedJoin((pro.npi = ext.npi) AND (ext.primary_taxonomy_switch IN ('Y', 'X'))) |
├─ Exchange |
│ └─ TableAlias(pro) |
│ └─ Table(npi_provider) |
│ └─ columns: [npi] |
└─ TableAlias(ext) |
└─ IndexedTableAccess(npi_provider_extra) |
├─ index: [npi_provider_extra.npi,npi_provider_extra.idx] |
└─ columns: [npi taxonomy_code primary_taxonomy_switch] |
Also noticed that the CTE returns different result each time I run it. Here's the log I get with mysql-connector-java:
$ dolt sql-server -u jrs -p password --config dolt.cfg
Starting server with Config HP="localhost:3306"|T="28800000"|R="false"|L="debug"
2022-09-21T20:26:56-04:00 INFO [conn 1] NewConnection {DisableClientMultiStatements=false}
2022-09-21T20:27:02-04:00 INFO [conn 2] NewConnection {DisableClientMultiStatements=false}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=/* mysql-connector-java-8.0.25 (Revision: 08be9e9b4cba6aa115f9b27b215887af40b159e0) */SELECT @@session.auto_increment_increment AS auto_increment_increment, @@character_set_client AS character_set_client, @@character_set_connection AS character_set_connection, @@character_set_results AS character_set_results, @@character_set_server AS character_set_server, @@collation_server AS collation_server, @@collation_connection AS collation_connection, @@init_connect AS init_connect, @@interactive_timeout AS interactive_timeout, @@license AS license, @@lower_case_table_names AS lower_case_table_names, @@max_allowed_packet AS max_allowed_packet, @@net_write_timeout AS net_write_timeout, @@performance_schema AS performance_schema, @@query_cache_size AS query_cache_size, @@query_cache_type AS query_cache_type, @@sql_mode AS sql_mode, @@system_time_zone AS system_time_zone, @@time_zone AS time_zone, @@tx_isolation AS transaction_isolation, @@wait_timeout AS wait_timeout}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 1 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=/* mysql-connector-java-8.0.25 (Revision: 08be9e9b4cba6aa115f9b27b215887af40b159e0) */SELECT @@session.auto_increment_increment AS auto_increment_increment, @@character_set_client AS character_set_client, @@character_set_connection AS character_set_connection, @@character_set_results AS character_set_results, @@character_set_server AS character_set_server, @@collation_server AS collation_server, @@collation_connection AS collation_connection, @@init_connect AS init_connect, @@interactive_timeout AS interactive_timeout, @@license AS license, @@lower_case_table_names AS lower_case_table_names, @@max_allowed_packet AS max_allowed_packet, @@net_write_timeout AS net_write_timeout, @@performance_schema AS performance_schema, @@query_cache_size AS query_cache_size, @@query_cache_type AS query_cache_type, @@sql_mode AS sql_mode, @@system_time_zone AS system_time_zone, @@time_zone AS time_zone, @@tx_isolation AS transaction_isolation, @@wait_timeout AS wait_timeout}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SET NAMES utf8mb4}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SET NAMES utf8mb4}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SET character_set_results = utf8}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SET character_set_results = utf8}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SET autocommit=1}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SET autocommit=1}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=select version(), @@version_comment, database()}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 1 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=select version(), @@version_comment, database()}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=/* ApplicationName=PyCharm 2022.2.2 */ set session transaction read write}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=/* ApplicationName=PyCharm 2022.2.2 */ set session transaction read write}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SELECT @@session.tx_read_only}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 1 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SELECT @@session.tx_read_only}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=/* ApplicationName=PyCharm 2022.2.2 */ use npi_registry}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 2 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ use npi_registry}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 1 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ select database()}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ select database()}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 1 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ SET net_write_timeout=600}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ SET net_write_timeout=600}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ SET SQL_SELECT_LIMIT=501}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 1 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ SET SQL_SELECT_LIMIT=501}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ WITH providers AS ( SELECT pro.npi , IFNULL(ext.taxonomy_code, 'UNKNOWN') AS taxonomy_code FROM npi_provider pro LEFT JOIN npi_provider_extra ext ON (pro.npi = ext.npi AND ext.primary_taxonomy_switch IN ('Y', 'X')) ) SELECT p.taxonomy_code , COUNT(*) AS provider_count FROM providers AS p GROUP BY p.taxonomy_code ORDER BY COUNT(*) DESC LIMIT 10}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 284 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ WITH providers AS ( SELECT pro.npi , IFNULL(ext.taxonomy_code, 'UNKNOWN') AS taxonomy_code FROM npi_provider pro LEFT JOIN npi_provider_extra ext ON (pro.npi = ext.npi AND ext.primary_taxonomy_switch IN ('Y', 'X')) ) SELECT p.taxonomy_code , COUNT(*) AS provider_count FROM providers AS p GROUP BY p.taxonomy_code ORDER BY COUNT(*) DESC LIMIT 10}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 2 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
Thank you for the additional details @snipesjr ! Made it much easier to figure out what's going on here.
I think the temporary workaround is to use ORDER BY provider_count
:
> WITH providers AS (
SELECT pro.npi
, IFNULL(ext.taxonomy_code, 'UNKNOWN') AS taxonomy_code
FROM npi_provider pro
LEFT JOIN npi_provider_extra ext
ON (pro.npi = ext.npi
AND ext.primary_taxonomy_switch IN ('Y', 'X'))
)
SELECT p.taxonomy_code
, COUNT(*) AS provider_count
FROM providers AS p
GROUP BY p.taxonomy_code
ORDER BY provider_count DESC
LIMIT 10;
+---------------+----------------+
| taxonomy_code | provider_count |
+---------------+----------------+
| UNKNOWN | 1914495 |
| 183500000X | 197214 |
| 1041C0700X | 196532 |
| 225100000X | 185787 |
| 101YM0800X | 185663 |
| 390200000X | 170594 |
| 106S00000X | 169258 |
| 207Q00000X | 146182 |
| 363LF0000X | 139425 |
| 207R00000X | 136415 |
+---------------+----------------+
10 rows in set (370.44 sec)
With the COUNT(*) query, we appear to push the sort too far. So instead of the results being sorted on provider_count
, the results are being sorted randomly because it's counting the whole groupby result set:
-- old
Limit(10)
└─ Project
├─ columns: [p.taxonomy_code, COUNT(*) as provider_count]
└─ Sort(COUNT(*) DESC)
└─ GroupBy
-- new
Limit(10)
└─ TopN(Limit: [10]; provider_count DESC)
└─ Project
├─ columns: [p.taxonomy_code, COUNT(*) as provider_count]
└─ GroupBy
This does not explain why the two queries are equivalent without the Limit, so I need to do more testing. I am working on a smaller repro to dig more and confirm that MySQL infers that the two COUNT(*) are the same expression.
Thanks for the update!
Unfortunately, swapping to provider_count
has the same result (random sort). The query plan does change though:
plan |
---|
Limit(10) |
└─ TopN(Limit: [10]; provider_count DESC) |
└─ Project |
├─ columns: [p.taxonomy_code, COUNT(*) as provider_count] |
└─ GroupBy |
├─ SelectedExprs(p.taxonomy_code, COUNT(*)) |
├─ Grouping(p.taxonomy_code) |
└─ SubqueryAlias(p) |
└─ Project |
├─ columns: [pro.npi, ifnull(ext.taxonomy_code, 'UNKNOWN') as taxonomy_code] |
└─ LeftIndexedJoin((pro.npi = ext.npi) AND (ext.primary_taxonomy_switch IN ('Y', 'X'))) |
├─ Exchange |
│ └─ TableAlias(pro) |
│ └─ Table(npi_provider) |
│ └─ columns: [npi] |
└─ TableAlias(ext) |
└─ IndexedTableAccess(npi_provider_extra) |
├─ index: [npi_provider_extra.npi,npi_provider_extra.idx] |
└─ columns: [npi taxonomy_code primary_taxonomy_switch] |
Interestingly, I get the random sort without the limit as well; however I see that pycharm is doing a SET SQL SELECT_LIMIT=DEFAULT
before executing my query.
Looks like we are selecting 501 random rows from the left join. We can manually disable parallelism (SINGLE_THREAD_FEATURE_FLAG dolt sql
), which will give deterministic results, but I presume you want the full result set from the join before applying any limits.
Running a few comparison queries against MySQL, I think the correct behavior is to 1) only apply the default limit to the top-level query, and 2) use a query's LIMIT clause instead of the default when provided.
Alright two bugs:
-
We need to infer the COUNT(*) sort alias and/or avoid pushing the sort below its enforcement scope
-
sql_select_limit
should only do one thing, apply a limit to the top-level scope if it does not already have one. We should not apply limits to arbitrary scopes.
In the meantime, disabling sql_select_limit
should be a workaround for (2).
Thank you for the follow up information! Let us know if there are any other bugs we should clean up here.
@snipesjr I think this one should fix the sql_select_limit
issue, available in the next release:
https://github.com/dolthub/go-mysql-server/pull/1276
Let us know if you run into any more trouble.
The sql_select_limit
fix should be included in the most recent release here https://github.com/dolthub/dolt/releases/tag/v0.41.6.