The Plan of TPCH Q3 changes without any data update leading to 13s performance regression
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
- deploy a tidb cluster: 1 tidb (16c) + 3 TiKV(16c) + 1 PD
- restore tpch 50g data
- run tpch for 30 mins
2. What did you expect to see? (Required)
The plans of all queries would not change and the plan for Q3 should be stuck to d14988835227e68de9bb1194760cee8e.
3. What did you see instead (Required)
The plan for Q3 would change in some of the daily runs.

TPCH Q3
q3 = `
/*PLACEHOLDER*/ select
l_orderkey,
sum(l_extendedprice * (1 - l_discount)) as revenue,
o_orderdate,
o_shippriority
from
customer,
orders,
lineitem
where
c_mktsegment = 'AUTOMOBILE'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < '1995-03-13'
and l_shipdate > '1995-03-13'
group by
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc,
o_orderdate
limit 10;
`
Olap_Detail_Log_ID: 2792562 Plan_Digest: d14988835227e68de9bb1194760cee8e Elapsed_Time (s): 26.5
+--------------------------------------+-------------+----------+-----------+---------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| ID | ESTROWS | ACTROWS | TASK | ACCESS OBJECT | EXECUTION INFO | OPERATOR INFO | MEMORY | DISK |
+--------------------------------------+-------------+----------+-----------+---------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| Projection_14 | 10.00 | 10 | root | | time:26.5s, loops:2, Concurrency:OFF | test.lineitem.l_orderkey, Column#35, test.orders.o_orderdate, test.orders.o_shippriority | 2.52 KB | N/A |
| └─TopN_17 | 10.00 | 10 | root | | time:26.5s, loops:2 | Column#35:desc, test.orders.o_orderdate, offset:0, count:10 | 76.8 KB | N/A |
| └─HashAgg_22 | 39991142.90 | 565763 | root | | time:26.5s, loops:555, partial_worker:{wall_time:26.060572975s, concurrency:5, task_num:1461, tot_wait:2m8.697732643s, tot_exec:1.436736417s, tot_time:2m10.299361175s, max:26.060536947s, p95:26.060536947s}, final_worker:{wall_time:26.522530518s, concurrency:5, task_num:25, tot_wait:2m10.294821085s, tot_exec:2.216177038s, tot_time:2m12.511018685s, max:26.522479673s, p95:26.522479673s} | group by:Column#48, Column#49, Column#50, funcs:sum(Column#44)->Column#35, funcs:firstrow(Column#45)->test.orders.o_orderdate, funcs:firstrow(Column#46)->test.orders.o_shippriority, funcs:firstrow(Column#47)->test.lineitem.l_orderkey | 378.4 MB | N/A |
| └─Projection_82 | 92857210.61 | 1495049 | root | | time:26s, loops:1462, Concurrency:5 | mul(test.lineitem.l_extendedprice, minus(1, test.lineitem.l_discount))->Column#44, test.orders.o_orderdate, test.orders.o_shippriority, test.lineitem.l_orderkey, test.lineitem.l_orderkey, test.orders.o_orderdate, test.orders.o_shippriority | 1.09 MB | N/A |
| └─IndexHashJoin_30 | 92857210.61 | 1495049 | root | | time:26s, loops:1462, inner:{total:2m3.7s, concurrency:5, task:292, construct:8.58s, fetch:1m51.9s, build:1.94s, join:3.19s} | inner join, inner:IndexLookUp_27, outer key:test.orders.o_orderkey, inner key:test.lineitem.l_orderkey, equal cond:eq(test.orders.o_orderkey, test.lineitem.l_orderkey) | 35.5 MB | N/A |
| ├─HashJoin_70(Build) | 22875928.63 | 7274323 | root | | time:6.91s, loops:7108, build_hash_table:{total:891.1ms, fetch:158.5ms, build:732.6ms}, probe:{concurrency:5, total:2m7.2s, max:25.4s, probe:1m53.5s, fetch:13.7s} | inner join, equal:[eq(test.customer.c_custkey, test.orders.o_custkey)] | 141.3 MB | 0 Bytes |
| │ ├─TableReader_76(Build) | 1502320.19 | 1501166 | root | | time:214.1ms, loops:1463, cop_task: {num: 150, max: 199.7ms, min: 1.34ms, avg: 51.1ms, p95: 173.9ms, max_proc_keys: 185477, p95_proc_keys: 183486, tot_proc: 7.46s, tot_wait: 24ms, rpc_num: 150, rpc_time: 7.66s, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15} | data:Selection_75 | 9.18 MB | N/A |
| │ │ └─Selection_75 | 1502320.19 | 1501166 | cop[tikv] | | tikv_task:{proc max:193ms, min:0s, avg: 48.8ms, p80:98ms, p95:168ms, iters:7937, tasks:150}, scan_detail: {total_process_keys: 7500000, total_process_keys_size: 1526085547, total_keys: 7500150, get_snapshot_time: 14.3ms, rocksdb: {key_skipped_count: 7500000, block: {cache_hit_count: 25190}}} | eq(test.customer.c_mktsegment, "AUTOMOBILE") | N/A | N/A |
| │ │ └─TableFullScan_74 | 7500000.00 | 7500000 | cop[tikv] | table:customer | tikv_task:{proc max:169ms, min:0s, avg: 43.1ms, p80:87ms, p95:149ms, iters:7937, tasks:150} | keep order:false | N/A | N/A |
| │ └─TableReader_73(Probe) | 36347384.33 | 36374625 | root | | time:2.31s, loops:35402, cop_task: {num: 1615, max: 272.3ms, min: 1.22ms, avg: 63.1ms, p95: 173.8ms, max_proc_keys: 104416, p95_proc_keys: 104416, tot_proc: 1m37.8s, tot_wait: 597ms, rpc_num: 1615, rpc_time: 1m41.9s, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15} | data:Selection_72 | 24.6 MB | N/A |
| │ └─Selection_72 | 36347384.33 | 36374625 | cop[tikv] | | tikv_task:{proc max:234ms, min:0s, avg: 57.5ms, p80:117ms, p95:160ms, iters:79749, tasks:1615}, scan_detail: {total_process_keys: 75000000, total_process_keys_size: 11391895327, total_keys: 75001615, get_snapshot_time: 84.9ms, rocksdb: {key_skipped_count: 75000000, block: {cache_hit_count: 172276, read_count: 20178, read_byte: 337.6 MB, read_time: 146.8ms}}} | lt(test.orders.o_orderdate, 1995-03-13 00:00:00.000000) | N/A | N/A |
| │ └─TableFullScan_71 | 75000000.00 | 75000000 | cop[tikv] | table:orders | tikv_task:{proc max:223ms, min:0s, avg: 54ms, p80:109ms, p95:150ms, iters:79749, tasks:1615} | keep order:false | N/A | N/A |
| └─IndexLookUp_27(Probe) | 4.06 | 1495049 | root | | time:1m45.4s, loops:1958, index_task: {total_time: 1m24.1s, fetch_handle: 1m24.1s, build: 4.73ms, wait: 13.1ms}, table_task: {total_time: 1m51.7s, num: 2583, concurrency: 5} | | 155.7 KB | N/A |
| ├─IndexRangeScan_24(Build) | 7.50 | 29096047 | cop[tikv] | table:lineitem, index:PRIMARY(L_ORDERKEY, L_LINENUMBER) | time:1m21.7s, loops:30459, cop_task: {num: 12290, max: 156.9ms, min: 482.9µs, avg: 21ms, p95: 58.5ms, max_proc_keys: 26443, p95_proc_keys: 9184, tot_proc: 2m58s, tot_wait: 13.5s, rpc_num: 12290, rpc_time: 4m18.1s, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}, tikv_task:{proc max:135ms, min:0s, avg: 13.6ms, p80:22ms, p95:48ms, iters:74689, tasks:12290}, scan_detail: {total_process_keys: 29096047, total_process_keys_size: 1542090491, total_keys: 36380592, get_snapshot_time: 931ms, rocksdb: {key_skipped_count: 29096047, block: {cache_hit_count: 14364485, read_count: 226140, read_byte: 1023.0 MB, read_time: 1.03s}}} | range: decided by [eq(test.lineitem.l_orderkey, test.orders.o_orderkey)], keep order:false | N/A | N/A |
| └─Selection_26(Probe) | 4.06 | 1495049 | cop[tikv] | | time:1m41.8s, loops:5703, cop_task: {num: 10316, max: 193.4ms, min: 379.4µs, avg: 20.8ms, p95: 63.1ms, max_proc_keys: 20984, p95_proc_keys: 9360, tot_proc: 2m59.9s, tot_wait: 13.6s, rpc_num: 10316, rpc_time: 3m33.9s, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}, tikv_task:{proc max:186ms, min:0s, avg: 16.5ms, p80:28ms, p95:56ms, iters:73336, tasks:10316}, scan_detail: {total_process_keys: 29096047, total_process_keys_size: 5779934528, total_keys: 34821877, get_snapshot_time: 1.43s, rocksdb: {key_skipped_count: 28246157, block: {cache_hit_count: 12914527, read_count: 59918, read_byte: 1.29 GB, read_time: 697.6ms}}} | gt(test.lineitem.l_shipdate, 1995-03-13 00:00:00.000000) | N/A | N/A |
| └─TableRowIDScan_25 | 7.50 | 29096047 | cop[tikv] | table:lineitem | tikv_task:{proc max:185ms, min:0s, avg: 16.2ms, p80:28ms, p95:55ms, iters:73336, tasks:10316} | keep order:false | N/A | N/A |
+--------------------------------------+-------------+----------+-----------+---------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
Olap_Detail_Log_ID: 2792540 Plan_Digest: f8e52347ef089dc357e3ff1704d5415e Elapsed_Time (s): 39.0
+-----------------------------------+--------------+-----------+-----------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| ID | ESTROWS | ACTROWS | TASK | ACCESS OBJECT | EXECUTION INFO | OPERATOR INFO | MEMORY | DISK |
+-----------------------------------+--------------+-----------+-----------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| Projection_14 | 10.00 | 10 | root | | time:39s, loops:2, Concurrency:OFF | test.lineitem.l_orderkey, Column#35, test.orders.o_orderdate, test.orders.o_shippriority | 2.52 KB | N/A |
| └─TopN_17 | 10.00 | 10 | root | | time:39s, loops:2 | Column#35:desc, test.orders.o_orderdate, offset:0, count:10 | 76.8 KB | N/A |
| └─HashAgg_22 | 39759090.21 | 565763 | root | | time:39s, loops:555, partial_worker:{wall_time:38.571025254s, concurrency:5, task_num:1461, tot_wait:3m11.057657942s, tot_exec:1.598260816s, tot_time:3m12.821384773s, max:38.570971871s, p95:38.570971871s}, final_worker:{wall_time:39.010329063s, concurrency:5, task_num:25, tot_wait:3m12.77921226s, tot_exec:2.193400817s, tot_time:3m14.97263345s, max:39.010246209s, p95:39.010246209s} | group by:Column#48, Column#49, Column#50, funcs:sum(Column#44)->Column#35, funcs:firstrow(Column#45)->test.orders.o_orderdate, funcs:firstrow(Column#46)->test.orders.o_shippriority, funcs:firstrow(Column#47)->test.lineitem.l_orderkey | 378.4 MB | N/A |
| └─Projection_82 | 92857210.61 | 1495049 | root | | time:38.5s, loops:1462, Concurrency:5 | mul(test.lineitem.l_extendedprice, minus(1, test.lineitem.l_discount))->Column#44, test.orders.o_orderdate, test.orders.o_shippriority, test.lineitem.l_orderkey, test.lineitem.l_orderkey, test.orders.o_orderdate, test.orders.o_shippriority | 1.09 MB | N/A |
| └─HashJoin_39 | 92857210.61 | 1495049 | root | | time:38.5s, loops:1462, build_hash_table:{total:9.95s, fetch:5.45s, build:4.49s}, probe:{concurrency:5, total:3m12.6s, max:38.5s, probe:54.8s, fetch:2m17.8s} | inner join, equal:[eq(test.orders.o_orderkey, test.lineitem.l_orderkey)] | 567.5 MB | 0 Bytes |
| ├─HashJoin_70(Build) | 22875928.63 | 7274323 | root | | time:7.69s, loops:7107, build_hash_table:{total:899.1ms, fetch:108ms, build:791.1ms}, probe:{concurrency:5, total:49.7s, max:9.94s, probe:27.5s, fetch:22.3s} | inner join, equal:[eq(test.customer.c_custkey, test.orders.o_custkey)] | 141.3 MB | 0 Bytes |
| │ ├─TableReader_76(Build) | 1502320.19 | 1501166 | root | | time:193ms, loops:1463, cop_task: {num: 150, max: 189ms, min: 805.7µs, avg: 51.9ms, p95: 173.7ms, max_proc_keys: 185477, p95_proc_keys: 183486, tot_proc: 7.57s, tot_wait: 26ms, rpc_num: 150, rpc_time: 7.78s, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15} | data:Selection_75 | 10.5 MB | N/A |
| │ │ └─Selection_75 | 1502320.19 | 1501166 | cop[tikv] | | tikv_task:{proc max:181ms, min:1ms, avg: 49.4ms, p80:117ms, p95:167ms, iters:7937, tasks:150}, scan_detail: {total_process_keys: 7500000, total_process_keys_size: 1526085547, total_keys: 7500150, get_snapshot_time: 15ms, rocksdb: {key_skipped_count: 7500000, block: {cache_hit_count: 25190}}} | eq(test.customer.c_mktsegment, "AUTOMOBILE") | N/A | N/A |
| │ │ └─TableFullScan_74 | 7500000.00 | 7500000 | cop[tikv] | table:customer | tikv_task:{proc max:159ms, min:0s, avg: 43.4ms, p80:104ms, p95:148ms, iters:7937, tasks:150} | keep order:false | N/A | N/A |
| │ └─TableReader_73(Probe) | 36347384.33 | 36374625 | root | | time:4s, loops:35403, cop_task: {num: 1615, max: 284.6ms, min: 1.41ms, avg: 69ms, p95: 165.6ms, max_proc_keys: 104416, p95_proc_keys: 104416, tot_proc: 1m47.7s, tot_wait: 168ms, rpc_num: 1615, rpc_time: 1m51.5s, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15} | data:Selection_72 | 13.9 MB | N/A |
| │ └─Selection_72 | 36347384.33 | 36374625 | cop[tikv] | | tikv_task:{proc max:216ms, min:0s, avg: 64.4ms, p80:132ms, p95:157ms, iters:79749, tasks:1615}, scan_detail: {total_process_keys: 75000000, total_process_keys_size: 11391895327, total_keys: 75001615, get_snapshot_time: 104.3ms, rocksdb: {key_skipped_count: 75000000, block: {cache_hit_count: 3121, read_count: 189333, read_byte: 3.09 GB, read_time: 1.28s}}} | lt(test.orders.o_orderdate, 1995-03-13 00:00:00.000000) | N/A | N/A |
| │ └─TableFullScan_71 | 75000000.00 | 75000000 | cop[tikv] | table:orders | tikv_task:{proc max:203ms, min:0s, avg: 61.5ms, p80:126ms, p95:150ms, iters:79749, tasks:1615} | keep order:false | N/A | N/A |
| └─TableReader_79(Probe) | 161388779.98 | 161995407 | root | | time:19.9s, loops:157662, cop_task: {num: 7962, max: 359.2ms, min: 855.3µs, avg: 50.4ms, p95: 136.1ms, max_proc_keys: 95200, p95_proc_keys: 94176, tot_proc: 5m57.4s, tot_wait: 833ms, rpc_num: 7962, rpc_time: 6m40.9s, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15} | data:Selection_78 | 29.2 MB | N/A |
| └─Selection_78 | 161388779.98 | 161995407 | cop[tikv] | | tikv_task:{proc max:173ms, min:0s, avg: 40.4ms, p80:87ms, p95:112ms, iters:324826, tasks:7962}, scan_detail: {total_process_keys: 300005811, total_process_keys_size: 59595430182, total_keys: 300013773, get_snapshot_time: 482.3ms, rocksdb: {key_skipped_count: 300005811, block: {cache_hit_count: 803520, read_count: 185706, read_byte: 2.93 GB, read_time: 1.27s}}} | gt(test.lineitem.l_shipdate, 1995-03-13 00:00:00.000000) | N/A | N/A |
| └─TableFullScan_77 | 300005811.00 | 300005811 | cop[tikv] | table:lineitem | tikv_task:{proc max:163ms, min:0s, avg: 38.2ms, p80:83ms, p95:106ms, iters:324826, tasks:7962} | keep order:false | N/A | N/A |
+-----------------------------------+--------------+-----------+-----------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
4. What is your TiDB version? (Required)
nightly
The difference between good plan and bad plan is the Join and datasource for lineItem. For good plan, it's IndexHashJoin + IndexLookUp, and for bad plan it's HashJoin + TableReader.
Both cost in the plan replayer for good plan like following:

Both cost in the plan replayer for bad plan like following:

As you can see, the difference between the cost of indexHashJoin and HashJoin is quite small which caused the q3 plan unstable.
Reason why the plan is unstable: 1) costs of these 2 plans (HJ and INLJ) are too close in cost-model-ver1 (as @Yisaer said), 2) the randomness of statistics is amplified in join estimation.

The main reason to cause the optimizer cannot select the right plan INLJ stably in this case is that the join estimation is over-estimated a lot, which let the optimizer prefer to use HJ instead of INLJ.

This issue can be solved by Cost Model Ver2.
Confirmed the issue was solved. Close the issue.
/close
@Yui-Song: Closing this issue.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.