The estimation of string/bytes types mv index within a single bucket of the histogram is not accurate
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
SELECT
-> *
-> FROM
-> items
-> WHERE
-> 57260686 member of (feed_profile_ids)
-> AND "[OC8p1763XTkt.org/s/link](http://oc8p1763xtkt.org/s/link)" member of (normalized_short_link_clusters)
-> AND json_overlaps(
-> product_sources,
-> '["M1","M2","M3"]'
-> )
-> LIMIT
-> 1 ;
2. What did you expect to see? (Required)
| id | estRows | estCost | actRows | task | access object | execution info | operator info | memory | disk |
| Limit_10 | 0.00 | 673.26 | 0 | root | | time:13.4ms, loops:1 | offset:0, count:10 | N/A | N/A |
| └─Selection_11 | 0.00 | 673.26 | 0 | root | | time:13.4ms, loops:1 | json_overlaps(test.items.product_sources, cast("["FEED_MERCHANT","MERCHANT_UI","CAROUSEL_CRAWLER"]", json BINARY)) | 33.3 KB | N/A |
| └─IndexMerge_18 | 0.00 | 673.26 | 1 | root | | time:13.3ms, loops:2, index_task:{fetch_handle:3.455965ms, merge:5.423µs}, table_task:{num:1, concurrency:5, fetch_row:9.799043ms, wait_time:3.514198ms} | type: union | 49.9 KB | N/A |
| ├─IndexRangeScan_15(Build) | 8.30 | 2145.61 | 10 | cop[tikv] | table:items, index:feed_profile_ids(cast(`feed_profile_ids` as unsigned array)) | time:3.45ms, loops:3, cop_task: {num: 1, max: 3.42ms, proc_keys: 10, tot_proc: 569.6µs, tot_wait: 2.56ms, rpc_num: 1, rpc_time: 3.41ms, copr_cache_hit_ratio: 0.00, build_task_duration: 13.2µs, max_distsql_concurrency: 1}, tikv_task:{time:0s, loops:1}, scan_detail: {total_process_keys: 10, total_process_keys_size: 1020, total_keys: 11, get_snapshot_time: 9.56µs, rocksdb: {key_skipped_count: 10, block: {cache_hit_count: 1, read_count: 1, read_byte: 13.4 KB, read_time: 445.9µs}}} | range:[59405969,59405969], keep order:false, stats:partial[normalized_short_link_clusters:unInitialized, feed_profile_ids:unInitialized] | N/A | N/A |
| └─Selection_17(Probe) | 0.00 | 5271.80 | 1 | cop[tikv] | | time:9.68ms, loops:2, cop_task: {num: 10, max: 7.95ms, min: 0s, avg: 1.45ms, p95: 7.95ms, max_proc_keys: 1, p95_proc_keys: 1, tot_proc: 3.66ms, tot_wait: 10.4ms, rpc_num: 6, rpc_time: 14.4ms, copr_cache_hit_ratio: 0.00, build_task_duration: 60µs, max_distsql_concurrency: 1, max_extra_concurrency: 3, store_batch_num: 4}, tikv_task:{proc max:1ms, min:0s, avg: 300µs, p80:1ms, p95:1ms, iters:10, tasks:10}, scan_detail: {total_process_keys: 10, total_process_keys_size: 206864, total_keys: 20, get_snapshot_time: 115.4µs, rocksdb: {key_skipped_count: 10, block: {cache_hit_count: 117, read_count: 4, read_byte: 97.2 KB, read_time: 674.7µs}}} | json_memberof(cast("OC8p1171XTkt.net/s/link", json BINARY), test.items.normalized_short_link_clusters) | N/A | N/A |
| └─TableRowIDScan_16 | 8.30 | 4857.52 | 10 | cop[tikv] | table:items | tikv_task:{proc max:1ms, min:0s, avg: 300µs, p80:1ms, p95:1ms, iters:10, tasks:10} | keep order:false, stats:partial[normalized_short_link_clusters:unInitialized, feed_profile_ids:unInitialized] | N/A | N/A |
3. What did you see instead (Required)
| id | estRows | estCost | actRows | task | access object | execution info | operator info | memory | disk |
| Limit_10 | 0.00 | 762.16 | 0 | root | | time:2.49s, loops:1 | offset:0, count:10 | N/A | N/A |
| └─Selection_11 | 0.00 | 762.16 | 0 | root | | time:2.49s, loops:1 | json_overlaps(test.items.product_sources, cast("["FEED_MERCHANT","MERCHANT_UI","CAROUSEL_CRAWLER"]", json BINARY)) | 33.3 KB | N/A |
| └─IndexMerge_18 | 0.00 | 762.16 | 1 | root | | time:2.49s, loops:2, index_task:{fetch_handle:1.986440012s, merge:279.519351ms}, table_task:{num:28, concurrency:5, fetch_row:8.100928191s, wait_time:4.101294918s} | type: union | 30.5 MB | N/A |
| ├─IndexRangeScan_15(Build) | 10.88 | 2883.70 | 492380 | cop[tikv] | table:items, index:normalized_short_link_clusters(cast(`normalized_short_link_clusters` as char(1000) array), point_of_sale_country) | time:1.86s, loops:485, cop_task: {num: 18, max: 427ms, min: 640.3µs, avg: 110.3ms, p95: 427ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 505.5ms, tot_wait: 54.5ms, rpc_num: 18, rpc_time: 1.98s, copr_cache_hit_ratio: 0.00, build_task_duration: 1.08ms, max_distsql_concurrency: 1}, tikv_task:{proc max:400ms, min:0s, avg: 103.3ms, p80:192ms, p95:400ms, iters:552, tasks:18}, scan_detail: {total_process_keys: 492380, total_process_keys_size: 66039117, total_keys: 504126, get_snapshot_time: 127.8µs, rocksdb: {delete_skipped_count: 101, key_skipped_count: 504209, block: {cache_hit_count: 252, read_count: 1552, read_byte: 17.2 MB, read_time: 109.2ms}}} | range:["OC8p1632XTkt.net/s/link","OC8p1632XTkt.net/s/link"], keep order:false, stats:partial[normalized_short_link_clusters:unInitialized, feed_profile_ids:unInitialized] | N/A | N/A |
| └─Selection_17(Probe) | 0.00 | 4615.94 | 1 | cop[tikv] | | time:7.38s, loops:29, cop_task: {num: 265142, max: 2.39ms, min: 0s, avg: 7.14ms, p95: 22.3ms, max_proc_keys: 9, p95_proc_keys: 4, tot_proc: 3m44.6s, tot_wait: 1h18m23.6s, rpc_num: 53197, rpc_time: 31m33s, copr_cache_hit_ratio: 0.00, build_task_duration: 563ms, max_distsql_concurrency: 1, max_extra_concurrency: 320, store_batch_num: 211945}, tikv_task:{proc max:0s, min:0s, avg: 6.41ms, p80:8ms, p95:32ms, iters:265142, tasks:265142}, scan_detail: {total_process_keys: 492380, total_process_keys_size: 10172790635, total_keys: 984760, get_snapshot_time: 24.3s, rocksdb: {key_skipped_count: 492380, block: {cache_hit_count: 8125571, read_count: 645430, read_byte: 13.7 GB, read_time: 1m46.1s}}} | json_memberof(cast(80638367, json BINARY), test.items.feed_profile_ids) | N/A | N/A |
| └─TableRowIDScan_16 | 10.88 | 4073.03 | 492380 | cop[tikv] | table:items | tikv_task:{proc max:0s, min:0s, avg: 6.41ms, p80:8.1ms, p95:32ms, iters:265142, tasks:265142} | keep order:false, stats:partial[normalized_short_link_clusters:unInitialized, feed_profile_ids:unInitialized] | N/A | N/A |
The estimation of string/bytes types within a single bucket of the histogram is not accurate
4. What is your TiDB version? (Required)
7.6.0
We need to find a way to store the single-column stats for the mv index.
https://github.com/pingcap/tidb/pull/4601 After this tidb will precision-loss cast a user string as scalar value, so as inBucket's L and R bound, then use uniform distribution to calculate a ratio between these three precision-loss scalar values.
finally we use hg.NotNullCount()/float64(hg.NDV)) as the count est of the prefix json column‘s EQ
tidb> explain SELECT /*+ use_index_merge(items, normalized_short_link_clusters), IGNORE_PLAN_CACHE() */ * FROM items WHERE 57260686 member of (feed_profile_ids) AND "OC8p1763XTkt.org/s/link" member of (normalized_short_link_clusters) AND json_overlaps( product_sources, '["FEED_MERCHANT","MERCHANT_UI","CAROUSEL_CRAWLER"]' ) LIMIT 1;
+------------------------------------+---------+-----------+--------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+------------------------------------+---------+-----------+--------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit_10 | 0.00 | root | | offset:0, count:1 |
| └─Selection_11 | 0.00 | root | | json_overlaps(test.items.product_sources, cast("["FEED_MERCHANT","MERCHANT_UI","CAROUSEL_CRAWLER"]", json BINARY)) |
| └─IndexMerge_15 | 0.00 | root | | type: union |
| ├─IndexRangeScan_12(Build) | 10.85 | cop[tikv] | table:items, index:normalized_short_link_clusters(cast(`normalized_short_link_clusters` as char(1000) array), point_of_sale_country) | range:["OC8p1763XTkt.org/s/link","OC8p1763XTkt.org/s/link"], keep order:false, stats:partial[normalized_short_link_clusters:unInitialized, feed_profile_ids:unInitialized] |
| └─Selection_14(Probe) | 0.00 | cop[tikv] | | json_memberof(cast(57260686, json BINARY), test.items.feed_profile_ids) |
| └─TableRowIDScan_13 | 10.85 | cop[tikv] | table:items | keep order:false, stats:partial[normalized_short_link_clusters:unInitialized, feed_profile_ids:unInitialized] |
+------------------------------------+---------+-----------+--------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set, 5 warnings (6 min 1.12 sec)
We need to find a way to store the single-column stats for the mv index.
two strings/bytes fall into the same bucket, use the first Byte of the slice of bytes[commonPrefixLen:] to be cast as scalar value will easily get the same lessCountA and lessCountB as the picture shows, consequently leading to return hg.NotNullCount()/float64(hg.NDV)), seems single column stats has the same problem here.