tidb The estimation of string/bytes types mv index within a single bucket of the histogram is not accurate

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

SELECT 
    ->   *
    -> FROM
    ->   items
    -> WHERE
    ->   57260686 member of (feed_profile_ids)
    ->   AND "[OC8p1763XTkt.org/s/link](http://oc8p1763xtkt.org/s/link)" member of (normalized_short_link_clusters)
    ->   AND json_overlaps(
    ->     product_sources,
    ->     '["M1","M2","M3"]'
    ->   )
    -> LIMIT
    ->   1 ;

2. What did you expect to see? (Required)


| id                             | estRows | estCost | actRows | task      | access object                                                                   | execution info                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | operator info                                                                                                                            | memory  | disk  |
| Limit_10                       | 0.00    | 673.26  | 0       | root      |                                                                                 | time:13.4ms, loops:1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | offset:0, count:10                                                                                                                       | N/A     | N/A   |
| └─Selection_11                 | 0.00    | 673.26  | 0       | root      |                                                                                 | time:13.4ms, loops:1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | json_overlaps(test.items.product_sources, cast("["FEED_MERCHANT","MERCHANT_UI","CAROUSEL_CRAWLER"]", json BINARY))                       | 33.3 KB | N/A   |
|   └─IndexMerge_18              | 0.00    | 673.26  | 1       | root      |                                                                                 | time:13.3ms, loops:2, index_task:{fetch_handle:3.455965ms, merge:5.423µs}, table_task:{num:1, concurrency:5, fetch_row:9.799043ms, wait_time:3.514198ms}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | type: union                                                                                                                              | 49.9 KB | N/A   |
|     ├─IndexRangeScan_15(Build) | 8.30    | 2145.61 | 10      | cop[tikv] | table:items, index:feed_profile_ids(cast(`feed_profile_ids` as unsigned array)) | time:3.45ms, loops:3, cop_task: {num: 1, max: 3.42ms, proc_keys: 10, tot_proc: 569.6µs, tot_wait: 2.56ms, rpc_num: 1, rpc_time: 3.41ms, copr_cache_hit_ratio: 0.00, build_task_duration: 13.2µs, max_distsql_concurrency: 1}, tikv_task:{time:0s, loops:1}, scan_detail: {total_process_keys: 10, total_process_keys_size: 1020, total_keys: 11, get_snapshot_time: 9.56µs, rocksdb: {key_skipped_count: 10, block: {cache_hit_count: 1, read_count: 1, read_byte: 13.4 KB, read_time: 445.9µs}}}                                                                                                                                                                | range:[59405969,59405969], keep order:false, stats:partial[normalized_short_link_clusters:unInitialized, feed_profile_ids:unInitialized] | N/A     | N/A   |
|     └─Selection_17(Probe)      | 0.00    | 5271.80 | 1       | cop[tikv] |                                                                                 | time:9.68ms, loops:2, cop_task: {num: 10, max: 7.95ms, min: 0s, avg: 1.45ms, p95: 7.95ms, max_proc_keys: 1, p95_proc_keys: 1, tot_proc: 3.66ms, tot_wait: 10.4ms, rpc_num: 6, rpc_time: 14.4ms, copr_cache_hit_ratio: 0.00, build_task_duration: 60µs, max_distsql_concurrency: 1, max_extra_concurrency: 3, store_batch_num: 4}, tikv_task:{proc max:1ms, min:0s, avg: 300µs, p80:1ms, p95:1ms, iters:10, tasks:10}, scan_detail: {total_process_keys: 10, total_process_keys_size: 206864, total_keys: 20, get_snapshot_time: 115.4µs, rocksdb: {key_skipped_count: 10, block: {cache_hit_count: 117, read_count: 4, read_byte: 97.2 KB, read_time: 674.7µs}}} | json_memberof(cast("OC8p1171XTkt.net/s/link", json BINARY), test.items.normalized_short_link_clusters)                                   | N/A     | N/A   |
|       └─TableRowIDScan_16      | 8.30    | 4857.52 | 10      | cop[tikv] | table:items                                                                     | tikv_task:{proc max:1ms, min:0s, avg: 300µs, p80:1ms, p95:1ms, iters:10, tasks:10}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | keep order:false, stats:partial[normalized_short_link_clusters:unInitialized, feed_profile_ids:unInitialized]                            | N/A     | N/A   |

3. What did you see instead (Required)


| id                             | estRows | estCost | actRows | task      | access object                                                                                                                        | execution info                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | operator info                                                                                                                                                              | memory  | disk  |
| Limit_10                       | 0.00    | 762.16  | 0       | root      |                                                                                                                                      | time:2.49s, loops:1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | offset:0, count:10                                                                                                                                                         | N/A     | N/A   |
| └─Selection_11                 | 0.00    | 762.16  | 0       | root      |                                                                                                                                      | time:2.49s, loops:1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | json_overlaps(test.items.product_sources, cast("["FEED_MERCHANT","MERCHANT_UI","CAROUSEL_CRAWLER"]", json BINARY))                                                         | 33.3 KB | N/A   |
|   └─IndexMerge_18              | 0.00    | 762.16  | 1       | root      |                                                                                                                                      | time:2.49s, loops:2, index_task:{fetch_handle:1.986440012s, merge:279.519351ms}, table_task:{num:28, concurrency:5, fetch_row:8.100928191s, wait_time:4.101294918s}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | type: union                                                                                                                                                                | 30.5 MB | N/A   |
|     ├─IndexRangeScan_15(Build) | 10.88   | 2883.70 | 492380  | cop[tikv] | table:items, index:normalized_short_link_clusters(cast(`normalized_short_link_clusters` as char(1000) array), point_of_sale_country) | time:1.86s, loops:485, cop_task: {num: 18, max: 427ms, min: 640.3µs, avg: 110.3ms, p95: 427ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 505.5ms, tot_wait: 54.5ms, rpc_num: 18, rpc_time: 1.98s, copr_cache_hit_ratio: 0.00, build_task_duration: 1.08ms, max_distsql_concurrency: 1}, tikv_task:{proc max:400ms, min:0s, avg: 103.3ms, p80:192ms, p95:400ms, iters:552, tasks:18}, scan_detail: {total_process_keys: 492380, total_process_keys_size: 66039117, total_keys: 504126, get_snapshot_time: 127.8µs, rocksdb: {delete_skipped_count: 101, key_skipped_count: 504209, block: {cache_hit_count: 252, read_count: 1552, read_byte: 17.2 MB, read_time: 109.2ms}}}                                | range:["OC8p1632XTkt.net/s/link","OC8p1632XTkt.net/s/link"], keep order:false, stats:partial[normalized_short_link_clusters:unInitialized, feed_profile_ids:unInitialized] | N/A     | N/A   |
|     └─Selection_17(Probe)      | 0.00    | 4615.94 | 1       | cop[tikv] |                                                                                                                                      | time:7.38s, loops:29, cop_task: {num: 265142, max: 2.39ms, min: 0s, avg: 7.14ms, p95: 22.3ms, max_proc_keys: 9, p95_proc_keys: 4, tot_proc: 3m44.6s, tot_wait: 1h18m23.6s, rpc_num: 53197, rpc_time: 31m33s, copr_cache_hit_ratio: 0.00, build_task_duration: 563ms, max_distsql_concurrency: 1, max_extra_concurrency: 320, store_batch_num: 211945}, tikv_task:{proc max:0s, min:0s, avg: 6.41ms, p80:8ms, p95:32ms, iters:265142, tasks:265142}, scan_detail: {total_process_keys: 492380, total_process_keys_size: 10172790635, total_keys: 984760, get_snapshot_time: 24.3s, rocksdb: {key_skipped_count: 492380, block: {cache_hit_count: 8125571, read_count: 645430, read_byte: 13.7 GB, read_time: 1m46.1s}}} | json_memberof(cast(80638367, json BINARY), test.items.feed_profile_ids)                                                                                                    | N/A     | N/A   |
|       └─TableRowIDScan_16      | 10.88   | 4073.03 | 492380  | cop[tikv] | table:items                                                                                                                          | tikv_task:{proc max:0s, min:0s, avg: 6.41ms, p80:8.1ms, p95:32ms, iters:265142, tasks:265142}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | keep order:false, stats:partial[normalized_short_link_clusters:unInitialized, feed_profile_ids:unInitialized]                                                              | N/A     | N/A   |

The estimation of string/bytes types within a single bucket of the histogram is not accurate

4. What is your TiDB version? (Required)

7.6.0

Jan 19 '24 11:01 seiya-annie

We need to find a way to store the single-column stats for the mv index.

Jan 19 '24 12:01 winoros

https://github.com/pingcap/tidb/pull/4601 After this tidb will precision-loss cast a user string as scalar value, so as inBucket's L and R bound, then use uniform distribution to calculate a ratio between these three precision-loss scalar values.

finally we use hg.NotNullCount()/float64(hg.NDV)) as the count est of the prefix json column‘s EQ

Jan 22 '24 13:01 AilinKid

tidb> explain SELECT /*+ use_index_merge(items, normalized_short_link_clusters), IGNORE_PLAN_CACHE() */    *  FROM    items  WHERE    57260686 member of (feed_profile_ids)    AND "OC8p1763XTkt.org/s/link" member of (normalized_short_link_clusters)    AND json_overlaps(      product_sources,      '["FEED_MERCHANT","MERCHANT_UI","CAROUSEL_CRAWLER"]'    )  LIMIT    1;
+------------------------------------+---------+-----------+--------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                                 | estRows | task      | access object                                                                                                                        | operator info                                                                                                                                                              |
+------------------------------------+---------+-----------+--------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit_10                           | 0.00    | root      |                                                                                                                                      | offset:0, count:1                                                                                                                                                          |
| └─Selection_11                     | 0.00    | root      |                                                                                                                                      | json_overlaps(test.items.product_sources, cast("["FEED_MERCHANT","MERCHANT_UI","CAROUSEL_CRAWLER"]", json BINARY))                                                         |
|   └─IndexMerge_15                  | 0.00    | root      |                                                                                                                                      | type: union                                                                                                                                                                |
|     ├─IndexRangeScan_12(Build)     | 10.85   | cop[tikv] | table:items, index:normalized_short_link_clusters(cast(`normalized_short_link_clusters` as char(1000) array), point_of_sale_country) | range:["OC8p1763XTkt.org/s/link","OC8p1763XTkt.org/s/link"], keep order:false, stats:partial[normalized_short_link_clusters:unInitialized, feed_profile_ids:unInitialized] |
|     └─Selection_14(Probe)          | 0.00    | cop[tikv] |                                                                                                                                      | json_memberof(cast(57260686, json BINARY), test.items.feed_profile_ids)                                                                                                    |
|       └─TableRowIDScan_13          | 10.85   | cop[tikv] | table:items                                                                                                                          | keep order:false, stats:partial[normalized_short_link_clusters:unInitialized, feed_profile_ids:unInitialized]                                                              |
+------------------------------------+---------+-----------+--------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set, 5 warnings (6 min 1.12 sec)

Jan 22 '24 13:01 AilinKid

We need to find a way to store the single-column stats for the mv index.

two strings/bytes fall into the same bucket, use the first Byte of the slice of bytes[commonPrefixLen:] to be cast as scalar value will easily get the same lessCountA and lessCountB as the picture shows, consequently leading to return hg.NotNullCount()/float64(hg.NDV)), seems single column stats has the same problem here.

Jan 22 '24 13:01 AilinKid