cockroach
cockroach copied to clipboard
opt: collect stats for virtual computed columns
Once #68186 is addressed, we will no longer collect stats for virtual computed columns. This can prevent the optimizer from accurately calculating the cost of scanning an index on a virtual column, and, transitively, the cost of scanning an expression index.
Here's an example:
CREATE TABLE t (k INT PRIMARY KEY, a INT, INDEX ((a%3)));
INSERT INTO t SELECT i, 1 FROM generate_series (1, 10) as s(i);
INSERT INTO t SELECT i, 2 FROM generate_series (11, 100) as s(i);
INSERT INTO t SELECT i, 3 FROM generate_series (101, 1000) as s(i);
ANALYZE t;
EXPLAIN SELECT k FROM t WHERE a%3 = 0;
info
---------------------------------------------------------------------------------
distribution: local
vectorized: true
• scan
estimated row count: 10 (1.0% of the table; stats collected 12 seconds ago)
table: t@t_expr_idx
spans: [/0 - /0]
(7 rows)
EXPLAIN SELECT k FROM t WHERE a%3 = 1;
info
---------------------------------------------------------------------------------
distribution: local
vectorized: true
• scan
estimated row count: 10 (1.0% of the table; stats collected 15 seconds ago)
table: t@t_expr_idx
spans: [/1 - /1]
(7 rows)
EXPLAIN SELECT k FROM t WHERE a%3 = 2;
info
---------------------------------------------------------------------------------
distribution: local
vectorized: true
• scan
estimated row count: 10 (1.0% of the table; stats collected 17 seconds ago)
table: t@t_expr_idx
spans: [/2 - /2]
(7 rows)
Notice that the estimated row count for each scan is the same, despite the fact that there are 10 rows where a%3 = 1
, 90 rows where a%3 = 2
, and 900 rows where a%3 = 0
.
In order to better estimated row counts in queries like this, we will have to:
- Collect statistics for virtual computed columns. Currently column statistics are sampled from the primary index. Virtual columns are not included in primary indexes, so there will be somewhere to sample virtual columns from indexes on them.
- Propagate virtual column statistics in statistics builder. A canonical scan on a primary index does not produce a virtual column. Instead, virtual column values are calculated at query execution time by wrapping a scan with a projection that produces the virtual column. Scans only contain statistics for columns they produce, so the challenge will be either producing virtual column statistics from a scan or from the projection that produces them.
Jira issue: CRDB-13888
Epic: CRDB-8949
We have some customer requests to get this done in 23.2
Also see https://github.com/cockroachdb/cockroach/issues/85176 and https://github.com/cockroachdb/cockroach/issues/78181 which are similar.
Another example of this coming up with an expression index:
CREATE TABLE t (i INT PRIMARY KEY, j JSONB);
INSERT INTO t SELECT i, json_build_object('a', i % 10) FROM generate_series(0, 9999) AS s(i);
-- Say we wanted to find rows with j->'a' = '3' but we did not want to index all of j.
-- In 23.2 we could use a forward index on j->'a' instead of an inverted index:
CREATE INDEX ON t ((j->'a'));
ANALYZE t;
-- This does use the index, but the lack of stats means we estimate 10 rows instead of 1000:
EXPLAIN SELECT * FROM t WHERE j->'a' = '3';
-- A workaround until stats are collected on virtual computed columns is to index a stored computed column:
DROP INDEX t_expr_idx;
ALTER TABLE t ADD COLUMN ja JSONB AS (j->'a') STORED;
CREATE INDEX ON t (ja);
ANALYZE t;
-- Now this uses the index and has good stats:
EXPLAIN SELECT * FROM t WHERE j->'a' = '3';
Hi @michae2, please add branch-* labels to identify which branch(es) this GA-blocker affects.
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.