matrixone [Tech Request]: Improve testing of Query Plans for Master/Vector/Secondary Index

Is there an existing issue for the same feature request?

[X] I have checked the existing issues.

Is your feature request related to a problem?

The performance of index queries is not covered by regression tests. We need a strategy to test:
- Plan generated by Index Queries
- Data stored inside the Index Hidden table
- Performance metrics of Index Queries over different runs

Describe the feature you'd like

Possible approaches to solve this

- Modify the BVT framework to support plan comparison.
- Run a Python script to validate the data stored in the hidden table.
- Run a Python script to validate the QPS of index queries.
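
As a rough illustration of what plan comparison could look like, here is a minimal Python sketch that compares two text plans after masking volatile numbers such as costs and row counts. The plan text, the helper names (`normalize_plan`, `plans_match`) and the idea of masking numbers are all assumptions for illustration; this is not how the BVT framework works today.

```python
import re

# Standalone numbers such as cost=10.5 or rows=100. Digits that are part of
# an identifier (for example the "3" in "t3") are left untouched.
_NUMBER = re.compile(r"(?<![\w.])\d+(?:\.\d+)?")


def normalize_plan(plan_text):
    """Strip indentation, drop blank lines and mask standalone numbers."""
    lines = []
    for line in plan_text.splitlines():
        line = line.strip()
        if line:
            lines.append(_NUMBER.sub("N", line))
    return lines


def plans_match(expected, actual):
    """Return True if both plans have the same shape once numbers are masked."""
    return normalize_plan(expected) == normalize_plan(actual)


# Hypothetical plan text, for illustration only.
EXPECTED = "Project\n  -> Table Scan on db.t3 (cost=10.5 rows=100)"
ACTUAL = "Project\n  -> Table Scan on db.t3 (cost=12.0 rows=98)"

if __name__ == "__main__":
    print(plans_match(EXPECTED, ACTUAL))
```

Masking numbers keeps the comparison stable across runs where statistics differ, while a changed operator or table still shows up as a mismatch.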

Describe implementation you've considered

No response

Documentation, Adoption, Use Case, Migration Strategy

No response

Additional information

No response

Mar 19 '24 00:03 arjunsk

As suggested by @fengttt, one approach is to support EXPLAIN FORMAT=JSON, for example:

EXPLAIN FORMAT=JSON select * from t3;

The resulting JSON string can then be queried with a JSON path to check, for example, whether BlockFilter is present.

Reference

JSON Extract

SELECT *
FROM your_table
WHERE JSON_EXTRACT(your_json_column, '$.your_json_path') REGEXP 'your_regex_pattern';

Mar 19 '24 00:03 arjunsk

I don't think we support functions such as REGEX or LENGTH on the query plan output of "EXPLAIN".
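
If such functions are not available on the server side, the check could instead be done on the client. The sketch below fetches nothing itself; it only shows how a JSON plan string could be searched for a marker such as BlockFilter. The JSON shape in `SAMPLE_PLAN` and the function name `plan_mentions` are hypothetical; the real EXPLAIN FORMAT=JSON output may look different.

```python
import json


def plan_mentions(plan_json, needle):
    """Return True if `needle` occurs in any key or string value of the plan."""

    def walk(node):
        if isinstance(node, dict):
            return any(needle in key or walk(value) for key, value in node.items())
        if isinstance(node, list):
            return any(walk(item) for item in node)
        return isinstance(node, str) and needle in node

    return walk(json.loads(plan_json))


# Hypothetical plan shape, for illustration only.
SAMPLE_PLAN = json.dumps(
    {
        "steps": [
            {
                "node": "Table Scan",
                "labels": [{"name": "BlockFilter", "value": "a = 1"}],
            }
        ]
    }
)

if __name__ == "__main__":
    print(plan_mentions(SAMPLE_PLAN, "BlockFilter"))
```

Searching every key and value avoids depending on an exact JSON path, at the cost of being less precise about where the marker appears.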

Apr 02 '24 00:04 arjunsk

A. For Vector Index

Search Query Correctness BVT is improved by using precise coordinates: https://github.com/matrixorigin/matrixone/blob/main/test/distributed/cases/array/array_index_knn.sql
Vector Insert Performance will be monitored with this test: https://github.com/matrixorigin/matrixone/issues/15018
Vector QPS performance needs to be evaluated with this test: https://github.com/matrixorigin/matrixone/issues/15781

B. For Master Index

No plan yet.

C. For Secondary Index

No plan yet.

Apr 29 '24 03:04 arjunsk

I recommend using random vector generation with the same seed, so that we can always generate the same set of vectors without storing the data in the database.

See the code here, https://github.com/cpegeric/wiki-benchmark/blob/main/python/indextest.py
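
Independently of that script, the idea of reproducible vector generation can be sketched in a few lines of standard-library Python. The function name `generate_vectors` and the default seed are assumptions for illustration and are not taken from indextest.py.

```python
import random
from typing import List


def generate_vectors(count: int, dim: int, seed: int = 42) -> List[List[float]]:
    """Generate `count` vectors of dimension `dim`, reproducibly for a given seed."""
    # A private generator keeps the sequence independent of the global random state.
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(count)]


if __name__ == "__main__":
    first = generate_vectors(3, 4, seed=7)
    second = generate_vectors(3, 4, seed=7)
    print(first == second)
```

Because the build step and the recall step can regenerate identical vectors from the seed alone, the expected results never need to be stored alongside the test.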

The script supports the following:

Build an hnsw/ivfflat index with a chosen op_type (vector_ip_ops, vector_cosine_ops, vector_l2_ops) and any dimension:

% python3 indextest.py build localhost eric ivf4 ivf4 vector_cosine_ops 3072 100000 ivfflat
create index ivf4 using ivfflat on ivf4(embed) lists=100 op_type "vector_cosine_ops"
create index time =  50.51455583330244  sec

Run recall to check the recall rate. Our target recall rate is > 0.9:

% python3 indextest.py recall localhost eric ivf4 ivf4 vector_cosine_ops 3072 1000 ivfflat
start generate 1000 vectors
dataset generated and start search.
recall rate =  1.0 , elapsed =  88.41401337459683  sec,  88.41401337459683  ms/row, qps =  11.310424239686427
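
For reference, the reported numbers can be reproduced with simple arithmetic. The sketch below assumes recall is the fraction of queries whose expected row id appears in the returned results; that definition and the function names are assumptions, and the script may compute recall differently.

```python
from typing import Sequence, Tuple


def recall_rate(expected_ids: Sequence[int], returned_ids: Sequence[Sequence[int]]) -> float:
    """Fraction of queries whose expected id appears among the returned ids."""
    if not expected_ids:
        raise ValueError("expected_ids must not be empty")
    hits = sum(1 for exp, got in zip(expected_ids, returned_ids) if exp in got)
    return hits / len(expected_ids)


def throughput(num_queries: int, elapsed_sec: float) -> Tuple[float, float]:
    """Return (queries per second, milliseconds per query)."""
    return num_queries / elapsed_sec, elapsed_sec * 1000.0 / num_queries


if __name__ == "__main__":
    # 1000 queries in the elapsed time from the run above.
    qps, ms_per_row = throughput(1000, 88.41401337459683)
    print(qps, ms_per_row)
    print(recall_rate([1, 2, 3, 4], [[1], [2], [9], [4]]) > 0.9)
```

With 1000 queries, the elapsed time in seconds and the ms/row figure are numerically equal, which is why the log line shows the same value twice.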

Mar 18 '25 11:03 cpegeric

The IVF index benchmark test has now been added to the regression runs: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/16753103693

Aug 06 '25 07:08 heni02