[Tech Request]: Improve testing of Query Plans for Master/Vector/Secondary Index
Is there an existing issue for the same feature request?
- [X] I have checked the existing issues.
Is your feature request related to a problem?
The performance of Index queries is not covered by regression tests. We need a strategy to test:
- Plan generated by Index Queries
- Data stored inside the Index Hidden table
- Performance metrics of Index Queries over different runs
Describe the feature you'd like
Possible approaches to solve this
- Modification to BVT framework to do Plan comparison
- Running a Python script to validate the data stored in the hidden table.
- Running a Python script to validate the QPS of Index queries.
Describe implementation you've considered
No response
Documentation, Adoption, Use Case, Migration Strategy
No response
Additional information
No response
As suggested by @fengttt, one approach is to support
EXPLAIN FORMAT=JSON select * from t3; and then query the resulting JSON string (for example with a JSON path) to check whether a node such as BlockFilter is present.
Reference
- JSON Extract
SELECT *
FROM your_table
WHERE JSON_EXTRACT(your_json_column, '$.your_json_path') REGEXP 'your_regex_pattern';
I don't think we support functions such as REGEXP or LENGTH on the query plan output from "EXPLAIN".
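If the SQL-side functions are not available, the check could instead run on the client. The following is a minimal sketch, assuming the test harness can fetch the EXPLAIN FORMAT=JSON output as a string. The sample plan layout and the helper name plan_contains are hypothetical; the real JSON structure may differ.

```python
import json


def plan_contains(node, needle):
    """Return True if `needle` occurs in any key or string value of a parsed plan."""
    if isinstance(node, dict):
        return any(
            needle in str(key) or plan_contains(value, needle)
            for key, value in node.items()
        )
    if isinstance(node, list):
        return any(plan_contains(item, needle) for item in node)
    return isinstance(node, str) and needle in node


# Hypothetical plan text, used only for illustration.
# The actual EXPLAIN FORMAT=JSON layout is an assumption here.
sample_plan_text = """
{"steps": [{"name": "Project",
            "children": [{"name": "Table Scan",
                          "labels": [{"name": "BlockFilter",
                                      "value": "t3.a = 1"}]}]}]}
"""
sample_plan = json.loads(sample_plan_text)

has_block_filter = plan_contains(sample_plan, "BlockFilter")
has_sort = plan_contains(sample_plan, "Sort")
```

Searching the whole tree rather than a fixed JSON path keeps the check from breaking when nodes move within the plan, at the cost of being less precise about where the filter was applied.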
A. For Vector Index
- Search Query Correctness BVT is improved by using precise coordinates: https://github.com/matrixorigin/matrixone/blob/main/test/distributed/cases/array/array_index_knn.sql
- Vector Insert Performance will be monitored with this test: https://github.com/matrixorigin/matrixone/issues/15018
- Vector QPS performance needs to be evaluated with this test: https://github.com/matrixorigin/matrixone/issues/15781
B. For Master Index: no plan yet
C. For Secondary Index: no plan yet
I recommend using random vector generation with the same seed, so that we can always generate the same set of vectors without storing the data in the database.
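A minimal sketch of the seeded-generation idea, using only the Python standard library. The function name generate_vectors and the default seed are illustrative assumptions, not taken from the linked script.

```python
import random


def generate_vectors(count, dim, seed=42):
    """Generate `count` vectors of dimension `dim`, reproducibly for a given seed."""
    # A private Random instance avoids touching the global random state.
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(count)]


# The same seed reproduces the same dataset, so the build step and the
# recall step can regenerate identical vectors instead of storing them.
build_vectors = generate_vectors(5, 8, seed=7)
query_vectors = generate_vectors(5, 8, seed=7)
```

Because the dataset is a pure function of (count, dim, seed), a regression run only needs to record those three values to be repeatable.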
See the script here: https://github.com/cpegeric/wiki-benchmark/blob/main/python/indextest.py
The script supports the following:
- build hnsw/ivfflat index with different op_type (vector_ip_ops, vector_cosine_ops, vector_l2_ops), any dimension
% python3 indextest.py build localhost eric ivf4 ivf4 vector_cosine_ops 3072 100000 ivfflat
create index ivf4 using ivfflat on ivf4(embed) lists=100 op_type "vector_cosine_ops"
create index time = 50.51455583330244 sec
- run recall to check the recall rate. Our target recall rate is > 0.9
% python3 indextest.py recall localhost eric ivf4 ivf4 vector_cosine_ops 3072 1000 ivfflat
start generate 1000 vectors
dataset generated and start search.
recall rate = 1.0 , elapsed = 88.41401337459683 sec, 88.41401337459683 ms/row, qps = 11.310424239686427
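For reference, the metrics in the output above could be computed roughly as follows. This is a sketch under assumptions: recall is taken as the fraction of exact (brute-force, L2) neighbors that the index search also returns, which may differ from how indextest.py defines it, and all function names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_k(query, dataset, k):
    """Brute-force ground truth: indices of the k nearest dataset vectors."""
    ranked = sorted(range(len(dataset)), key=lambda i: l2_distance(query, dataset[i]))
    return ranked[:k]


def recall_rate(expected_ids, returned_ids):
    """Fraction of ground-truth neighbors that the index search returned."""
    hits = sum(len(set(e) & set(r)) for e, r in zip(expected_ids, returned_ids))
    total = sum(len(e) for e in expected_ids)
    return hits / total if total else 0.0


def throughput(num_queries, elapsed_sec):
    """Return (queries per second, milliseconds per query)."""
    return num_queries / elapsed_sec, 1000.0 * elapsed_sec / num_queries


dataset = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
truth = [exact_top_k([0.1, 0.1], dataset, k=2)]

# Pretend the index returned ids 0 and 2 for the single query.
recall = recall_rate(truth, [[0, 2]])

# Figures from the log above: 1000 queries in about 88.41 seconds.
qps, ms_per_query = throughput(1000, 88.41401337459683)
```

A regression check could then assert recall > 0.9 and compare qps against a baseline from earlier runs.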
The regression suite now includes the ivf index benchmark test: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/16753103693