[Tech Request]: Improve testing of Query Plans for Master/Vector/Secondary Index

Open arjunsk opened this issue 1 year ago • 5 comments

Is there an existing issue for the same feature request?

  • [X] I have checked the existing issues.

Is your feature request related to a problem?

The performance of index queries is not covered by the regression tests. We need a strategy to test:
- Plan generated by Index Queries
- Data stored inside the Index Hidden table
- Performance metrics of Index Queries over different runs

Describe the feature you'd like

Possible approaches to solve this

  • Modifying the BVT framework to support plan comparison.
  • Running a Python script to validate the data stored in the hidden table.
  • Running a Python script to validate the QPS of index queries.

Describe implementation you've considered

No response

Documentation, Adoption, Use Case, Migration Strategy

No response

Additional information

No response

arjunsk avatar Mar 19 '24 00:03 arjunsk

As suggested by @fengttt, one approach is to support

EXPLAIN FORMAT=JSON select * from t3;

The resulting JSON string can then be queried, for example with a JSON path, to check whether BlockFilter is present.

Reference

  1. JSON Extract
SELECT *
FROM your_table
WHERE JSON_EXTRACT(your_json_column, '$.your_json_path') REGEXP 'your_regex_pattern';
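
A minimal Python sketch of the check described above, assuming the plan has already been fetched as a JSON string and parsed. The layout of the sample plan and the helper name plan_mentions are hypothetical. The actual EXPLAIN FORMAT=JSON output may be structured differently, so the search makes no assumption about where BlockFilter appears.

```python
import json


def plan_mentions(plan, needle):
    """Return True if `needle` occurs in any key or string value of a parsed plan."""
    if isinstance(plan, dict):
        return any(
            needle in str(key) or plan_mentions(value, needle)
            for key, value in plan.items()
        )
    if isinstance(plan, list):
        return any(plan_mentions(item, needle) for item in plan)
    # Leaf node: only string values can contain the marker.
    return isinstance(plan, str) and needle in plan


# Hypothetical plan text, used only to exercise the helper.
sample_plan = json.loads(
    '{"steps": [{"node": "Table Scan", "info": ["BlockFilter: a = 1"]}]}'
)
has_block_filter = plan_mentions(sample_plan, "BlockFilter")
```

A test case could then assert on has_block_filter without depending on the exact text of the plan.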

arjunsk avatar Mar 19 '24 00:03 arjunsk

I don't think we support functions such as REGEX or LENGTH on the query plan output from "EXPLAIN".

arjunsk avatar Apr 02 '24 00:04 arjunsk

A. For Vector Index

  1. Search Query Correctness BVT is improved by using precise coordinates: https://github.com/matrixorigin/matrixone/blob/main/test/distributed/cases/array/array_index_knn.sql
  2. Vector Insert Performance will be monitored with this test: https://github.com/matrixorigin/matrixone/issues/15018
  3. Vector QPS performance needs to be evaluated with this test: https://github.com/matrixorigin/matrixone/issues/15781

B. For Master Index: no plan yet

C. For Secondary Index: no plan yet

arjunsk avatar Apr 29 '24 03:04 arjunsk

I recommend generating random vectors with a fixed seed, so that the same set of vectors can be regenerated on every run without storing the data in the database.

See the code here: https://github.com/cpegeric/wiki-benchmark/blob/main/python/indextest.py
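
The fixed-seed idea can be sketched with the Python standard library alone. This is an illustration, not an excerpt from the linked indextest.py. The function name generate_vectors and the default seed value are assumptions.

```python
import random


def generate_vectors(count, dim, seed=42):
    """Return `count` pseudo-random vectors of length `dim`.

    The same (count, dim, seed) always yields the same vectors, so the
    dataset can be rebuilt on demand instead of being stored.
    """
    rng = random.Random(seed)  # private generator, independent of global state
    return [[rng.random() for _ in range(dim)] for _ in range(count)]


first_run = generate_vectors(3, 4)
second_run = generate_vectors(3, 4)  # identical to first_run
```

Using a private random.Random instance rather than random.seed keeps the sequence reproducible even if other code draws from the global generator.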

The indextest.py script supports the following:

  1. Build an hnsw/ivfflat index with any op_type (vector_ip_ops, vector_cosine_ops, vector_l2_ops) and any dimension
% python3 indextest.py build localhost eric ivf4 ivf4 vector_cosine_ops 3072 100000 ivfflat
create index ivf4 using ivfflat on ivf4(embed) lists=100 op_type "vector_cosine_ops"
create index time =  50.51455583330244  sec

  2. Run recall to check the recall rate. Our target recall rate is > 0.9
% python3 indextest.py recall localhost eric ivf4 ivf4 vector_cosine_ops 3072 1000 ivfflat
start generate 1000 vectors
dataset generated and start search.
recall rate =  1.0 , elapsed =  88.41401337459683  sec,  88.41401337459683  ms/row, qps =  11.310424239686427
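
The figures in the output above are related as shown in the following sketch. It assumes recall is counted as a top-1 match per query, which may not be how the script defines it, and the helper names recall_rate and timing_stats are hypothetical.

```python
def recall_rate(expected_ids, returned_ids):
    """Fraction of queries whose returned id equals the expected id (top-1, assumed)."""
    if not expected_ids:
        raise ValueError("no queries to evaluate")
    hits = sum(1 for exp, got in zip(expected_ids, returned_ids) if exp == got)
    return hits / len(expected_ids)


def timing_stats(elapsed_sec, n_queries):
    """Return (milliseconds per query, queries per second) for a timed run."""
    ms_per_row = elapsed_sec * 1000.0 / n_queries
    qps = n_queries / elapsed_sec
    return ms_per_row, qps


# Elapsed time and query count taken from the run shown above.
ms_per_row, qps = timing_stats(88.41401337459683, 1000)
```

With 1000 queries, the per-row latency in milliseconds is numerically equal to the elapsed time in seconds, which is why both values coincide in the log line.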

cpegeric avatar Mar 18 '25 11:03 cpegeric

The IVF index benchmark test has now been added to the regression runs: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/16753103693

heni02 avatar Aug 06 '25 07:08 heni02