matrixone icon indicating copy to clipboard operation
matrixone copied to clipboard

[Bug]: vector index didn't take effect

Open heni02 opened this issue 1 month ago • 1 comments

Is there an existing issue for the same bug?

  • [x] I have checked the existing issues.

Branch Name

v3.0.4-hotfix

Commit ID

6b326bfc7579da1019291cdd9ffc5d771a4806e8

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

以下向量查询的 sql 执行计划索引没有生效

sql1:
mysql> EXPLAIN SELECT md5_id FROM ca_comprehensive_dataset ORDER BY l2_distance(question_vector, (SELECT question_vector FROM ca_comprehensive_dataset AS ref WHERE md5_id='0000392a0ed2cf8989c4db9177532db7')) LIMIT 10;
+-------------------------------------------------------------------------------------------------------+
| AP QUERY PLAN ON ONE CN(64 core)                                                                      |
+-------------------------------------------------------------------------------------------------------+
| Project                                                                                               |
|   ->  Sort                                                                                            |
|         Sort Key: l2_distance(ca_comprehensive_dataset.question_vector, ref.question_vector) INTERNAL |
|         Limit: 10                                                                                     |
|         ->  Join                                                                                      |
|               Join Type: SINGLE                                                                       |
|               ->  Table Scan on anli_test.ca_comprehensive_dataset                                    |
|               ->  Table Scan on anli_test.ca_comprehensive_dataset                                    |
|                     Filter Cond: (ref.md5_id = '0000392a0ed2cf8989c4db9177532db7')                    |
|                     Block Filter Cond: (ref.md5_id = '0000392a0ed2cf8989c4db9177532db7')              |
+-------------------------------------------------------------------------------------------------------+

sql2:
SQL:
SELECT a.md5_id,
                           LEFT(a.question, 80) AS question_preview,
                           l2_distance(a.question_vector, ref_vec.question_vector) AS distance
                    FROM ca_comprehensive_dataset AS a
                    JOIN (
                        SELECT question_vector
                        FROM ca_comprehensive_dataset
                        WHERE md5_id = '70eaf7d252f5d4436ade9633340fbb2d'
                    ) AS ref_vec ON 1=1
                    WHERE a.question_vector IS NOT NULL
                    ORDER BY distance ASC
                    LIMIT 10
执行计划:
Project
  ->  Sort
        Sort Key: l2_distance(a.question_vector, ref_vec.question_vector) ASC
        Limit: 10
        ->  Project
              ->  Join
                    Join Type: INNER
                    ->  Table Scan on anli_test.ca_comprehensive_dataset
                          Filter Cond: true, (a.question_vector IS NOT NULL)
                    ->  Table Scan on anli_test.ca_comprehensive_dataset
                          Filter Cond: true, (ca_comprehensive_dataset.md5_id = '70eaf7d252f5d4436ade9633340fbb2d')
                          Block Filter Cond: (ca_comprehensive_dataset.md5_id = '70eaf7d252f5d4436ade9633340fbb2d')
sql3:
WITH sampled AS (
                        SELECT md5_id, question, question_vector
                        FROM ca_comprehensive_dataset
                        WHERE content_type = '品牌实力'
                        ORDER BY created_at DESC
                        LIMIT 200
                    )
                    SELECT a.md5_id AS md5_a,
                           b.md5_id AS md5_b,
                           l2_distance(a.question_vector, b.question_vector) AS distance
                    FROM sampled a
                    JOIN sampled b ON a.md5_id < b.md5_id
                    ORDER BY distance ASC
                    LIMIT 10
执行计划:
Project
  ->  Sort
        Sort Key: l2_distance(a.question_vector, b.question_vector) ASC
        Limit: 10
        ->  Project
              ->  Join
                    Join Type: INNER
                    Join Cond: (ca_comprehensive_dataset.md5_id < ca_comprehensive_dataset.md5_id)
                    ->  Sort
                          Sort Key: ca_comprehensive_dataset.created_at DESC
                          Limit: 200
                          ->  Join
                                Join Type: INDEX
                                Join Cond: (ca_comprehensive_dataset.md5_id = #[1,0])
                                Runtime Filter Build: #[-1,0]
                                ->  Table Scan on anli_test.ca_comprehensive_dataset [ForceOneCN]
                                      Filter Cond: (ca_comprehensive_dataset.content_type = '品牌实力')
                                      Runtime Filter Probe: ca_comprehensive_dataset.md5_id
                                ->  Index Table Scan on ca_comprehensive_dataset.idx_comprehensive_content_type [ForceOneCN]
                                      Filter Cond: prefix_eq(#[0,0])
                                      Block Filter Cond: prefix_eq(#[0,0])
                    ->  Sort
                          Sort Key: ca_comprehensive_dataset.created_at DESC
                          Limit: 200
                          ->  Join
                                Join Type: INDEX
                                Join Cond: (ca_comprehensive_dataset.md5_id = #[1,0])
                                Runtime Filter Build: #[-1,0]
                                ->  Table Scan on anli_test.ca_comprehensive_dataset [ForceOneCN]
                                      Filter Cond: (ca_comprehensive_dataset.content_type = '品牌实力')
                                      Runtime Filter Probe: ca_comprehensive_dataset.md5_id
                                ->  Index Table Scan on ca_comprehensive_dataset.idx_comprehensive_content_type [ForceOneCN]
                                      Filter Cond: prefix_eq(#[0,0])
                                      Block Filter Cond: prefix_eq(#[0,0])

Expected Behavior

No response

Steps to Reproduce

ddl:
SET foreign_key_checks = 0;set experimental_fulltext_index=1;set experimental_ivf_index=1;drop database if exists anli_test;create database anli_test;use anli_test;CREATE TABLE ca_comprehensive_dataset (md5_id varchar(255) NOT NULL,question text DEFAULT NULL,answer json DEFAULT NULL,source_type varchar(255) DEFAULT NULL,content_type varchar(255) DEFAULT NULL,keyword varchar(255) DEFAULT NULL,question_vector vecf64(1024) DEFAULT NULL COMMENT '摘要的向量集',allow_access varchar(511) DEFAULT NULL,allow_identities varchar(512) DEFAULT NULL,delete_flag int DEFAULT NULL,created_at timestamp DEFAULT CURRENT_TIMESTAMP(),updated_at timestamp DEFAULT CURRENT_TIMESTAMP() ON UPDATE CURRENT_TIMESTAMP(),PRIMARY KEY (md5_id),KEY idx_comprehensive_allow_access (allow_access),KEY idx_comprehensive_allow_identities (allow_identities),KEY idx_comprehensive_content_type (content_type));

load data url s3option {'endpoint'='http://cos.ap-guangzhou.myqcloud.com','access_key_id'='xx','secret_access_key'='xx','bucket'='mo-load-guangzhou-1308875761', 'filepath'='mo-big-data/ca_ai_ca_comprehensive_dataset.csv'} into table ca_comprehensive_dataset FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\n' PARALLEL 'TRUE';

Additional information

No response

heni02 avatar Nov 26 '25 09:11 heni02

向量参数通过子查询传入的查询,难以用到目前方式实现的向量索引。

已经改成两个查询来达到需求。

aunjgr avatar Dec 15 '25 06:12 aunjgr