matrixone
matrixone copied to clipboard
[Bug]: vector index didn't take effect
Is there an existing issue for the same bug?
- [x] I have checked the existing issues.
Branch Name
v3.0.4-hotfix
Commit ID
6b326bfc7579da1019291cdd9ffc5d771a4806e8
Other Environment Information
- Hardware parameters:
- OS type:
- Others:
Actual Behavior
以下向量查询的 sql 执行计划索引没有生效
sql1:
mysql> EXPLAIN SELECT md5_id FROM ca_comprehensive_dataset ORDER BY l2_distance(question_vector, (SELECT question_vector FROM ca_comprehensive_dataset AS ref WHERE md5_id='0000392a0ed2cf8989c4db9177532db7')) LIMIT 10;
+-------------------------------------------------------------------------------------------------------+
| AP QUERY PLAN ON ONE CN(64 core) |
+-------------------------------------------------------------------------------------------------------+
| Project |
| -> Sort |
| Sort Key: l2_distance(ca_comprehensive_dataset.question_vector, ref.question_vector) INTERNAL |
| Limit: 10 |
| -> Join |
| Join Type: SINGLE |
| -> Table Scan on anli_test.ca_comprehensive_dataset |
| -> Table Scan on anli_test.ca_comprehensive_dataset |
| Filter Cond: (ref.md5_id = '0000392a0ed2cf8989c4db9177532db7') |
| Block Filter Cond: (ref.md5_id = '0000392a0ed2cf8989c4db9177532db7') |
+-------------------------------------------------------------------------------------------------------+
sql2:
SQL:
SELECT a.md5_id,
LEFT(a.question, 80) AS question_preview,
l2_distance(a.question_vector, ref_vec.question_vector) AS distance
FROM ca_comprehensive_dataset AS a
JOIN (
SELECT question_vector
FROM ca_comprehensive_dataset
WHERE md5_id = '70eaf7d252f5d4436ade9633340fbb2d'
) AS ref_vec ON 1=1
WHERE a.question_vector IS NOT NULL
ORDER BY distance ASC
LIMIT 10
执行计划:
Project
-> Sort
Sort Key: l2_distance(a.question_vector, ref_vec.question_vector) ASC
Limit: 10
-> Project
-> Join
Join Type: INNER
-> Table Scan on anli_test.ca_comprehensive_dataset
Filter Cond: true, (a.question_vector IS NOT NULL)
-> Table Scan on anli_test.ca_comprehensive_dataset
Filter Cond: true, (ca_comprehensive_dataset.md5_id = '70eaf7d252f5d4436ade9633340fbb2d')
Block Filter Cond: (ca_comprehensive_dataset.md5_id = '70eaf7d252f5d4436ade9633340fbb2d')
sql3:
WITH sampled AS (
SELECT md5_id, question, question_vector
FROM ca_comprehensive_dataset
WHERE content_type = '品牌实力'
ORDER BY created_at DESC
LIMIT 200
)
SELECT a.md5_id AS md5_a,
b.md5_id AS md5_b,
l2_distance(a.question_vector, b.question_vector) AS distance
FROM sampled a
JOIN sampled b ON a.md5_id < b.md5_id
ORDER BY distance ASC
LIMIT 10
执行计划:
Project
-> Sort
Sort Key: l2_distance(a.question_vector, b.question_vector) ASC
Limit: 10
-> Project
-> Join
Join Type: INNER
Join Cond: (ca_comprehensive_dataset.md5_id < ca_comprehensive_dataset.md5_id)
-> Sort
Sort Key: ca_comprehensive_dataset.created_at DESC
Limit: 200
-> Join
Join Type: INDEX
Join Cond: (ca_comprehensive_dataset.md5_id = #[1,0])
Runtime Filter Build: #[-1,0]
-> Table Scan on anli_test.ca_comprehensive_dataset [ForceOneCN]
Filter Cond: (ca_comprehensive_dataset.content_type = '品牌实力')
Runtime Filter Probe: ca_comprehensive_dataset.md5_id
-> Index Table Scan on ca_comprehensive_dataset.idx_comprehensive_content_type [ForceOneCN]
Filter Cond: prefix_eq(#[0,0])
Block Filter Cond: prefix_eq(#[0,0])
-> Sort
Sort Key: ca_comprehensive_dataset.created_at DESC
Limit: 200
-> Join
Join Type: INDEX
Join Cond: (ca_comprehensive_dataset.md5_id = #[1,0])
Runtime Filter Build: #[-1,0]
-> Table Scan on anli_test.ca_comprehensive_dataset [ForceOneCN]
Filter Cond: (ca_comprehensive_dataset.content_type = '品牌实力')
Runtime Filter Probe: ca_comprehensive_dataset.md5_id
-> Index Table Scan on ca_comprehensive_dataset.idx_comprehensive_content_type [ForceOneCN]
Filter Cond: prefix_eq(#[0,0])
Block Filter Cond: prefix_eq(#[0,0])
Expected Behavior
No response
Steps to Reproduce
ddl:
SET foreign_key_checks = 0;set experimental_fulltext_index=1;set experimental_ivf_index=1;drop database if exists anli_test;create database anli_test;use anli_test;CREATE TABLE ca_comprehensive_dataset (md5_id varchar(255) NOT NULL,question text DEFAULT NULL,answer json DEFAULT NULL,source_type varchar(255) DEFAULT NULL,content_type varchar(255) DEFAULT NULL,keyword varchar(255) DEFAULT NULL,question_vector vecf64(1024) DEFAULT NULL COMMENT '摘要的向量集',allow_access varchar(511) DEFAULT NULL,allow_identities varchar(512) DEFAULT NULL,delete_flag int DEFAULT NULL,created_at timestamp DEFAULT CURRENT_TIMESTAMP(),updated_at timestamp DEFAULT CURRENT_TIMESTAMP() ON UPDATE CURRENT_TIMESTAMP(),PRIMARY KEY (md5_id),KEY idx_comprehensive_allow_access (allow_access),KEY idx_comprehensive_allow_identities (allow_identities),KEY idx_comprehensive_content_type (content_type));
load data url s3option {'endpoint'='http://cos.ap-guangzhou.myqcloud.com','access_key_id'='xx','secret_access_key'='xx','bucket'='mo-load-guangzhou-1308875761', 'filepath'='mo-big-data/ca_ai_ca_comprehensive_dataset.csv'} into table ca_comprehensive_dataset FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\n' PARALLEL 'TRUE';
Additional information
No response
向量参数通过子查询传入的查询,难以用到目前方式实现的向量索引。
已经改成两个查询来达到需求。