surrealdb
surrealdb copied to clipboard
Bug: Vector NN search count does not respect WHERE clause
Describe the bug
I want to be able to perform query against vector field and some scalar fields at the same time. Say, I have vector V. I want to find closest 10 points, that have some flag set to true.
It doesn't seem to be possible to utilize vector index for such things. If we query
SELECT
id, flag, v
FROM
target
WHERE
flag = true &&
v <|10|> $point
;
it will take the closest 10 points and then filter those by flag. This way, I end up with less then 10 points.
Same issue comes into play when I try to paginate the results. If I want to specify START
, vector query must take it into account, and I need to query <|50|>
with START 40
if we want to take 10 records with the offset of 40.
It is impossible to build composite index on multiple fields if one of the fields is intended to be used for vector search.
Steps to reproduce
DELETE FROM pts;
DEFINE INDEX mt_pt1 ON pts FIELDS point MTREE DIMENSION 1;
INSERT INTO pts [
{ id: pts:1, point: [ 1f ], flag: true },
{ id: pts:2, point: [ 2f ], flag: false },
{ id: pts:3, point: [ 3f ], flag: true },
{ id: pts:4, point: [ 4f ], flag: false },
{ id: pts:5, point: [ 5f ], flag: true },
{ id: pts:6, point: [ 6f ], flag: false },
{ id: pts:7, point: [ 7f ], flag: true }
];
LET $pt = [4.5f];
SELECT
id,
i,
flag,
vector::similarity::cosine(point, $pt) AS similarity
FROM
pts
WHERE
flag = true &&
point <|2|> $pt
ORDER BY
similarity DESC PARALLEL;
SELECT
id,
i,
flag,
vector::similarity::cosine(point, $pt) AS similarity
FROM
pts
WHERE
flag = true &&
point <|2|> $pt
ORDER BY
similarity DESC
EXPLAIN
;
outputs
[
{
flag: true,
i: NONE,
id: pts:5,
similarity: 1
}
]
-------- Query 6 (200.001µs) --------
[
{
detail: {
plan: {
index: 'mt_pt1',
operator: '<2>',
value: [
4.5f
]
},
table: 'pts'
},
operation: 'Iterate Index'
},
{
detail: {
type: 'Store'
},
operation: 'Collector'
}
]
Expected behaviour
-------- Query 5 (299.999µs) --------
[
{
flag: true,
i: NONE,
id: pts:5,
similarity: 1
},
{
flag: true,
i: NONE,
id: pts:3,
similarity: 1
}
]
SurrealDB version
https://surrealist.app/
Contact Details
Is there an existing issue for this?
- [X] I have searched the existing issues
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Assigning to @emmanuel-keller who is able to answer this better.
The result you get is consistent with the current query logic, in the sense that each predicate returns a subset of the records, and the result is the intersection of these two subsets:
-
flag = true
returns[pts:1, pts:3, pts5, pts7]
-
point <|2|> $pt
returns[pts:4, pts5]
The intersection of both returns is [pts:5]
, which is correct from a technical point of view.
That said, it is true that we don't currently have a way to filter the KNN operation so you can have the N results matching the filter.
As a workaround, you could store two tables. One table would store points where the flag is true, and the other would store the points where the flag is false.
DELETE FROM pts_true;
DELETE FROM pts_false;
DEFINE INDEX mt_pt1 ON pts_true FIELDS point MTREE DIMENSION 1;
DEFINE INDEX mt_pt1 ON pts_false FIELDS point MTREE DIMENSION 1;
INSERT INTO pts_true [
{ id: pts_true:1, point: [ 1f ], flag: true },
{ id: pts_true:3, point: [ 3f ], flag: true },
{ id: pts_true:5, point: [ 5f ], flag: true },
{ id: pts_true:7, point: [ 7f ], flag: true }
];
INSERT INTO pts_false [
{ id: pts_false:2, point: [ 2f ], flag: false },
{ id: pts_false:4, point: [ 4f ], flag: false },
{ id: pts_false:6, point: [ 6f ], flag: false },
];
LET $pt = [4.5f];
SELECT
id,
vector::similarity::cosine(point, $pt) AS similarity
FROM
pts_true
WHERE
point <|2|> $pt
ORDER BY
similarity DESC;
That would work for this simple example, but I agree that it would not be suitable for something more complex involving more intricate filters.
To meet this requirement we could introduce the following syntax:
SELECT
id,
flag,
vector::similarity::cosine(point, $pt) AS similarity
FROM
pts
WHERE
flag = true &&
point <||> $pt
ORDER BY
similarity DESC
LIMIT 2
In this case, the KNN operator would not limit the result and would only stop providing results once the limit is reached. That would be compatible with the way SurrealDB executes queries, allowing for any complexity in the filtering as well as pagination.
Would that work for you / Would you be happy with this syntax?
Yes, this would be amazing, thank you! And thank you for mentioning it in the stream :)
Will it be the only syntax instead of the current one? Having it work only by LIMIT may simplify it
We have started working on the implementation. For this to work efficiently, we need to make sure that the query is primarily ordered by the KNN distance. So, we will add a vector::distance::knn()
function that will be mandatory to be placed in the ORDER BY clause. It will also return the computed distance, so the distance does not need to be recomputed.
SELECT
id,
flag,
vector::distance::knn() AS distance
FROM
pts
WHERE
flag = true &&
point <||> $pt
ORDER BY
vector::distance::knn() DESC
LIMIT 2
I think both syntaxes may coexist.
Awesome! Will it be possible to order by distance alias? Will it work for cosine? And will it be possible to build index on multiple fields, e.g. vector + boolean?