sql
sql copied to clipboard
[FEATURE] Add K-NN vector and query support to SQL Plugin
Is your feature request related to a problem?
The k-nn, or k-nearest neighbours plugin enables users to define knn_vector objects and perform nearest-neighbour queries on that objects in OpenSearch.
What solution would you like?
We want to add the following features:
knn_vectortype to the list of supported OS-SQL types- Add
Approximate k-NNto the list of supported OpenSearch query functions - Add
Script Score k-NNto the list of supported OpenSearch scripted functions, using thescriptfield. - Add
k-NN’s Painless extensionsas part of the script score k-nn
Questions:
TODO: determine if we can use theScriptScoreFunctionprovided by OpenSearch.- Determine which syntax 'makes sense' for each language. It seems simpler to use the
whereclause in PPL (with multiple `where clauses) to denote pre- and post- filter processing of the knn function. This makes less sense in SQL - and we may consider using the window functions or ordering the WHERE clause on purpose.
Proposed PPL syntax
knnandknn_scorefalls into thewherecommand. We can chain multiplewherecommands for pre-filter and post-filter processing.
where <pre-filter expression> | where knn_score(field, vector, k) | where <post-filter expression> | top size
Proposed SQL Syntax:
Option 1) use OpenSearch functions to call knn in the ORDER BY clause.
WHERE <pre-filter expression> ORDER BY knn(field, vector, k, <post-filter expression>) LIMIT size
- Approximate knn:
ORDER BY knn(field, vector, k) - Approximate knn with pre-filter query:
WHERE expression ORDER BY knn(field, vector, k) - Approximate knn with post-filter query:
ORDER BY knn(field, vector, k, expression)
WHERE <pre-filter expression> AND knn_score(field, vector, k) AND <post-filter expression> LIMIT size
Option 2) USe pgvector syntax to map to OpenSearch DSL functions
see: pgvector
Do you have any additional context?
reference:
- https://opensearch.org/docs/latest/search-plugins/knn/index/
- https://github.com/opensearch-project/sql/issues/1412
Proposed Schedule
- [ ] Add knn_vector datatype to OS-SQL and OS-PPL
- [ ] Add exact knn_score to PPL language
- [ ] Add approximate knn to PPL language with filtering
- [ ] Port syntax to SQL
- [ ] Add pgvector syntax to SQL
@penghuo @dai-chen is this on the roadmap/priority?
Instead of making PGVector a follow up, let's make sure it gets into the initial release even if that means pushing the release out.
@brijos Thanks for your comment. Does this also block a PPL release?
Thanks @acarbonetto. Let's not block on the PPL work if that is ready sooner. From a release announcement perspective, we'll announce them at the same time.
Perfect. I'll update the proposal to do a PPL release, SQL release with pgvector, then it will be marked completed.
Hi @acarbonetto , may I ask is there any progress on this issue? I am currently working on developing this feature and have already created a prototype that supports basic k-NN query. I would be happy to contribute to this feature if needed. Additionally, I have a few questions regarding this issue.
It seems simpler to use the where clause in PPL (with multiple `where clauses) to denote pre- and post- filter processing of the knn function. This makes less sense in SQL - and we may consider using the window functions or ordering the WHERE clause on purpose.
Could I kindly ask why you believe placing the knn function in the WHERE clause is illogical? IMHO, knn query and full-text quries are similar in OpenSearch, both return only highly rated documents for the user, we don't need to put it in order by clause.
WHERE
ORDER BY knn(field, vector, k, ) LIMIT size
AFAIW, the filter used in KNN queries is applied during the k-NN search. It is a pre-filter, not a post-filter. The post-filter can be achieved by boolean query.
Hey Folks
We have create our in-house SQL which serves many search use cases . We have introduced SQL for semantic search too.
Syntax: KNN (fieldName, EmbeddingList= [], K, Prefilters?)
wherein PreFilters are optional and can be any type of lexical expression combined that can by logical operators Like "AND"/ "OR"
Example- SELECT uuid WHERE KNN (field1, [1.0, 4.0, 6.3, 8.1], 5, country = "INDIA" AND rating > 3.4)
Since it is already been used at Uber Scale.