sql icon indicating copy to clipboard operation
sql copied to clipboard

[FEATURE] Add K-NN vector and query support to SQL Plugin

Open acarbonetto opened this issue 2 years ago • 7 comments

Is your feature request related to a problem?

The k-nn, or k-nearest neighbours plugin enables users to define knn_vector objects and perform nearest-neighbour queries on that objects in OpenSearch.

What solution would you like?

We want to add the following features:

  • knn_vector type to the list of supported OS-SQL types
  • Add Approximate k-NN to the list of supported OpenSearch query functions
  • Add Script Score k-NN to the list of supported OpenSearch scripted functions, using the script field.
  • Add k-NN’s Painless extensions as part of the script score k-nn

Questions:

  • TODO: determine if we can use the ScriptScoreFunction provided by OpenSearch.
  • Determine which syntax 'makes sense' for each language. It seems simpler to use the where clause in PPL (with multiple `where clauses) to denote pre- and post- filter processing of the knn function. This makes less sense in SQL - and we may consider using the window functions or ordering the WHERE clause on purpose.

Proposed PPL syntax

  • knn and knn_score falls into the where command. We can chain multiple where commands for pre-filter and post-filter processing.
where <pre-filter expression> | where knn_score(field, vector, k) | where <post-filter expression> | top size

Proposed SQL Syntax:

Option 1) use OpenSearch functions to call knn in the ORDER BY clause.

WHERE <pre-filter expression> ORDER BY knn(field, vector, k, <post-filter expression>) LIMIT size
  • Approximate knn: ORDER BY knn(field, vector, k)
  • Approximate knn with pre-filter query: WHERE expression ORDER BY knn(field, vector, k)
  • Approximate knn with post-filter query: ORDER BY knn(field, vector, k, expression)
WHERE <pre-filter expression> AND knn_score(field, vector, k) AND <post-filter expression> LIMIT size

Option 2) USe pgvector syntax to map to OpenSearch DSL functions

see: pgvector

Do you have any additional context?

reference:

  • https://opensearch.org/docs/latest/search-plugins/knn/index/
  • https://github.com/opensearch-project/sql/issues/1412

Proposed Schedule

  • [ ] Add knn_vector datatype to OS-SQL and OS-PPL
  • [ ] Add exact knn_score to PPL language
  • [ ] Add approximate knn to PPL language with filtering
  • [ ] Port syntax to SQL
  • [ ] Add pgvector syntax to SQL

acarbonetto avatar May 08 '23 21:05 acarbonetto

@penghuo @dai-chen is this on the roadmap/priority?

derek-ho avatar May 18 '23 14:05 derek-ho

Instead of making PGVector a follow up, let's make sure it gets into the initial release even if that means pushing the release out.

brijos avatar Jul 28 '23 15:07 brijos

@brijos Thanks for your comment. Does this also block a PPL release?

acarbonetto avatar Jul 28 '23 16:07 acarbonetto

Thanks @acarbonetto. Let's not block on the PPL work if that is ready sooner. From a release announcement perspective, we'll announce them at the same time.

brijos avatar Jul 28 '23 17:07 brijos

Perfect. I'll update the proposal to do a PPL release, SQL release with pgvector, then it will be marked completed.

acarbonetto avatar Jul 28 '23 17:07 acarbonetto

Hi @acarbonetto , may I ask is there any progress on this issue? I am currently working on developing this feature and have already created a prototype that supports basic k-NN query. I would be happy to contribute to this feature if needed. Additionally, I have a few questions regarding this issue.

It seems simpler to use the where clause in PPL (with multiple `where clauses) to denote pre- and post- filter processing of the knn function. This makes less sense in SQL - and we may consider using the window functions or ordering the WHERE clause on purpose.

Could I kindly ask why you believe placing the knn function in the WHERE clause is illogical? IMHO, knn query and full-text quries are similar in OpenSearch, both return only highly rated documents for the user, we don't need to put it in order by clause.

WHERE ORDER BY knn(field, vector, k, ) LIMIT size

AFAIW, the filter used in KNN queries is applied during the k-NN search. It is a pre-filter, not a post-filter. The post-filter can be achieved by boolean query.

bugmakerrrrrr avatar Jul 03 '24 13:07 bugmakerrrrrr

Hey Folks We have create our in-house SQL which serves many search use cases . We have introduced SQL for semantic search too. Syntax: KNN (fieldName, EmbeddingList= [], K, Prefilters?) wherein PreFilters are optional and can be any type of lexical expression combined that can by logical operators Like "AND"/ "OR" Example- SELECT uuid WHERE KNN (field1, [1.0, 4.0, 6.3, 8.1], 5, country = "INDIA" AND rating > 3.4)

Since it is already been used at Uber Scale.

aparajita31pandey avatar Jul 22 '24 16:07 aparajita31pandey