
SQL support for HNSW index

Open lvca opened this issue 1 year ago • 3 comments

ArcadeDB's HNSW index is pretty powerful, but the lack of SQL support makes it hard to use via the API.

We need new SQL functions/methods to expose the following methods from the index:

  • findNeighborsFromVector(TVector vector, int k): find max K neighbors from a vector of embeddings
  • findNeighborsFromId(TID id, int k): find max K neighbors starting from an id (indexed with the underlying LSMTree)
  • findNeighborsFromVertex(Vertex start, int k): find max K neighbors starting from a vertex
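To make the intended semantics of "find max K neighbors" concrete, here is a minimal Java sketch. It is not the HNSW index and not ArcadeDB code: it is an exhaustive scan over an in-memory map, the class name `BruteForceNeighbors` is hypothetical, ids are plain strings standing in for RIDs, and Euclidean distance is an assumption (the real metric is chosen at index creation). It only illustrates the contract: at most `k` results, closest first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the index: an exhaustive scan, NOT HNSW.
class BruteForceNeighbors {

  // Euclidean distance between two vectors (the metric is an assumption).
  static double distance(float[] a, float[] b) {
    if (a.length != b.length)
      throw new IllegalArgumentException("dimension mismatch");
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // Returns at most k ids, ordered by distance to the query, closest first.
  static List<String> findNeighborsFromVector(Map<String, float[]> items, float[] query, int k) {
    if (k < 0)
      throw new IllegalArgumentException("k must be >= 0");
    List<String> ids = new ArrayList<>(items.keySet());
    ids.sort(Comparator.comparingDouble((String id) -> distance(items.get(id), query)));
    return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
  }

  public static void main(String[] args) {
    // Made-up ids and vectors, for illustration only.
    Map<String, float[]> items = new LinkedHashMap<>();
    items.put("#13:4", new float[] {1, 2, 3});
    items.put("#19:10", new float[] {4, 5, 6});
    items.put("#19:11", new float[] {9, 9, 9});
    System.out.println(findNeighborsFromVector(items, new float[] {1, 2, 4}, 2));
  }
}
```

The id and vertex variants would first look up the stored vector for the given id or vertex and then run the same search.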

The easiest way is to create 3 new SQL functions. Example:

select findNeighborsFromVector( "Word[name,vector]", [1,2,3,4,5,6], 10 )

The Java API returns a List<Pair<Identifiable, ? extends Number>>, where the first element of each pair is the vertex RID and the second is a number (float, double or whatever you picked at index creation) giving the proximity. The list is ordered by proximity, closest first.

In SQL, each entry must be wrapped in a Result with "vertex" and "proximity" properties:

+------------------+---------------------+
| VERTEX           |           PROXIMITY |
+------------------+---------------------+
| #13:4            |                0.12 |
| #19:10           |                0.19 |
+------------------+---------------------+
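The wrapping described above could look roughly like the following Java sketch. It deliberately avoids ArcadeDB classes: `Neighbor` is a hypothetical stand-in for `Pair<Identifiable, ? extends Number>`, a `Map` stands in for the SQL `Result`, and `NeighborRows.toRows` is an illustrative name, not an existing function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NeighborRows {

  // Hypothetical stand-in for Pair<Identifiable, ? extends Number>.
  static final class Neighbor {
    final String rid;
    final Number proximity;

    Neighbor(String rid, Number proximity) {
      this.rid = rid;
      this.proximity = proximity;
    }
  }

  // Wraps each (rid, proximity) pair in a row with "vertex" and "proximity"
  // properties, keeping the input order (closest first).
  static List<Map<String, Object>> toRows(List<Neighbor> neighbors) {
    List<Map<String, Object>> rows = new ArrayList<>();
    for (Neighbor n : neighbors) {
      Map<String, Object> row = new LinkedHashMap<>();
      row.put("vertex", n.rid);
      row.put("proximity", n.proximity);
      rows.add(row);
    }
    return rows;
  }

  public static void main(String[] args) {
    // Same values as the example table above.
    List<Neighbor> neighbors = Arrays.asList(
        new Neighbor("#13:4", 0.12),
        new Neighbor("#19:10", 0.19));
    for (Map<String, Object> row : toRows(neighbors))
      System.out.println(row);
  }
}
```

Keeping the two values as named properties, rather than a positional pair, is what lets an outer query refer to `vertex` and `proximity` by name.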

So you can also traverse the graph starting from embeddings:

select expand( vertex ) from (
  select findNeighborsFromVector( "Word[name,vector]", [1,2,3,4,5,6], 10 )
) where proximity < 0.5

This returns the neighbors (at most 10) whose proximity to the vector is less than 0.5.

lvca, Feb 29 '24 19:02

@gramian pointed out we already have the SQL vectorNeighbors() function: https://docs.arcadedb.com/#_vectorneighbors. However, it only uses the index when you have an ID, not when you have a vector of embeddings.

lvca, Feb 29 '24 19:02