sqlite-vec icon indicating copy to clipboard operation
sqlite-vec copied to clipboard

SELECT with JOIN incompatible with standard LIMIT clause

Open quantatic opened this issue 1 year ago • 9 comments

I am attempting to use the following script, but run into issues with the LIMIT clause.

.echo on
.load sqlite-vec/dist/vec0.so

CREATE VIRTUAL TABLE vectors using vec0(
    id              INTEGER PRIMARY KEY NOT NULL,
    vector          FLOAT[4]
);

INSERT INTO vectors (id, vector) VALUES (1, '[1, 2, 3, 4]');
INSERT INTO vectors (id, vector) VALUES (2, '[-1, -2, -3, -4]');
INSERT INTO vectors (id, vector) VALUES (3, '[1, 5, 2, 7]');

CREATE TABLE vectors_metadata(
    id              INTEGER PRIMARY KEY NOT NULL,
    vector_id       INTEGER UNIQUE NOT NULL,
    other_metadata  INT
);

INSERT INTO vectors_metadata (id, vector_id, other_metadata) VALUES (1, 1, 55);
INSERT INTO vectors_metadata (id, vector_id, other_metadata) VALUES (2, 2, 22);
INSERT INTO vectors_metadata (id, vector_id, other_metadata) VALUES (3, 3, -5);

SELECT vectors_metadata.other_metadata
    FROM vectors_metadata
    INNER JOIN vectors ON vectors_metadata.vector_id = vectors.id
    WHERE vectors.vector MATCH '[1, 2, 5, 4]'
    LIMIT 2;

Upon running this with

sqlite3 :memory: < script.sqlite

I receive the following error:

Parse error near line 23: A LIMIT or 'k = ?' constraint is required on vec0 knn queries.

I see another similar issue was opened in #41, which suggests upgrading to a sqlite version >=3.42. I am running a version higher than this, so I suspect this is a separate issue.

$ sqlite3 --version
3.46.1 2024-08-13 09:16:08 c9c2ab54ba1f5f46360f1b4f35d849cd3f080e6fc2b6c60e91b16c63f69aalt1 (64-bit)

I have tested this with the following tagged releases, (and the head of main at the time of posting this issue), and receive the same error with all commits.

  • v0.1.1
  • v0.1.2-alpha.2
  • v0.1.2-alpha.7
  • d529eb09d16bafc612e9e5fa0de6d7924dea4e5e

quantatic avatar Sep 03 '24 05:09 quantatic

Thanks for the detailed report. A workaround is to use k = 2 in this specific query, like so:

SELECT vectors_metadata.other_metadata
FROM vectors_metadata
INNER JOIN vectors ON vectors_metadata.vector_id = vectors.id
WHERE vectors.vector MATCH '[1, 2, 5, 4]'
  AND k = 2;

Or perform the KNN query in a separate CTE step:

with knn_matches as (
    select id
    from vectors
    where vector match '[1, 2, 5, 4]'
      and k = 2
)
SELECT vectors_metadata.other_metadata
FROM knn_matches
INNER JOIN vectors_metadata ON vectors_metadata.vector_id = knn_matches.id;

In the second query, the k = 2 can be substituted with limit 2 in at least SQLite 3.46, maybe as early as 3.42.

I'm not 100% sure why the single query join doesn't work with the LIMIT clause. I fear this may be an edge case where the SQLite optimizer doesn't properly pass over the LIMIT constraint to the virtual table because of the JOIN.

Since the CTE works as expected, I'm not sure how big of a bug this is. In general I'm trying to only use the k = 999 syntax over the LIMIT 999 syntax because there are weird subtle errors with LIMIT on virtual tables, like this odd bug. If I track down the root cause I can file a bug with the SQLite team, but it might be hard to track down .Personally I like the CTE approach anyway since it clearly separate the KNN search from the metadata JOINs, but that's just personal preference

asg017 avatar Sep 03 '24 06:09 asg017

Much appreciated for the information. In my particular case, the k = <n> approach doesn't quite do what I want when I add further filters, for instance:

SELECT vectors_metadata.other_metadata
FROM vectors_metadata
INNER JOIN vectors ON vectors_metadata.vector_id = vectors.id
WHERE vectors.vector MATCH '[1, 2, 5, 4]'
  AND k = 2
  AND vectors_metadata.other_metadata > 0;

In this case, the k=2 filter appears to be applied first (getting the two closest matches), and only then are further filters applied. This means that when k=2 is used, for instance, the query may return <2 rows even when two or more valid rows exist, if the two closest vectors (or joined rows) fail to pass the additional conditions. For a more concrete example, with the same database setup as above, I run the following query:

SELECT vectors_metadata.other_metadata
FROM vectors_metadata
INNER JOIN vectors ON vectors_metadata.vector_id = vectors.id
WHERE vectors.vector MATCH '[1, 5, 2, 7]'
    AND k = 2
    AND vectors_metadata.other_metadata > 0;

I expect this query to return the values [55, 22], however the only row returned is [55].

Is there a way to get the behavior I'm looking for in this case with the k = <n> syntax? This may answer the question posed here as well.

Cheers :smile:

quantatic avatar Sep 03 '24 06:09 quantatic

Yes, I was posing a similar question, thank you for being more direct here.

In comparing to a solution like vectorlite, I'm trying to see how sqlite-vec can perform this example: https://github.com/1yefuwang1/vectorlite/blob/main/examples/metadata_filter.py

The workaround here is setting a very big K/LIMIT but I'm curious on the options.

davidmezzetti avatar Sep 03 '24 14:09 davidmezzetti

It seems this example SQL can support this use case.

SELECT id, distance FROM vectors v
WHERE vector MATCH '[1, 2, 5, 4]' AND id in (SELECT id FROM vectors_metadata o WHERE other_metadata > 0)
LIMIT 2

davidmezzetti avatar Sep 03 '24 15:09 davidmezzetti

It seems this example SQL can support this use case.

SELECT id, distance FROM vectors v
WHERE vector MATCH '[1, 2, 5, 4]' AND id in (SELECT id FROM vectors_metadata o WHERE other_metadata > 0)
LIMIT 2

It will work but has 2 problem: 1, what if the ids filtered by the second select are too many? 2, there is no way to get the values in other columns in a single query, we have to query again in another table then combine the results.

The FTS4 is also using virtual table and it doesn't have this issue, the fts4 virtual table can join other tables.

bellyjuice avatar Sep 24 '24 09:09 bellyjuice

Same problem here:

self._cursor.execute(
            f"""
                SELECT file_path, commit_hash, code
                FROM {FUNCTIONS_TABLE}
                JOIN {EMBEDDINGS_TABLE} ON
                    {FUNCTIONS_TABLE}.embedding_id = {EMBEDDINGS_TABLE}.rowid
                WHERE
                    {EMBEDDINGS_TABLE}.embedding MATCH ?
                    AND k = ?
                    AND {FUNCTIONS_TABLE}.project_id = ?
                ORDER BY distance
            """,
            (self._project_id, top_k, query_vector),
        )

If there is no way to make a JOIN, then there is no way to fetch anything but the actual vectors. Which in my case is pretty point less.

The workaround here is setting a very big K/LIMIT but I'm curious on the options.

Yes, in my case I could set k to 10 * top_k and filter the results client side. But this will lead to very inconsistent results.

JMLX42 avatar Jan 08 '25 19:01 JMLX42

@JMLX42 Use the new Metadata columns in vec0 tables feature like so:

create virtual table vec_embeddings using vec0(
  embedding float[1024],
  project_id integer partition key
);

In this case, project_id is a partition key column, since it appears you'll be using that as a filter very often. You can also use it as a metadata column with project_id integer, but in this very specific case, partition keys will likely be more performant.

Then to query, all you need would be:

self._cursor.execute(
            f"""
                SELECT file_path, commit_hash, code
                FROM {EMBEDDINGS_TABLE}
                LEFT JOIN {FUNCTIONS_TABLE} ON
                    {FUNCTIONS_TABLE}.embedding_id = {EMBEDDINGS_TABLE}.rowid
                WHERE
                    {EMBEDDINGS_TABLE}.embedding MATCH ?
                    AND k = ?
                    AND {EMBEDDINGS_TABLE}.project_id = ?
                ORDER BY distance
            """,
            (self._project_id, top_k, query_vector),
        )

asg017 avatar Jan 08 '25 19:01 asg017

Use the new Metadata columns in vec0 tables feature like so:

@asg017 in my case I ended up doing a separate database per project:

https://gitlab.com/lx-industries/wally-the-wobot/wally/-/blob/b16bafc6dbf529ab08e2dbcd06bccbe0f25920f3/wally/index.py#L51

https://gitlab.com/lx-industries/wally-the-wobot/wally/-/blob/b16bafc6dbf529ab08e2dbcd06bccbe0f25920f3/wally/index.py#L218

It makes sense anyway for multiple reasons.

But I might have to investigate the metadata solution anyway since I'll have to split long documents/functions into smaller chunks to make them pass the embedding model context length limitation.

JMLX42 avatar Jan 15 '25 09:01 JMLX42

It seems this example SQL can support this use case.

SELECT id, distance FROM vectors v WHERE vector MATCH '[1, 2, 5, 4]' AND id in (SELECT id FROM vectors_metadata o WHERE other_metadata > 0) LIMIT 2

interestingly, with not in, it doesn't seem to work: https://github.com/asg017/sqlite-vec/issues/116#issuecomment-3196572876

simmac avatar Aug 18 '25 12:08 simmac