
[Bug]: Query on DuckDB returns less than k results

Open feffy380 opened this issue 2 years ago • 1 comments

What happened?

I consistently get fewer documents than requested when querying a DuckDB collection. For example, querying for 100 documents in a collection of 300+ returns only about 20-30. This seems to be the same bug encountered in #284. I've narrowed the cause down to LocalAPI._db.get_nearest_neighbors returning duplicates. As far as I can tell, the problem is that hnswlib's knn_query doesn't handle small collections well (poor recall due to a disconnected graph structure). I'm not sure what can be done about this other than adding more data, so this issue is mainly here to document the behavior for the next poor soul tearing their hair out.

Versions

Chroma v0.3.21 Python v3.10.10

Relevant log output

# ids returned by get_nearest_neighbors. note the duplicate entries
61f76d2e-48cf-4a70-847a-b25a3c9f165b
61f76d2e-48cf-4a70-847a-b25a3c9f165b
61f76d2e-48cf-4a70-847a-b25a3c9f165b
61f76d2e-48cf-4a70-847a-b25a3c9f165b
61f76d2e-48cf-4a70-847a-b25a3c9f165b
1d3096b7-50c0-44df-bd5e-b394f34426a8
1d3096b7-50c0-44df-bd5e-b394f34426a8
1d3096b7-50c0-44df-bd5e-b394f34426a8
1d3096b7-50c0-44df-bd5e-b394f34426a8
1d3096b7-50c0-44df-bd5e-b394f34426a8
1d3096b7-50c0-44df-bd5e-b394f34426a8
5ba5ef75-133e-478a-b472-51b33857bb81
5ba5ef75-133e-478a-b472-51b33857bb81
5ba5ef75-133e-478a-b472-51b33857bb81
5ba5ef75-133e-478a-b472-51b33857bb81
7c86eb6b-2052-423c-b440-7a5977b5761f
5ba5ef75-133e-478a-b472-51b33857bb81
5ba5ef75-133e-478a-b472-51b33857bb81
7c86eb6b-2052-423c-b440-7a5977b5761f
ff68c219-8f42-4e46-80af-9e6cb36838fe
ff68c219-8f42-4e46-80af-9e6cb36838fe
5ba5ef75-133e-478a-b472-51b33857bb81
7c86eb6b-2052-423c-b440-7a5977b5761f
5ba5ef75-133e-478a-b472-51b33857bb81
7c86eb6b-2052-423c-b440-7a5977b5761f
ff68c219-8f42-4e46-80af-9e6cb36838fe
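
To check whether a result list shows the same symptom, it helps to count distinct ids rather than eyeball the log. The sketch below is only illustrative: `summarize_neighbor_ids` is a hypothetical helper, not part of Chroma, and the sample ids are placeholders rather than the real ones above.

```python
from collections import Counter


def summarize_neighbor_ids(ids):
    """Return (number of distinct ids, {id: count} for ids seen more than once)."""
    counts = Counter(ids)
    duplicates = {id_: n for id_, n in counts.items() if n > 1}
    return len(counts), duplicates


# Placeholder ids standing in for the output of get_nearest_neighbors.
sample_ids = ["a", "a", "b", "c", "c", "c"]
unique_count, duplicates = summarize_neighbor_ids(sample_ids)
print(f"requested {len(sample_ids)}, distinct {unique_count}, duplicated: {duplicates}")
```

If the distinct count is well below the number of ids returned, the shortfall in query results comes from duplicates being collapsed rather than from missing documents.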

feffy380 avatar May 03 '23 09:05 feffy380

There are some index parameters documented here: https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md They explicitly recommend choosing a higher M for word embeddings, so it might be worth fiddling with these settings.

They can be configured like so:

from langchain.vectorstores import Chroma  # assuming the LangChain wrapper

vectorstore = Chroma(
    collection_metadata={"hnsw:M": 48, "hnsw:construction_ef": 100},
    # ... other arguments
)

I think the collection has to be regenerated afterward, since these are index construction parameters.
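
For anyone experimenting with several values, a small helper that builds the metadata dict keeps the key names in one place. This is a sketch under assumptions: `hnsw_collection_metadata` is a hypothetical helper, and the `hnsw:search_ef` key is my recollection of Chroma's settings rather than something confirmed in this thread, so verify it against your Chroma version.

```python
def hnsw_collection_metadata(m, construction_ef, search_ef=None):
    """Build a collection metadata dict for the HNSW index settings.

    "hnsw:M" and "hnsw:construction_ef" are the keys used above;
    "hnsw:search_ef" is an assumed key and is only added when requested.
    """
    if m <= 0 or construction_ef <= 0:
        raise ValueError("m and construction_ef must be positive")
    metadata = {"hnsw:M": m, "hnsw:construction_ef": construction_ef}
    if search_ef is not None:
        if search_ef <= 0:
            raise ValueError("search_ef must be positive")
        metadata["hnsw:search_ef"] = search_ef
    return metadata


metadata = hnsw_collection_metadata(m=48, construction_ef=100)
print(metadata)
```

The resulting dict can be passed as `collection_metadata` in the snippet above.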

feffy380 avatar May 03 '23 10:05 feffy380

@feffy380 Would you like to keep this open? Did adjusting the settings work for you?

jeffchuber avatar May 08 '23 16:05 jeffchuber

@jeffchuber I had to start over with a fresh db and haven't had a chance to test different settings. I'll close this, though, since it's technically an upstream issue.

feffy380 avatar May 08 '23 22:05 feffy380