
Chroma DB: Cannot return the results in a contiguous 2D array

Open achammah opened this issue 1 year ago • 0 comments

Issue

Sometimes, when running a similarity search through the Chroma wrapper, I run into the following error: RuntimeError('Cannot return the results in a contigious 2D array. Probably ef or M is too small')

Some background info:

ChromaDB is an embedding database for performing similarity search on high-dimensional vectors. It uses an approximate nearest neighbor (ANN) search algorithm called Hierarchical Navigable Small World (HNSW) to find the items most similar to a given query. The parameters ef and M belong to the HNSW algorithm and affect both search quality and performance.

  1. ef: This parameter controls the size of the dynamic candidate list that HNSW keeps while searching. A higher ef gives more accurate results at the cost of slower queries; a lower ef is faster but less accurate.
  2. M: This parameter determines the number of bi-directional links created for each new element while the HNSW graph is built. A higher M produces a denser graph, which improves search accuracy but increases memory consumption and construction time.

The error message indicates that one or both of these parameters are too small for the current dataset. The underlying HNSW index raises it when a query finds fewer neighbors than the number requested, so the results cannot be packed into a rectangular (contiguous) 2D array. To resolve the error, try increasing ef and M. Note that ef can be raised for searches on an existing index, while M is fixed when the index is built, so changing it means rebuilding the index.

The optimal values for ef and M depend on the specific dataset and use case. You may need to experiment with different values to find the best balance between search accuracy, speed, and memory consumption for your application.
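As a rough sketch of where such values could be set: some chromadb releases read HNSW settings from `hnsw:`-prefixed keys in the collection metadata. The helper name `hnsw_collection_metadata` and the numbers below are illustrative assumptions, not recommended settings, and the key names should be checked against the chromadb version in use.

```python
from typing import Dict


def hnsw_collection_metadata(
    m: int = 32,
    construction_ef: int = 200,
    search_ef: int = 100,
) -> Dict[str, int]:
    """Build a metadata dict carrying HNSW settings (hypothetical helper).

    Assumption: the installed chromadb version honours these "hnsw:*" keys
    when a collection is created. The values are examples only.
    """
    if min(m, construction_ef, search_ef) < 1:
        raise ValueError("HNSW parameters must be positive integers")
    return {
        "hnsw:M": m,                              # links per element, fixed at build time
        "hnsw:construction_ef": construction_ef,  # candidate list size while building
        "hnsw:search_ef": search_ef,              # candidate list size while querying
    }


metadata = hnsw_collection_metadata()

# Possible wiring, not executed here (requires chromadb to be installed):
#   import chromadb
#   client = chromadb.Client()
#   collection = client.create_collection("docs", metadata=metadata)
```

Because M only takes effect when the index is built, a collection that already exists would have to be recreated for a new M to apply.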

My proposal

Three possibilities:

  • Simple: add optional ef and M parameters to similarity_search
  • More complex: build a retry mechanism into similarity_search that tries a range of ef and M values when the error occurs
  • Very complex: compute suitable ef and M values inside similarity_search so that the search always runs with optimal settings
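The second option could look roughly like the sketch below. It is not LangChain's or Chroma's API: `search_with_ef_retry` and the `search_fn(ef)` callable are hypothetical names, and the wrapper only varies ef, on the assumption that M cannot be changed without rebuilding the index.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")

# Substring of the error quoted in this issue; matching on it avoids
# depending on the spelling of the first sentence of the message.
HNSW_ERROR_HINT = "Probably ef or M is too small"


def search_with_ef_retry(
    search_fn: Callable[[int], List[T]],
    ef_values: Iterable[int] = (10, 50, 100, 200, 400),
) -> List[T]:
    """Call search_fn(ef) with each ef in turn until one call succeeds.

    search_fn is a hypothetical callable that runs the similarity search
    with the given ef and returns its results.
    """
    candidates = tuple(ef_values)
    last_error: Optional[RuntimeError] = None
    for ef in candidates:
        try:
            return search_fn(ef)
        except RuntimeError as err:
            if HNSW_ERROR_HINT not in str(err):
                raise  # unrelated failure: do not mask it
            last_error = err
    raise RuntimeError(
        f"search failed for every ef in {list(candidates)}"
    ) from last_error
```

A caller would wrap the actual query in a small function or lambda that applies the given ef before searching. The list of ef values is arbitrary here and would need tuning per dataset.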

achammah avatar Apr 27 '23 16:04 achammah