voyager icon indicating copy to clipboard operation
voyager copied to clipboard

Fewer than expected results were retrieved when querying for len(index) items

Open cvillela opened this issue 1 year ago • 2 comments

This is a noted issue already, but I am opening another card for visibility as the previous one has been open for over 4 months!

My objective would be to find the furthest neighbor in a index from a specific vector.

Calls for querying for N neighbors in an index of length N results in RuntimeError: Fewer than expected results were retrieved There are no NaN's in the set.

cluster_index = Index(
            index.space,
            index.num_dimensions,
            index.M,
            index.ef_construction,
            storage_data_type=index.storage_data_type
        )
        
cluster_index.add_items(
       vectors=index.get_vectors(list(cluster_dict[largest_cluster_key])),
       ids=list(cluster_dict[largest_cluster_key])
)
if np.any(np.isnan(cluster_index.get_vectors(list(cluster_index.ids)))):
            print("Nan found!")

print(f"Len Index {len(cluster_index)}")
neighbors, _ = cluster_index.query(
            vectors=any_vector,
            k=len(cluster_index)
        )

outputs

Len Index 828
RuntimeError: Fewer than expected results were retrieved; only found 825 of 828 requested neighbors.

Is this a parameter tuning problem? Such as any of the "ef" parameters in construction or querying?

Please note that this index also does not contain any mark_deleted() elements

cvillela avatar Feb 28 '24 16:02 cvillela

Hi @cvillela, that error occurs when the HNSW lookup is unable to retrieve the requested number of neighbors. This can happen if the underlying graph is for some reason disconnected. You could try increasing the M parameter at build time in order to increased the connectedness of the graph, or increase ef_construction in order to increase the search depth into the graph at build time.

Having said that, what are you trying to accomplish here by retrieving the entire dataset from the index? Might it be better served by just brute force calculating the distance between the target vector and every item in your dataset? You don't really get any advantage from an approximate nearest neighbors search if you end up searching the entire graph, since the library will calculate whatever distance measurement against every resolved candidate and the target vector anyway.

dylanrb123 avatar Mar 29 '24 18:03 dylanrb123

Hey @dylanrb123 , thank you very much for the response and sorry for the delay.

I am building an implementation of K-Means clustering with the Voyager library. An heuristic approximation for the construction of the clusters suffices for my application, and that is why I chose Voyager for the task.

On a specific step of the iteration, I am aiming to find the furthest neighbor inside a predefined cluster in order to set new centroids. This is the call crashing the program.

But I understand your explanation and, indeed, maybe it would make more sense to brute-force this step of the algorithm. Nevertheless I do believe documentation is a bit abstract in explaining the functionality of the construction parameters.

I will be closing this issue! Thanks again for the reply

cvillela avatar Apr 09 '24 18:04 cvillela