
Add a similarity score threshold

Open VictorXQWang opened this issue 1 year ago • 1 comments

A great library, making it possible for me to work on 100K embeddings on my laptop!

I have a couple of suggestions.

Other than top_k, is it possible to return all items having a cosine similarity score above a given threshold, e.g., 0.99?

Currently, it is not possible to return all pairwise similarities, due to the following restriction:

if abs_top_k >= n_rows_right:
    raise ValueError(
        f"The number of requested similar items (top_k={abs_top_k}) must be less than the "
        f"number of items available for comparison ({n_rows_right})"
    )

I think "abs_top_k >= n_rows_right" can be changed to "abs_top_k > n_rows_right". Or alternatively, when abs_top_k >= n_rows_right, set abs_top_k to n_rows_right and provide a warning.

Currently, if other_embeddings has only one row, this check always raises an error.
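The second option (clamp and warn) could be sketched roughly as follows. This is only an illustration of the suggestion, not ChunkDot code: `clamp_top_k` is a hypothetical helper name, and it assumes that a negative top_k is meaningful (the snippet above compares abs_top_k), so the sign is preserved.

```python
import warnings


def clamp_top_k(top_k: int, n_rows_right: int) -> int:
    """Return top_k with its magnitude limited to n_rows_right.

    Hypothetical helper, not part of ChunkDot. The sign of top_k is kept.
    """
    if top_k == 0:
        raise ValueError("top_k must be non-zero")
    abs_top_k = abs(top_k)
    if abs_top_k > n_rows_right:
        # Warn instead of raising, then fall back to "all available items".
        warnings.warn(
            f"top_k={abs_top_k} exceeds the number of items available for "
            f"comparison ({n_rows_right}); using {n_rows_right} instead.",
            stacklevel=2,
        )
        abs_top_k = n_rows_right
    return abs_top_k if top_k > 0 else -abs_top_k
```

With this behaviour, a single-row other_embeddings with top_k=5 would be handled as top_k=1 plus a warning, and top_k equal to the number of rows would pass through unchanged.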

After this change, it will be possible to return all pairwise similarities and users can screen them based on a threshold. This is not possible currently.

VictorXQWang avatar Apr 03 '24 04:04 VictorXQWang

@VictorXQWang thanks a lot for your suggestions.

  1. Is it possible to set a min similarity threshold?

The problem this library addresses is the memory cost of keeping all similarity values. It works by knowing in advance how to partition the matrix given the requested result size, "top_k". If a "min_similarity_threshold" were used instead, I could not know in advance how many similarity matrix entries would be non-zero in the result (that is, above the threshold), so I could not calculate in advance the partitions needed for the calculation to fit in memory. It might work if few similarities exceed the threshold, but it might not. That would still be an improvement over other methods, which are certain not to fit, but I am uneasy about adding functionality that I cannot be sure will succeed.
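To make the trade-off concrete, here is a minimal NumPy sketch of a chunked threshold filter. It is not how ChunkDot works internally; the function name `cosine_similarity_above` and the `chunk_size` parameter are assumptions made for illustration, and the input is assumed to have at least one row. Each dense block has a bounded size, but the number of entries kept per block depends on the data, which is the part that cannot be planned for in advance.

```python
import numpy as np


def cosine_similarity_above(embeddings, threshold, chunk_size=1024):
    """Return (rows, cols, values) for all pairs with cosine similarity >= threshold.

    Illustrative only; not part of ChunkDot.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as zeros instead of dividing by zero
    x = x / norms

    rows, cols, values = [], [], []
    for start in range(0, x.shape[0], chunk_size):
        # Dense block of shape (<= chunk_size, n_rows): its memory use is bounded.
        block = x[start:start + chunk_size] @ x.T
        r, c = np.nonzero(block >= threshold)
        # The number of hits is data dependent: anywhere from 0 to block.size.
        rows.append(r + start)
        cols.append(c)
        values.append(block[r, c])
    return np.concatenate(rows), np.concatenate(cols), np.concatenate(values)
```

For n embeddings the output holds between n entries (only the self-similarities, when the threshold is at most 1 and no row is all zeros) and n² entries (every pair passes), so the total memory needed is only known after the computation has run.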

  2. Not able to return all similarities

The reasoning is that I would advise using other methods (such as sklearn's cosine similarity) here, as ChunkDot offers no advantage in this scenario and would probably be slower. I understand that it is not a great user experience to juggle different functions depending on the case. Perhaps ChunkDot could fall back to sklearn's cosine similarity function in this scenario, but I don't like adding a dependency as heavy as sklearn for a single function.

Really good points though... Let me think about it. And please let me know what you think and/or if you would like to contribute!

rragundez avatar Apr 03 '24 10:04 rragundez