.Net: VectorStore: Add ability to filter results by the similarity score
It would be nice to be able to filter by the similarity search score at the database level.
For example, I might want to return only the records with a score greater than or equal to 0.80. I am able to do this in an Azure Cosmos DB for MongoDB instance by adding a $match stage to the pipeline:
BsonDocument[] pipeline =
[
    // Stage 1: vector search returning the 5 nearest neighbours.
    new BsonDocument
    {
        {
            "$search", new BsonDocument
            {
                {
                    "cosmosSearch", new BsonDocument
                    {
                        { "vector", embeddings },
                        { "path", indexProperty },
                        { "k", 5 }
                    }
                },
                { "returnStoredSource", true },
            }
        }
    },
    // Stage 2: project the stored fields plus the search score.
    new BsonDocument
    {
        {
            "$project", new BsonDocument
            {
                { "id", 1 },
                { "name", 1 },
                { "text", 1 },
                { "url", 1 },
                { "vector", 1 },
                { "searchScore", new BsonDocument { { "$meta", "searchScore" } } }
            }
        }
    },
    // Stage 3: keep only results whose score is at least 0.80.
    new BsonDocument
    {
        {
            "$match", new BsonDocument
            {
                { "searchScore", new BsonDocument { { "$gte", 0.80 } } }
            }
        }
    }
];
Something to keep in mind is that "greater than 0.80" means different things depending on which distance function is being used. For example, if Cosine Similarity is used, 0.8 is similar. If Cosine Distance is used, 0.8 is closer to orthogonal, and "greater than 0.8" in this case means less similar. For Euclidean distance, closer to 0 is more similar and a larger number is less similar.
Any option like this will therefore not have a fixed range or comparison type (e.g. greater than or less than), and the limit that is considered similar will vary depending on the distance function chosen when defining the vector.
Not all databases allow the level of control that MongoDB allows, i.e. { "$gte", 0.80 }. Some only support a limit. Therefore, any option that we support will need to cater for this.
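To make the point about comparison direction concrete, here is a minimal sketch of how a threshold check could flip depending on the distance function. The `DistanceFunction` enum and `ScoreThreshold` helper are hypothetical names made up for illustration; they are not Semantic Kernel types, and a real option would also have to handle stores that only support a limit.

```csharp
using System;

// Hypothetical enum for illustration only; not a Semantic Kernel type.
public enum DistanceFunction
{
    CosineSimilarity,
    CosineDistance,
    EuclideanDistance
}

public static class ScoreThreshold
{
    // Returns true when 'score' is at least as similar as 'threshold'.
    // Similarity functions: larger is more similar, so compare with >=.
    // Distance functions: smaller is more similar, so compare with <=.
    public static bool Passes(DistanceFunction function, double score, double threshold) =>
        function switch
        {
            DistanceFunction.CosineSimilarity => score >= threshold,
            DistanceFunction.CosineDistance => score <= threshold,
            DistanceFunction.EuclideanDistance => score <= threshold,
            _ => throw new ArgumentOutOfRangeException(nameof(function))
        };
}
```

With this sketch, `ScoreThreshold.Passes(DistanceFunction.CosineSimilarity, 0.85, 0.80)` is true, while the same numbers with `DistanceFunction.CosineDistance` give false, because a distance of 0.85 is further away than 0.80.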
I would love to see something like this as well.
I am using the RedisVectorStore implementation and the Score that comes back from VectorizedSearchAsync(...) is always zero. Note that I am using the new Azure Managed Redis offering with the Balanced (B0) SKU. I believe the ranking/sorting is still being done, because results come back sorted according to my expectations; it's just that the score is always 0.
I am using VectorizedSearchAsync to do the equivalent of a semantic cache, and I always get results when I send an embedding in, regardless of how far off the match should be. If I can't tell VectorizedSearchAsync to only return results above a certain score, I'd love for Score to at least be populated with something other than 0 on every call. Then, worst case, I could get the data back, check the Score against a threshold, and treat the result as a hit or miss that way.
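The client-side fallback described above could look roughly like the sketch below. It assumes the score is populated and that a higher score means more similar (e.g. cosine similarity). `ScoredResult<T>` and `SemanticCache.FindHit` are hypothetical stand-ins written for this example, not the Semantic Kernel result type or API.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a search result; not the Semantic Kernel type.
public sealed record ScoredResult<T>(T Record, double? Score);

public static class SemanticCache
{
    // Returns the best-scoring result that meets 'minScore', or null on a miss.
    // Assumes a higher score means more similar (e.g. cosine similarity).
    // Results with a null score are treated as misses.
    public static ScoredResult<T>? FindHit<T>(
        IEnumerable<ScoredResult<T>> results, double minScore) =>
        results
            .Where(r => r.Score is double s && s >= minScore)
            .OrderByDescending(r => r.Score)
            .FirstOrDefault();
}
```

A null return would be treated as a cache miss. For a distance-based score the comparison and sort order would need to be reversed.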
If there are plans for Semantic Cache support directly within SK using C#, so that I don't have to punt over to Python and use LangChain or the RedisVL library, that would be even better.
@scottroot-msft, thanks for reporting the score issue. I have logged a separate bug for it and have a PR out for it: #9900