Different similarity results when using text-embedding-3-small or text-embedding-3-large models
Context / Scenario
For the same document and question, the text-embedding-3-small and text-embedding-3-large models return matches with noticeably lower relevance scores than the text-embedding-ada-002 model.
What happened?
I'm using the code available at https://github.com/marcominerva/KernelMemoryService with SimpleVectorDb. I have imported the file Taggia.pdf, which is the PDF of the Italian Wikipedia page about the town of Taggia, Italy. Then I have searched for "Quante persone vivono a Taggia?" (in English: "How many people live in Taggia?").
If I use the text-embedding-ada-002 model and dig into the source code of SimpleVectorDb,
https://github.com/microsoft/kernel-memory/blob/d127063db78943739397a952c0f19730306bfdab/service/Core/MemoryStorage/DevTools/SimpleVectorDb.cs#L115-L121
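(For context, the relevance value computed there and compared against minRelevance is, as far as I understand, a vector similarity between the question embedding and each stored chunk embedding. A minimal sketch of that computation, assuming plain cosine similarity over float vectors; the actual implementation in the linked file may differ.)

```csharp
// Minimal sketch, NOT the actual SimpleVectorDb code: relevance is assumed here
// to be the cosine similarity between the question embedding and a chunk embedding.
static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}
```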
With that model, I obtain this:
However, if I use text-embedding-3-small (I have of course deleted the previous memories and re-imported the document), with the same question I get:
So, with these models I need to change the minRelevance parameter I use for my query. With text-embedding-ada-002 I use a value of 0.75, while with the newer models it seems that anything greater than about 0.5 is good. Do you agree?
NOTE: I get similar results also with Qdrant.
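For reference, this is roughly how I pass the threshold when querying; a minimal sketch, assuming the minRelevance parameter exposed by IKernelMemory.AskAsync (exact builder/extension names may differ across Kernel Memory versions):

```csharp
// Hedged sketch of how I set the threshold; builder/extension names are
// assumptions and may differ across Kernel Memory versions.
using Microsoft.KernelMemory;

var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY"))
    .WithSimpleVectorDb()
    .Build<MemoryServerless>();

// 0.75 worked well with text-embedding-ada-002;
// with text-embedding-3-small/large I have to drop it to roughly 0.5.
var answer = await memory.AskAsync(
    "Quante persone vivono a Taggia?",
    minRelevance: 0.5);
```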
Importance
edge case
Platform, Language, Versions
Kernel Memory v0.35.240318.1
Relevant log output
No response
I think that's expected behavior. Bigger and newer models capture more details and understand content better. Something that might seem relevant to ada-002 might be less relevant to the other models, and the opposite can happen too. In general, when switching models it's recommended to "fine tune" thresholds, prompts and other "semantic" settings as well. Similar scenarios occur with text generation when moving from GPT-3.5 to GPT-4, and to other models.

It's similar to changing an image/sound/video compression algorithm at the core of a game: you notice different quality, performance and artifacts, and you need to revisit settings and requirements. We briefly called out this topic last year at //build, along with the need for a new generation of dev tools to measure AI behavior. It's still early days, with some options for prompt fine-tuning; I haven't seen anything for embeddings yet though.
Thank you @dluc for the answer. Now I'm running into some weird situations in which a question that had a similarity of 0.79 with text-embedding-ada-002 now has only 0.33 with text-embedding-3-small and 0.27 with text-embedding-3-large, so it is very difficult to set a valid threshold.
Now I'm trying to increase MaxMatchesCount in the SearchClientConfig and MaxTokens used by Text Generation.
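Concretely, something along these lines; a sketch based on what I believe the SearchClientConfig properties and builder extensions look like in this version (names may differ):

```csharp
// Sketch of the settings I'm experimenting with; property and extension names
// are assumptions for this Kernel Memory version and may differ in others.
using Microsoft.KernelMemory;

var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY"))
    .WithSearchClientConfig(new SearchClientConfig
    {
        MaxMatchesCount = 10, // let more chunks reach the prompt
        AnswerTokens = 500    // give text generation more room for the answer
    })
    .Build<MemoryServerless>();
```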
That's a pretty big difference. Are the chunks the same? Looking at the content, which model do you think is "right"? E.g. is the text actually relevant, as ada-002 says, or not so much, as 3-small says?
Yes, the chunks are the same, and the text is relevant, as text-embedding-ada-002 says. For example, among the others I have a chunk (about 1000 tokens) that contains something like "and near the town there is Vivaldi Palace, built in 1458", and I ask "When was the Vivaldi Palace built?":

- text-embedding-ada-002 tells me that the chunk has a similarity of 0.79 with my question
- text-embedding-3-small returns a similarity of 0.33
- text-embedding-3-large returns the lowest, 0.27
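Since the absolute scores shift so much between models, I'm now picking the threshold empirically: I query with minRelevance = 0 so nothing is filtered out, print the raw relevance of each match for a few questions I already know the answers to, and choose a cut-off that separates the relevant chunks from the irrelevant ones for the model in use. A rough sketch (reusing the memory instance from the earlier snippet; the SearchResult/Citation/Partition property names are what I'd expect in this version and may differ):

```csharp
// Sketch: search with minRelevance = 0 to disable filtering, then inspect the raw
// relevance of each match to calibrate a threshold for the current embedding model.
// Property names (Results, Partitions, Relevance, SourceName) are assumptions.
var result = await memory.SearchAsync("When was the Vivaldi Palace built?", minRelevance: 0);

foreach (var citation in result.Results)
{
    foreach (var partition in citation.Partitions)
    {
        Console.WriteLine($"{partition.Relevance:F2}  {citation.SourceName}");
    }
}
```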