
Different similarity results when using text-embedding-3-small or text-embedding-3-large models

Open marcominerva opened this issue 1 year ago • 4 comments

Context / Scenario

For the same document and question, when using text-embedding-3-small or text-embedding-3-large models, similarity returns results with lower relevance than when using text-embedding-ada-002 model.

What happened?

I'm using the code available at https://github.com/marcominerva/KernelMemoryService with SimpleVectorDb. I have imported the file Taggia.pdf, which is the PDF of the Italian Wikipedia page about the town of Taggia, Italy. Then I searched for "Quante persone vivono a Taggia?" ("How many people live in Taggia?" in English).

If I use the text-embedding-ada-002 model and dig into the source code of SimpleVectorDb,

https://github.com/microsoft/kernel-memory/blob/d127063db78943739397a952c0f19730306bfdab/service/Core/MemoryStorage/DevTools/SimpleVectorDb.cs#L115-L121

I obtain this:

[screenshot: relevance scores returned with text-embedding-ada-002]
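For context, the lines linked above essentially compare the query embedding with every stored embedding and rank the records by similarity. A minimal sketch of that kind of calculation, assuming embeddings are plain float arrays (illustrative only, not the exact SimpleVectorDb source):

```csharp
// Minimal sketch of the relevance calculation used in a brute-force vector search.
// This is illustrative only, not the exact SimpleVectorDb source.
static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}
```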

However, if I use text-embedding-3-small (I have of course deleted the previous memories and re-imported the document), with the same question I get:

[screenshot: relevance scores returned with text-embedding-3-small]

So, if I use these models I need to change the minRelevance parameter I use for my query. With text-embedding-ada-002 I use a value of 0.75, while with the newer models it seems that anything greater than 0.5 is good. Do you agree?
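If the threshold really is model-dependent, one option is to pick minRelevance from the configured embedding model before querying. A rough sketch, where memory is assumed to be an IKernelMemory instance, embeddingModel is the configured model name, and the values are just the ones discussed in this thread, not official recommendations:

```csharp
// Rough sketch: choose minRelevance based on the embedding model in use.
// The values below are only the ones discussed in this thread.
double minRelevance = embeddingModel switch
{
    "text-embedding-ada-002" => 0.75,
    _ => 0.50 // text-embedding-3-small / -large seem to need a lower threshold
};

var answer = await memory.AskAsync("Quante persone vivono a Taggia?", minRelevance: minRelevance);
```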

NOTE: I get similar results with Qdrant as well.

Importance

edge case

Platform, Language, Versions

Kernel Memory v0.35.240318.1

Relevant log output

No response

marcominerva commented Mar 19 '24 10:03

I think that's expected behavior. Bigger and newer models capture more details and understand content better. Something that might seem relevant to ada-002 might be less relevant to the other models; the opposite can happen too. In general, when switching models it's recommended to also "fine tune" thresholds, prompts and other "semantic" settings. Similar scenarios show up with text generation when moving from GPT-3.5 to GPT-4 and to other models. It's similar to changing the image/sound/video compression algorithm at the core of a game: quality, performance and artifacts change, and settings and requirements need to be revisited. We briefly called out this topic last year at //build, along with the need for a new generation of dev tools to measure AI behavior. It's still early days, with some options for prompt fine-tuning; I haven't seen anything for embeddings yet though.
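One practical way to "fine tune" the threshold when switching models is to score a small labelled set of question/chunk pairs with the new model and pick a cut-off between the relevant and irrelevant groups. A minimal sketch, assuming an embedAsync delegate that wraps whatever embedding client you use (it is not part of the Kernel Memory API) and the CosineSimilarity helper sketched earlier:

```csharp
// Sketch of a simple threshold calibration: score labelled (question, chunk) pairs with the
// new embedding model and pick a cut-off between the relevant and irrelevant groups.
// embedAsync is a placeholder for your own embedding call; it is not a Kernel Memory API.
static async Task<double> SuggestThresholdAsync(
    Func<string, Task<float[]>> embedAsync,
    IEnumerable<(string Question, string Chunk, bool Relevant)> labelledPairs)
{
    var relevant = new List<double>();
    var irrelevant = new List<double>();

    foreach (var (question, chunk, isRelevant) in labelledPairs)
    {
        double score = CosineSimilarity(await embedAsync(question), await embedAsync(chunk));
        (isRelevant ? relevant : irrelevant).Add(score);
    }

    // Halfway between the lowest "relevant" score and the highest "irrelevant" one;
    // with only a handful of samples this is just a starting point, not a guarantee.
    return (relevant.Min() + irrelevant.Max()) / 2;
}
```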

dluc commented Mar 19 '24 15:03

Thank you @dluc for the answer. Now I'm experiencing some weird situations in which a question that had a similarity of 0.79 with text-embedding-ada-002 now has only 0.33 with text-embedding-3-small and 0.27 with text-embedding-3-large, so it is very difficult to set a valid threshold.

Now I'm trying to increase MaxMatchesCount in the SearchClientConfig and the MaxTokens used by text generation.
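For reference, that tuning would look roughly like the snippet below; the builder method and property names are taken from Kernel Memory examples around v0.35 and should be double-checked against the version in use:

```csharp
// Rough sketch of the tuning mentioned above; verify the names against your Kernel Memory version.
var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY"))
    .WithSearchClientConfig(new SearchClientConfig
    {
        MaxMatchesCount = 10, // pass more chunks to the answer prompt
        AnswerTokens = 500    // allow a longer generated answer
    })
    .Build<MemoryServerless>();
```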

marcominerva commented Mar 19 '24 16:03

That's a pretty big difference. Are the chunks the same? Looking at the content, which model do you think is "right"? E.g., is the text actually relevant, as ada-002 says, or not so much, as 3-small says?

dluc commented Mar 19 '24 16:03

Yes, the chunks are the same, and the text is relevant, as text-embedding-ada-002 says. For example, among others I have a chunk (about 1000 tokens) that contains something like "and near the town there is Vivaldi Palace, built in 1458", and I ask "When was the Vivaldi Palace built?":

  • text-embedding-ada-002 says the chunk has a similarity of 0.79 with my question
  • text-embedding-3-small returns a similarity of 0.33
  • text-embedding-3-large returns the lowest similarity, 0.27
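To double-check numbers like these outside of Kernel Memory, the same question and chunk can be embedded with each model and compared directly. A small sketch, where EmbedAsync(model, text) is a placeholder for whatever OpenAI embedding client is in use and CosineSimilarity is the helper sketched earlier:

```csharp
// Sketch to reproduce the comparison above; EmbedAsync is a placeholder for your embedding client.
string question = "When was the Vivaldi Palace built?";
string chunk = "... and near the town there is Vivaldi Palace, built in 1458 ...";

foreach (var model in new[] { "text-embedding-ada-002", "text-embedding-3-small", "text-embedding-3-large" })
{
    float[] q = await EmbedAsync(model, question);
    float[] c = await EmbedAsync(model, chunk);
    Console.WriteLine($"{model}: similarity = {CosineSimilarity(q, c):F2}");
}
```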

marcominerva commented Mar 19 '24 16:03