Similarity search not working well when the number of ingested documents is large, say over one hundred.
When only a few documents are embedded into the vector DB everything works fine: with similarity search I can always find the most relevant documents at the top of the results. But once it grows to over a hundred, the search results become very confusing; given the same query I cannot find any relevant documents. I've tried Chroma and FAISS, same story.
Does anyone have any idea? Thanks. BTW, the documents are in Chinese...
Yeah, I'm experiencing the same. I think it's related to the embedding method used. In my experience, OpenAI embeddings seem to work better than the ones from HuggingFace.
@20001LastOrder Thanks, it must be an embedding issue for Chinese content. I've tried ada embedding, HuggingFace sentence embedding, and the ChromaDB default embedding; among these three, ada seems to outperform the others a little, but it still loses accuracy as the number of documents increases.
I would like to know if anyone is having a similar experience with English documents?
My 189-page PDF document gives poor results with similarity search with score in a FOR loop. It does not return the right response even for a page containing the same words as my query. For this trial, HuggingFaceEmbeddings with FAISS or Chroma was used.
English content?
Yes, it is written in English.
I'm having similar issues with English content using LlamaCppEmbeddings. In fact, the most relevant document is often the last or second-to-last document in the list, which makes it essentially impossible to do question answering with document context using LlamaCpp.
I've tried the Chroma, FAISS, and DeepLake vectorstores. I've also tried similarity_search and max_marginal_relevance with both vector-based and text-based inputs. All seem to suffer from this problem, which leads me to believe it's an issue with LlamaCppEmbeddings.
Testing with the default embedding function SentenceTransformerEmbeddingFunction, I'm getting far more predictable and reasonable output from similarity_search with Chroma.
It's surprising to me that LlamaCppEmbeddings does such a poor job. I'll be filing a ticket with llama-cpp-python because I don't see how the problem could be with the very simple implementation in langchain -- it does almost nothing itself.
filed https://github.com/abetlen/llama-cpp-python/issues/105
For English content I believe ada embedding will be your best choice; the higher the dimension of the vector, the better the performance.
Thanks for creating this issue! It touches on a very important point: generic pre-trained embeddings usually don't provide the best recall for retrieval in your particular domain. OpenAI is "easy" to set up, but you'll have to evaluate it for your specific use case.
Thanks for the advice. I agree that OpenAI will provide good results, but I would like to get suitably useful results from other models such as HuggingFaceEmbeddings.
I see. I have a friend who tested over 100 embedding models and ranked the top performers. I will ask him which one is the best alternative to ada and get back to you once I have his reply.
Hello, I was just wondering if you've had a chance to hear back from your friend regarding the best alternative to ada. If you have any information, I would appreciate it if you could share it with us :).
distiluse-base-multilingual-cased-v1, distiluse-base-multilingual-cased-v2, and shibing624/text2vec-base-chinese.
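For anyone who wants to try them: these drop straight into langchain's HuggingFaceEmbeddings. A minimal sketch, where texts is a placeholder for your list of chunk strings:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Swap the default model for one of the suggestions above.
embeddings = HuggingFaceEmbeddings(model_name="shibing624/text2vec-base-chinese")
db = Chroma.from_texts(texts, embeddings)
hits = db.similarity_search("my query", k=5)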
I'm also facing a similar problem. Is there a way to fine-tune embedding models on custom datasets (e.g. text from 100 PDFs)? For this purpose, do I need to label the data manually into sentence pairs?
For OAI: https://github.com/openai/openai-cookbook/blob/main/examples/Customizing_embeddings.ipynb
For HuggingFace: https://huggingface.co/blog/how-to-train-sentence-transformers
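A minimal sketch of the sentence-transformers route from that blog post. The model name and training pairs below are placeholders; with MultipleNegativesRankingLoss you only need positive (query, passage) pairs mined from your own PDFs, since negatives come from the rest of the batch:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder positive pairs; in practice, mine (question, passage) pairs from your corpus.
train_examples = [
    InputExample(texts=["what is the refund policy?", "Refunds are issued within 30 days of purchase..."]),
    InputExample(texts=["how do I reset my password?", "To reset your password, open Settings and..."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # uses in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("my-finetuned-embedder")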
I posted a question in another thread but think it might be useful here too. The sentence transformers have a truncation length, and I suspect that if the text splitter creates documents larger than what the embedding model can ingest, the embeddings will be of poor quality.
Sadly, the multilingual sentence transformers all have a truncation length of 128 tokens!
See my question here: https://github.com/hwchase17/langchain/issues/2026#issuecomment-1601097513
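A quick way to see the limit for yourself (the model name is just an example):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distiluse-base-multilingual-cased-v1")
print(model.max_seq_length)  # 128 -- tokens beyond this are silently dropped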
Met the same problem.
Same here. example 25057b
(copying my message from there)
I'm sharing my own 'rolling' sbert script to avoid clipping the sentences. It's seemingly functional but not very elegant; a class would be better of course, but I just hope it helps someone.
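A minimal sketch of the rolling idea (window size, stride, and model name are illustrative): split the text into overlapping word windows that fit under max_seq_length, embed each window, and mean-pool the window vectors into one document vector:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distiluse-base-multilingual-cased-v1")

def rolling_embed(text, window=100, stride=50):
    # Overlapping word windows small enough to avoid the 128-token truncation.
    words = text.split()
    chunks = [" ".join(words[i:i + window]) for i in range(0, len(words), stride)] or [text]
    vecs = model.encode(chunks)           # one vector per window
    return np.asarray(vecs).mean(axis=0)  # mean-pool into a single document vector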
Met the same issue. The way our team handled it is to first reduce the search space / document count by labeling the docs (by an "abstract" or something) in advance, doing a vector similarity search over those labels, and then doing the similarity search within the reduced space. Also, the way you split the documentation is very important: the longer the chunks, the harder it is to match a key word, so we split by a fixed number of characters to keep the chunks smaller. A rough sketch of the two-stage search is below.
Hope this helps!
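The sketch, assuming abstracts (one per document), chunk_texts, chunk_metas (each carrying its parent doc_id), and query are prepared beforehand -- all names here are placeholders:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

emb = OpenAIEmbeddings()

# Stage 1: search only the short per-document labels (abstracts).
label_store = Chroma.from_texts(
    texts=abstracts,
    embedding=emb,
    metadatas=[{"doc_id": i} for i in range(len(abstracts))],
)
label_hits = label_store.similarity_search(query, k=3)
candidates = {h.metadata["doc_id"] for h in label_hits}

# Stage 2: search the full chunks, keeping only chunks from candidate documents.
chunk_store = Chroma.from_texts(texts=chunk_texts, embedding=emb, metadatas=chunk_metas)
hits = chunk_store.similarity_search(query, k=20)
hits = [h for h in hits if h.metadata["doc_id"] in candidates][:5]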
Yes, I have the same issue with English documents once there are over 100 docs.
@luozixuan Hi, may I know how you label them? For large documents, the similarity search often cannot find the right matching chunk, and even when it does, throwing too many chunks at GPT makes it unable to produce the right answer. I was thinking of creating metadata for each chunk, such as keywords, but I don't know how to generate keywords per chunk. So I'm interested in how you are labeling the documents.
I'll share my scenario: say I want to infer typical customer opinions from a million customer reviews, where the contents of the reviews are somewhat similar. I want the LLM to thoroughly skim all the chunks of reviews (say 200 reviews per chunk). The only solution I can think of so far is map-reduce; I'm not sure if you have any better ideas?
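For reference, a minimal map-reduce sketch with langchain's summarize chain; review_batches is a placeholder for the pre-grouped review text (roughly 200 reviews per string):

from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document

llm = ChatOpenAI(temperature=0)
docs = [Document(page_content=batch) for batch in review_batches]

# Map: the LLM skims each batch independently. Reduce: the partial takes are combined.
chain = load_summarize_chain(llm, chain_type="map_reduce")
print(chain.run(docs))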
Vector representations work great for frequent entries but give poor results at the tail of the query distribution. In such cases, a simple keyword search yields better IR.
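For example, a cheap BM25 pass over the same chunks with the rank_bm25 package can back up the vector search on rare queries; chunk_texts and the query string are placeholders:

from rank_bm25 import BM25Okapi

# A real setup would use a proper tokenizer instead of lower().split().
bm25 = BM25Okapi([t.lower().split() for t in chunk_texts])
top_chunks = bm25.get_top_n("rare tail query".lower().split(), chunk_texts, n=5)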
A workaround I found is to purge the vector DB from memory and create a new vectorstore for each new doc in a loop. Not time-efficient, but it guarantees high precision.
While I'm glad you are satisfied with your code, I must note that if you are resetting your index for each vectorstore on the fly then you have no need for a vectorstore in the first place. It also does nothing to address the OP's issue of low retriever recall on larger corpora, which is improved through a mixture of better feature representation and better querying.
@hinthornw Agreed, that's a valid point. However, splitting documents and doing similarity search is easy and precise with the langchain Chroma vectorstore. I couldn't find better alternatives without creating a vectorstore.
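That said, for a small corpus you can get the same precision without any vectorstore by scoring by brute force, as in the snippet below.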
import numpy as np

k = 5
# Embed the query and the candidate texts with any langchain Embeddings object;
# docs is assumed to be a list of strings.
query_vec = np.array(embeddings.embed_query("my_query"))
doc_vecs = np.array(embeddings.embed_documents(docs))

# Dot-product scores (equal to cosine similarity when embeddings are unit-normalized).
scores = query_vec @ doc_vecs.T
top_docs = [docs[i] for i in np.argsort(-scores)[:k]]