Similarity search not working well when the number of ingested documents is large, say over one hundred.
When only a few documents are embedded into the vector DB everything works fine: with similarity search I can always find the most relevant documents at the top of the results. But once it grows to over a hundred, the search results become very confusing; given the same query I cannot find any relevant documents. I've tried Chroma and FAISS, same story.
Does anyone have any idea? Thanks. BTW, the documents are in Chinese...
Yeah, I'm experiencing the same. I think it's related to the embedding method used. In my experience, OpenAI embeddings seem to work better than the ones from HuggingFace.
@20001LastOrder Thanks, it must be an embedding issue for Chinese content. I've tried ada embedding, HuggingFace sentence embedding, and the ChromaDB default embedding; among these three, ada seems to outperform the others a little, but it still loses accuracy as the number of documents increases.
I would like to know if anyone is having a similar experience with English documents?
My 189-page PDF document gives poor results with similarity search with score in a FOR loop. It does not return the right response even for a page containing the same words as my query. For this trial, HuggingFaceEmbeddings with FAISS or Chroma was used.
English content?
Yes, it is written in English.
I'm having similar issues with English content using LlamaCppEmbeddings. In fact, the most relevant document is often the last or second-to-last document in the list, which makes it essentially impossible to do question answering with document context using LlamaCpp.
I've tried the Chroma, FAISS, and DeepLake vectorstores. I've also tried similarity_search and max_marginal_relevance with both vector-based and text-based inputs. All seem to suffer from this problem, which leads me to believe it's an issue with LlamaCppEmbeddings.
Testing with the default embedding function SentenceTransformerEmbeddingFunction, I'm getting far more predictable and reasonable output from similarity_search with Chroma.
It's surprising to me that LlamaCppEmbeddings does such a poor job. I'll be filing a ticket with llama-cpp-python because I don't see how the problem could be with the very simple implementation in langchain -- it does almost nothing itself.
filed https://github.com/abetlen/llama-cpp-python/issues/105
For English content I believe ada embedding will be your best choice; the higher the dimension of the vector, the better the performance.
Thanks for creating this issue! It touches on a very important point: generic pre-trained embeddings usually don't provide the best recall for retrieval in your particular domain. OpenAI is "easy" to set up, but you'll have to evaluate it for your specific use case.
Thanks for the advice. I agree that OpenAI will provide good results, but I would like to get suitably useful results from other models such as HuggingFaceEmbeddings.
I see. I have a friend who tested over 100 embedding models and ranked the top performers. I will ask him which one is the best alternative to ada and get back to you once I have his reply.
Hello, I was just wondering if you've had a chance to hear back from your friend regarding the best alternative to ada. If you have any information, I would appreciate it if you could share it with us :).
distiluse-base-multilingual-cased-v1, distiluse-base-multilingual-cased-v2, and shibing624/text2vec-base-chinese.
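For anyone who wants to try them: these drop straight into langchain's HuggingFaceEmbeddings. A minimal sketch, where texts is a placeholder for your list of chunk strings:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Swap the default model for one of the suggestions above.
embeddings = HuggingFaceEmbeddings(model_name="shibing624/text2vec-base-chinese")
db = Chroma.from_texts(texts, embeddings)
hits = db.similarity_search("my query", k=5)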
I'm also facing a similar problem. Is there a way to fine-tune embedding models on custom datasets (e.g. text from 100 PDFs)? For this purpose, do I need to label the data manually into sentence pairs?
For OAI: https://github.com/openai/openai-cookbook/blob/main/examples/Customizing_embeddings.ipynb
For HuggingFace: https://huggingface.co/blog/how-to-train-sentence-transformers
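A minimal sketch of the sentence-transformers route from that blog post. The model name and training pairs below are placeholders; with MultipleNegativesRankingLoss you only need positive (query, passage) pairs mined from your own PDFs, since negatives come from the rest of the batch:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder positive pairs; in practice, mine (question, passage) pairs from your corpus.
train_examples = [
    InputExample(texts=["what is the refund policy?", "Refunds are issued within 30 days of purchase..."]),
    InputExample(texts=["how do I reset my password?", "To reset your password, open Settings and..."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # uses in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("my-finetuned-embedder")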
I posted a question in another thread but think it might be useful here too. The sentence transformers have a truncation length, and I suspect that if the text splitter creates documents larger than what the embedding model can ingest, the embeddings will be of poor quality.
Sadly, the multilingual sentence transformers all have a truncation length of 128 tokens!
See my question here: https://github.com/hwchase17/langchain/issues/2026#issuecomment-1601097513
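A quick way to see the limit for yourself (the model name is just an example):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distiluse-base-multilingual-cased-v1")
print(model.max_seq_length)  # 128 -- tokens beyond this are silently dropped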
Met the same problem.
Same here. example 25057b
(copying my message from there)
I'm sharing my own 'rolling' sbert script to avoid clipping the sentences. It's seemingly functional but not very elegant; a class would be better of course, but I just hope it helps someone.
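A minimal sketch of the rolling idea (window size, stride, and model name are illustrative): split the text into overlapping word windows that fit under max_seq_length, embed each window, and mean-pool the window vectors into one document vector:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distiluse-base-multilingual-cased-v1")

def rolling_embed(text, window=100, stride=50):
    # Overlapping word windows small enough to avoid the 128-token truncation.
    words = text.split()
    chunks = [" ".join(words[i:i + window]) for i in range(0, len(words), stride)] or [text]
    vecs = model.encode(chunks)           # one vector per window
    return np.asarray(vecs).mean(axis=0)  # mean-pool into a single document vector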
Met the same issue. The way our team handled it is to first reduce the search space / document count by labeling the docs (by an "abstract" or something) in advance, doing a vector similarity search over those labels, and then doing the similarity search within the reduced space. Also, the way you split the documentation is very important: the longer the chunks, the harder it is to match a key word, so we split by a fixed number of characters to keep the chunks smaller. A rough sketch of the two-stage search is below.
Hope this helps!
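The sketch, assuming abstracts (one per document), chunk_texts, chunk_metas (each carrying its parent doc_id), and query are prepared beforehand -- all names here are placeholders:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

emb = OpenAIEmbeddings()

# Stage 1: search only the short per-document labels (abstracts).
label_store = Chroma.from_texts(
    texts=abstracts,
    embedding=emb,
    metadatas=[{"doc_id": i} for i in range(len(abstracts))],
)
label_hits = label_store.similarity_search(query, k=3)
candidates = {h.metadata["doc_id"] for h in label_hits}

# Stage 2: search the full chunks, keeping only chunks from candidate documents.
chunk_store = Chroma.from_texts(texts=chunk_texts, embedding=emb, metadatas=chunk_metas)
hits = chunk_store.similarity_search(query, k=20)
hits = [h for h in hits if h.metadata["doc_id"] in candidates][:5]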
Yes, I have the same issue with English documents once there are over 100 docs.
@luozixuan Hi, may I know how you label them? For large documents, the similarity search often cannot find the right matching chunk, and even when it does, throwing too many chunks at GPT makes it unable to produce the right answer. I was thinking of creating metadata for each chunk, such as keywords, but I don't know how to generate keywords per chunk. So I'm interested in how you are labeling the documents.
I'll share my scenario: say I want to infer typical customer opinions from a million customer reviews, where the contents of the reviews are somewhat similar. I want the LLM to thoroughly skim all the chunks of reviews (say 200 reviews per chunk). The only solution I can think of so far is map-reduce; I'm not sure if you have any better ideas?
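For reference, a minimal map-reduce sketch with langchain's summarize chain; review_batches is a placeholder for the pre-grouped review text (roughly 200 reviews per string):

from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document

llm = ChatOpenAI(temperature=0)
docs = [Document(page_content=batch) for batch in review_batches]

# Map: the LLM skims each batch independently. Reduce: the partial takes are combined.
chain = load_summarize_chain(llm, chain_type="map_reduce")
print(chain.run(docs))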
Vector representations work great for frequent entries but give poor results at the tail of the query distribution. In such cases, a simple keyword search yields better IR.
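For example, a cheap BM25 pass over the same chunks with the rank_bm25 package can back up the vector search on rare queries; chunk_texts and the query string are placeholders:

from rank_bm25 import BM25Okapi

# A real setup would use a proper tokenizer instead of lower().split().
bm25 = BM25Okapi([t.lower().split() for t in chunk_texts])
top_chunks = bm25.get_top_n("rare tail query".lower().split(), chunk_texts, n=5)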
A workaround I found is to purge the vector DB from memory and create a new vectorstore for each new doc in a loop. Not time-efficient, but it guarantees high precision.
While I'm glad you are satisfied with your code, I must note that if you are resetting your index for each vectorstore on the fly then you have no need for a vectorstore in the first place. It also does nothing to address the OP's issue of low retriever recall on larger corpora, which is improved through a mixture of better feature representation and better querying.
@hinthornw Agreed, that's a valid point. However, splitting documents and doing similarity search is easy and precise with the langchain Chroma vectorstore. I couldn't find better alternatives without creating a vectorstore.
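That said, for a small corpus you can get the same precision without any vectorstore by scoring by brute force, as in the snippet below.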
import numpy as np

k = 5
# Embed the query and the candidate texts with any langchain Embeddings object;
# docs is assumed to be a list of strings.
query_vec = np.array(embeddings.embed_query("my_query"))
doc_vecs = np.array(embeddings.embed_documents(docs))

# Dot-product scores (equal to cosine similarity when embeddings are unit-normalized).
scores = query_vec @ doc_vecs.T
top_docs = [docs[i] for i in np.argsort(-scores)[:k]]