[Question]: Different chunks returned in RAGFlow vs LangChain
Describe your problem
I've been testing queries in RAGFlow using the all-MiniLM-L6-v2 embedding model and the 'General' method to chunk and embed documents, with Ollama (Llama 3.1 8B) for LLM inference. I am comparing the results to a Streamlit/LangChain app that I built myself, and I'm getting very different results for the document chunks that are returned as part of the RAG query.
I have been comparing the chunks returned during retrieval testing in RAGFlow with the chunks returned when I run the same search in my own LangChain+FAISS solution, and the LangChain+FAISS results seem much more accurate. Is there any way I can improve the accuracy of the chunks returned in RAGFlow? At the moment it is not providing the LLM with the correct chunks.
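For reference, here is roughly what my LangChain+FAISS side looks like (simplified; the file path and chunk sizes are just illustrative, and the import paths can differ between LangChain versions):

```python
# Simplified version of my LangChain+FAISS retrieval test.
# "loan_agreement.pdf" and the chunk sizes are illustrative only.
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = PyPDFLoader("loan_agreement.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(docs)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = FAISS.from_documents(chunks, embeddings)

# Retrieval test: print the top chunks and their scores for a query
for doc, score in store.similarity_search_with_score("What is the interest rate?", k=4):
    print(round(float(score), 4), doc.page_content[:120])
```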
Could you elaborate on your use cases?
And please try this setup:
I am reviewing legal documents like large loan agreements and contracts. I have spent a lot of time trying different chunking methods, delimiters, token numbers, keywords, etc. From what I can see, the 'problem' is not with the chunking of the documents but rather with the results returned by the vector search.
I get similar chunks returned when I ask different questions, which means I rarely get the correct answer. I have also noticed that the vector similarity for the results is 0, so I'm not sure it is even doing a vector search.
Weird! Vector sim can't be 0 anyway. What's your embedding model?
I thought it was weird too. I'm using the all-MiniLM-L6-v2 model that comes with the 'full' build. I can see the document chunks and the embedding seems to run without errors, but when I test retrieval no vector search is performed.
I have done more testing, and it affects some models but not others. I tested all the built-in models with the same parameters and the same file; Nomic and Jina AI are the only models that return a vector similarity score during retrieval testing.
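For what it's worth, the model itself produces sensible, non-zero cosine similarities when I call it directly outside RAGFlow, e.g. with something like this (the query and chunk texts here are just examples):

```python
# Quick sanity check that all-MiniLM-L6-v2 returns non-zero similarities
# when used directly, i.e. outside RAGFlow's retrieval pipeline.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query = "What is the interest rate on the loan?"
chunk = "The Borrower shall pay interest at a rate of 4.5% per annum."

q_emb = model.encode(query, normalize_embeddings=True)
c_emb = model.encode(chunk, normalize_embeddings=True)
print(util.cos_sim(q_emb, c_emb))  # clearly greater than 0, not 0
```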
Whoa, same here. I tried LangChain+FAISS vs. another RAG pipeline and thought I was losing my mind with how different the chunks were. Turns out it wasn't me. It was the chunker, the tokenizer, and the whole pipeline logic gaslighting me 😅
Your instinct is 100% right. The default "General" chunking in most RAG flows? It's like chopping a book with kitchen scissors blindfolded: tokens don't respect meaning.
I recently wrote a PDF about this insanity, especially around chunk mismatch, semantic drift, and “false precision” during vectorization. You might like it:
📄 https://github.com/onestardao/WFGY
Key idea:
Token count ≠ context precision. You want semantic-tension–aware chunking, where the chunk boundaries bend around meaning, not characters.
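If it helps, here's a rough sketch of what I mean by meaning-aware boundaries (a simplification, not the exact method in the PDF; the threshold and the naive sentence splitting are illustrative):

```python
# Sketch of similarity-aware chunking: start a new chunk when the embedding
# similarity between consecutive sentences drops below a threshold.
import re
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def semantic_chunks(text: str, threshold: float = 0.4) -> list[str]:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    embs = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(util.cos_sim(embs[i - 1], embs[i]))
        if sim < threshold:          # meaning shifts, so close the chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```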
And hey, bonus: this idea got a rare nod from the creator of tesseract.js (yeah, the OCR legend with 36k GitHub stars). So I'm either onto something... or I fooled a genius. Either way, worth a skim.
Hang in there. You're not chunking alone 💥