
[Question]: Different chunks returned in RAGFlow vs LangChain


Describe your problem

I've been testing queries in RAGFlow using the embedding model all-MiniLM-L6-v2 and the 'General' method to chunk and embed documents, with Ollama (Llama 3.1 8B) for LLM inference. I am comparing the results to a Streamlit/LangChain app that I built myself, and I'm getting very different results for the document chunks that are returned as part of the RAG query.

I have been comparing the results returned during retrieval testing in RAGFlow against the chunks returned when I run the same search on my LangChain+FAISS solution. I seem to get much more accurate results with LangChain+FAISS. Is there any way to improve the accuracy of the chunks returned in RAGFlow? At the moment it is not providing the LLM with the correct chunks.
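For context, the LangChain side looks roughly like this (a simplified sketch: the file name, query, and splitter settings below are placeholders, not my exact configuration):

```python
# Simplified sketch of the LangChain + FAISS comparison pipeline.
# File name, query, and chunking parameters are placeholders.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Same embedding model as the RAGFlow setup under test.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Chunk the document; these settings are illustrative only.
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_text(open("loan_agreement.txt").read())

# Build the index and run the same query used in RAGFlow's retrieval testing.
store = FAISS.from_texts(chunks, embeddings)
for doc, score in store.similarity_search_with_score("What is the default interest rate?", k=4):
    # With the default FAISS index this score is an L2 distance (lower = closer).
    print(f"{score:.4f}  {doc.page_content[:80]}")
```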

rplescia avatar Oct 30 '24 14:10 rplescia

Could you elaborate on your case? And please try this setup: [screenshot: suggested retrieval configuration]

KevinHuSh avatar Oct 31 '24 02:10 KevinHuSh

I am reviewing legal documents like large loan agreements and contracts. I have spent a lot of time trying different chunking methods, delimiters, token numbers, keywords, etc. From what I can see, the 'problem' is not with the chunking of the documents but rather with the results returned by the vector search.

I get similar chunks returned when I ask different questions, which means I rarely get the correct answer. I have also noticed that the vector similarity for the results is 0, so I'm not sure it is even doing a vector search.

[screenshot: retrieval-testing results showing a vector similarity of 0]
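A direct sanity check with sentence-transformers (query and chunk below are placeholders) shows what the raw model returns; a genuine cosine similarity of exactly 0 would be essentially impossible for related text:

```python
# Sanity check: score a query against a chunk with the raw embedding model,
# outside of RAGFlow. The texts below are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "What is the termination clause in this agreement?"
chunk = "Either party may terminate this agreement with 30 days' written notice."

q_emb = model.encode(query, convert_to_tensor=True)
c_emb = model.encode(chunk, convert_to_tensor=True)

# util.cos_sim returns a 1x1 tensor; .item() extracts the float.
print(f"cosine similarity: {util.cos_sim(q_emb, c_emb).item():.4f}")
```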

rplescia avatar Oct 31 '24 09:10 rplescia

Weird! Vector sim can't be 0 anyway. What's your embedding model?

KevinHuSh avatar Nov 01 '24 02:11 KevinHuSh

I thought it was weird too. I'm using the all-MiniLM-L6-v2 model that comes with the 'full' build. I can see the document chunks, and the embedding seems to run without any error, but when testing retrieval no vector search is performed.

[screenshot: chunks embedded successfully, but retrieval testing shows no vector similarity scores]

rplescia avatar Nov 04 '24 07:11 rplescia

I have done more testing, and it affects some models but not others. I have tested all the built-in models with the same parameters and the same file; Nomic and Jina are the only models that return a vector similarity score during retrieval testing.
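One way to reproduce this comparison outside RAGFlow is to score the same query/chunk pair with each model directly. The Hugging Face IDs below are my guesses at which checkpoints the built-ins correspond to, and the Nomic and Jina models ship custom code, hence trust_remote_code:

```python
# Score one query/chunk pair across the embedding models discussed above.
# Model IDs are assumptions about the checkpoints RAGFlow bundles.
from sentence_transformers import SentenceTransformer, util

MODELS = [
    "sentence-transformers/all-MiniLM-L6-v2",
    "nomic-ai/nomic-embed-text-v1",
    "jinaai/jina-embeddings-v2-base-en",
]

query = "What is the governing law of this contract?"  # placeholder
chunk = "This agreement shall be governed by the laws of England and Wales."

for name in MODELS:
    # trust_remote_code is required for the Nomic and Jina checkpoints.
    model = SentenceTransformer(name, trust_remote_code=True)
    sim = util.cos_sim(
        model.encode(query, convert_to_tensor=True),
        model.encode(chunk, convert_to_tensor=True),
    ).item()
    print(f"{name}: {sim:.4f}")
```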

rplescia avatar Nov 05 '24 12:11 rplescia

Whoa, same here. I tried LangChain+FAISS against another RAG pipeline and thought I was losing my mind over how different the chunks were. Turns out it wasn't me. It was the chunker, the tokenizer, and the whole pipeline logic gaslighting me 😅

Your instinct is 100% right. The default "General" chunking in most RAG flows is like chopping a book with kitchen scissors blindfolded: tokens don't respect meaning.

I recently wrote a PDF about this insanity, especially around chunk mismatch, semantic drift, and “false precision” during vectorization. You might like it:

📄 https://github.com/onestardao/WFGY

Key idea:

Token count ≠ context precision. You want semantic-tension-aware chunking, where the chunk boundaries bend around meaning, not character counts.
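To make that concrete, here's a naive sketch of the principle (an illustration only, not WFGY's actual algorithm): embed each sentence and start a new chunk wherever similarity to the previous sentence drops below a threshold.

```python
# Naive semantic chunking sketch: start a new chunk wherever cosine
# similarity between adjacent sentences drops below a threshold.
# Illustrates the principle only; not WFGY's actual algorithm.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def semantic_chunks(sentences, threshold=0.4):
    embs = model.encode(sentences, convert_to_tensor=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # A boundary is where meaning shifts, regardless of token count.
        if util.cos_sim(embs[i - 1], embs[i]).item() < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "The borrower shall repay the principal in quarterly installments.",
    "Interest accrues at SOFR plus 2 percent per annum.",
    "This agreement is governed by the laws of New York.",
]
print(semantic_chunks(sentences))
```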

And hey, bonus: this idea got a rare nod from the creator of tesseract.js (yeah, the OCR legend—36k GitHub stars). So I’m either onto something... or I fooled a genius. Either way, worth a skim.

Hang in there. You're not chunking alone 💥

onestardao avatar Jul 23 '25 11:07 onestardao