[Question]: Why does knowledge graph contain nodes that are not available in the documents?

duguyue100 opened this issue 8 months ago • 4 comments

Do you need to ask a question?

  • [x] I have searched the existing questions and discussions and this question is not already answered.
  • [x] I believe this is a legitimate question, not just a bug or feature request.

Your Question

Thanks for open-sourcing the project. I tried a naive setup that uses nomic-embed-text as the text embedder and Llama 3.2 3B as the LLM. I fed all my doc pages (140 markdown technical documents) into LightRAG, and to my surprise, the knowledge graph contains many nodes that have nothing to do with the docs I put in. Does anyone have an idea what's going on? I used the lightrag-serve function. Besides configuring the text embedder and LLM (with a 32k context window), I used all default settings.

Additional Context

(Two screenshots attached.)

duguyue100 avatar Apr 11 '25 08:04 duguyue100

This is just speculation, but you could add logging like "print(answer_llm)" to the console and check that the model output is what you expect, or just take the "entity_extraction" prompt with an example chunk and paste it into PowerShell after running "ollama run llama3.2:3b" and see what the output is.
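
If you'd rather script that check than paste into PowerShell, something like this works. Just a minimal sketch: it assumes the `ollama` Python client is installed and llama3.2:3b has been pulled, and the prompt below is a simplified stand-in for LightRAG's real entity_extraction prompt (which lives in lightrag/prompt.py in the versions I've seen); the chunk is a placeholder.

```python
# Sketch: eyeball what llama3.2:3b actually returns for an extraction-style
# prompt. Assumes the `ollama` Python client is installed and the model is
# pulled. The prompt is a simplified stand-in for LightRAG's real
# entity_extraction prompt (see lightrag/prompt.py in your installed version).
import ollama

# Paste one real chunk from your documents here (placeholder text below).
chunk = "LightRAG builds a knowledge graph from the documents you insert."

prompt = (
    "Extract all entities (name, type, description) and the relationships "
    "between them from the text below. Output one record per line.\n\n"
    f"Text:\n{chunk}"
)

response = ollama.chat(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": prompt}],
)

# If the model invents entities for this one chunk, it will also pollute
# the knowledge graph during indexing.
print(response["message"]["content"])
```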

From my experience, any model below 14B parameters isn't going to be good enough for extracting the entities. The nodes/edges you are seeing are probably just hallucinations from your small model.

Even bigger models like gpt-4o-mini might sometimes not fully extract everything you wanted extracted. Although, I have to say, for price/quality and from what I have tested, gpt-4o-mini is perfect.

You could, however, first use a model like gpt-4o-mini for the extraction and then, once everything is done, switch to a smaller model.

Hope this helps.

frederikhendrix avatar Apr 11 '25 09:04 frederikhendrix

I see, thanks! I will give it a try. I didn't expect a small LLM to hallucinate this much. I don't have the freedom to use a public model right now, so maybe I'll just switch to a bigger model and give it another push.

duguyue100 avatar Apr 11 '25 10:04 duguyue100

Yes, I would always try to use the biggest model you can run for extracting the data (even though it could take a very long time to extract it). Also, a tip is to start with just one chunk.

And after the extraction is done with the big model, you can switch to the small model to ask questions, but keep the embedding model the same. Roughly like the sketch below.
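
A sketch only, based on the README examples of one recent LightRAG release: the import paths have moved between versions, and newer releases also expose async variants (ainsert/aquery) plus a separate storage-initialization step, so adapt it to your installed version. The working directory, file name, and question are placeholders.

```python
# Sketch of the two-phase setup (assumption: import paths of a recent
# LightRAG release; adjust for your version).
from lightrag import LightRAG, QueryParam
from lightrag.llm.ollama import ollama_model_complete, ollama_embed
from lightrag.llm.openai import gpt_4o_mini_complete
from lightrag.utils import EmbeddingFunc

WORKING_DIR = "./rag_storage"  # both phases must share this directory

# Keep the embedding function identical across phases, otherwise the
# vectors stored in phase 1 are unusable in phase 2.
embedding = EmbeddingFunc(
    embedding_dim=768,   # nomic-embed-text output size
    max_token_size=8192,
    func=lambda texts: ollama_embed(texts, embed_model="nomic-embed-text"),
)

# Phase 1: extract entities/relations with the strongest model you can afford.
rag_build = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=gpt_4o_mini_complete,
    embedding_func=embedding,
)
rag_build.insert(open("doc_001.md").read())  # placeholder document

# Phase 2: answer questions with a cheaper local model over the same storage.
rag_query = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=ollama_model_complete,
    llm_model_name="llama3.2:3b",
    embedding_func=embedding,
)
print(rag_query.query("placeholder question", param=QueryParam(mode="hybrid")))
```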

frederikhendrix avatar Apr 11 '25 10:04 frederikhendrix

We recommend using a model with at least 32B parameters, the bigger the better. Additionally, the choice of embedding model significantly impacts query performance. nomic-embed-text is not a good choice.

danielaskdd avatar Apr 12 '25 00:04 danielaskdd

@danielaskdd Really? This kind of know-how is very important, as I've heard everywhere that nomic-embed-text is a high-performance embedder. Are there top ones you would recommend?

duguyue100 avatar Apr 14 '25 08:04 duguyue100

Numerous benchmark results are publicly available via a Google search. Personal recommendations:

OpenAI:

  • text-embedding-3-large
  • text-embedding-ada-002

Jina AI:

  • jina-embeddings-v3

Local Deployment:

  • BGE-M3
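
For local deployment, here is a rough sketch of wiring BGE-M3 in through sentence-transformers. It assumes sentence-transformers is installed, and the EmbeddingFunc interface may differ slightly between LightRAG versions.

```python
# Sketch: using BGE-M3 locally via sentence-transformers (assumption:
# sentence-transformers is installed; EmbeddingFunc's interface may vary
# between LightRAG versions).
import numpy as np
from sentence_transformers import SentenceTransformer
from lightrag.utils import EmbeddingFunc

model = SentenceTransformer("BAAI/bge-m3")

async def bge_m3_embed(texts: list[str]) -> np.ndarray:
    # encode() is synchronous; for production, run it off the event loop
    # (e.g. via asyncio.to_thread).
    return model.encode(texts, normalize_embeddings=True)

embedding_func = EmbeddingFunc(
    embedding_dim=1024,   # BGE-M3 dense vector size
    max_token_size=8192,  # BGE-M3 handles long inputs
    func=bge_m3_embed,
)
# Pass embedding_func into LightRAG(...). After switching embedding models,
# re-index everything, since the old stored vectors won't match the new space.
```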

danielaskdd avatar Apr 14 '25 10:04 danielaskdd