[Question]: Why does knowledge graph contain nodes that are not available in the documents?
Do you need to ask a question?
- [x] I have searched the existing questions and discussions and this question is not already answered.
- [x] I believe this is a legitimate question, not just a bug or feature request.
Your Question
Thanks for open-sourcing the project. I've tried a naive setup that uses nomic-embed-text as the text embedder and Llama 3.2 3B as the LLM. I fed all my doc pages into LightRAG, and to my surprise, there are many nodes in the knowledge graph that have nothing to do with the docs I put in (140 markdown technical documents). Does anyone have an idea what's going on? I used the lightrag-serve function. Besides configuring the text embedder and LLM (with a 32k context window), I used all default settings.
Additional Context
This is just speculation, but you could add logging like "print(answer_llm)" to the console and check that the model output is what you expect, or just take the "entity_extraction" prompt with an example chunk, paste it into PowerShell after running "ollama run llama3.2:3b", and see what the output is.
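For reference, here is a minimal sketch of what that logging could look like if you drive LightRAG from Python rather than the server. The wrapper itself is hypothetical, and `ollama_model_complete` is LightRAG's bundled Ollama helper; swap in whatever completion function you actually pass as `llm_model_func`:

```python
# Hypothetical debugging wrapper: print every raw LLM answer before LightRAG
# parses it into entities/relations.
from lightrag.llm import ollama_model_complete

async def logged_llm(prompt, system_prompt=None, history_messages=[], **kwargs):
    answer = await ollama_model_complete(
        prompt,
        system_prompt=system_prompt,
        history_messages=history_messages,
        **kwargs,
    )
    print("----- raw LLM output -----")
    print(answer)  # check whether the extracted entity/relation records look sane
    print("--------------------------")
    return answer

# Pass llm_model_func=logged_llm when constructing LightRAG.
```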
In my experience, any model below 14B parameters isn't going to be good enough for extracting the entities. The nodes/edges you are seeing are probably just hallucinations from your small model.
Even bigger models like gpt-4o-mini might sometimes not fully extract what you wanted extracted. Although, I have to say that for price/quality, from what I have tested, gpt-4o-mini is excellent.
You could, however, first use a model like gpt-4o-mini for extraction and then, once everything is done, switch to a smaller model.
Hope this helps.
I see, thanks! I will give it a try; I didn't expect a small LLM could hallucinate this much. I don't have the freedom to use a public model right now, so maybe I will just switch to a bigger model and give it another push.
Yes, I would always try to use the biggest model you can run for extracting the data (even though it could take a very long time). Also, a tip is to just start with one chunk.
And after the extraction is done with the big model, you can switch to the small model to ask questions, but keep the embedding model the same.
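Roughly, that two-phase workflow could look like the sketch below using the Python API (not the lightrag-server config). The model name "qwen2.5:32b" is a placeholder, and import paths or storage initialization may differ depending on your LightRAG version:

```python
from lightrag import LightRAG, QueryParam
from lightrag.llm import ollama_model_complete, ollama_embedding
from lightrag.utils import EmbeddingFunc

# Keep one embedding function for both phases (dimension must match the model).
embedding = EmbeddingFunc(
    embedding_dim=768,
    max_token_size=8192,
    func=lambda texts: ollama_embedding(texts, embed_model="nomic-embed-text"),
)

# Phase 1: extract the graph with the biggest model you can run (slow, one-off).
rag_big = LightRAG(
    working_dir="./rag_storage",
    llm_model_func=ollama_model_complete,
    llm_model_name="qwen2.5:32b",        # placeholder for a 32B-class model
    embedding_func=embedding,
)
rag_big.insert(open("docs/page_001.md").read())   # start with a single page/chunk

# Phase 2: reuse the same working_dir with a smaller model for answering queries.
rag_small = LightRAG(
    working_dir="./rag_storage",
    llm_model_func=ollama_model_complete,
    llm_model_name="llama3.2:3b",
    embedding_func=embedding,            # embedding model stays identical
)
print(rag_small.query("How do I configure X?", param=QueryParam(mode="hybrid")))
```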
We recommend using a model with at least 32B parameters; the bigger the better. Additionally, the choice of embedding model significantly impacts query performance, and nomic-embed-text is not a good choice.
@danielaskdd Really? This kind of know-how is very important, as I've heard everywhere that nomic-embed-text is a high-performance model. Are there top ones you would recommend?
Numerous benchmark results are publicly available via Google search. Personal recommendations:
OpenAI:
- text-embedding-3-large
- text-embedding-ada-002
Jina AI:
- jina-embeddings-v3
Local Deployment:
- BGE-M3
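For example, swapping the embedder to BGE-M3 could look like the sketch below, serving it through Ollama since that is what the original setup already uses. This is a sketch under the same assumptions as above (helper names from LightRAG's bundled Ollama functions, placeholder model names); BGE-M3 produces 1024-dimensional dense vectors, and changing the embedding model means re-indexing your documents:

```python
from lightrag import LightRAG
from lightrag.llm import ollama_model_complete, ollama_embedding
from lightrag.utils import EmbeddingFunc

rag = LightRAG(
    working_dir="./rag_storage_bge_m3",   # fresh store: old vectors are incompatible
    llm_model_func=ollama_model_complete,
    llm_model_name="qwen2.5:32b",         # placeholder extraction model
    embedding_func=EmbeddingFunc(
        embedding_dim=1024,               # BGE-M3 dense vector size
        max_token_size=8192,
        func=lambda texts: ollama_embedding(texts, embed_model="bge-m3"),
    ),
)
```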