graphrag
[Feature Request]: Improved coreference resolution when building knowledge graph
Do you need to file an issue?
- [X] I have searched the existing issues and this feature is not already filed.
- [ ] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [X] I believe this is a legitimate feature request, not just a question. If this is a question, please use the Discussions area.
Is your feature request related to a problem? Please describe.
Duplicate or closely related entities remain in the knowledge graph after the indexing phase, which degrades the semantic and structural quality of the graph.
During indexing, the current graph extraction process prompts an LLM up to `_max_gleanings` times to extract entities and relationships, as defined in the `GraphExtractor` class. While this iterative approach works well to increase the number of entities extracted from the input document, as seen in Figure 2 of the GraphRAG paper (see below), I notice in my usage that it also introduces several duplicate entities that refer to the same real-world concept or entity, yielding a noisy knowledge graph that is not as actionable as it could be.
Describe the solution you'd like
Add coreference resolution to graph extraction during the indexing phase.
In the `GraphExtractor` class, once the `_process_document` loop has run, we could add a coreference resolution step on the extracted entities before building the networkx graph.
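To make the idea concrete, here is a minimal sketch of what such a step might look like. It is not GraphRAG's actual API: the function names (`normalize`, `resolve_entities`, `merge_relationships`), the data shapes (entity names as plain strings, relationships as `(source, target, label)` tuples) and the `0.85` threshold are all assumptions chosen for illustration. It also uses simple string similarity from the standard library as a stand-in for real coreference resolution, which would more likely rely on embeddings or an extra LLM pass.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


def resolve_entities(names, threshold=0.85):
    """Map every entity name to a canonical name (hypothetical helper).

    Names are grouped greedily: the first-seen name of each group becomes
    the canonical one, and later names join the first group whose
    representative is similar enough after normalization.
    """
    canonical = {}        # original name -> canonical name
    representatives = []  # (normalized name, canonical name) pairs
    for name in names:
        if name in canonical:
            continue
        norm = normalize(name)
        match = None
        for rep_norm, rep_name in representatives:
            if SequenceMatcher(None, norm, rep_norm).ratio() >= threshold:
                match = rep_name
                break
        if match is None:
            # No close match: this name starts a new group.
            representatives.append((norm, name))
            match = name
        canonical[name] = match
    return canonical


def merge_relationships(relationships, canonical):
    """Rewrite edge endpoints to canonical names, dropping self-loops and duplicates."""
    merged = set()
    for source, target, label in relationships:
        s = canonical.get(source, source)
        t = canonical.get(target, target)
        if s != t:
            merged.add((s, t, label))
    return sorted(merged)


# Assumed example data, standing in for the extractor's output.
entities = ["Microsoft", "MICROSOFT", "OpenAI", "Open AI", "Azure"]
relationships = [
    ("MICROSOFT", "Open AI", "invests in"),
    ("Microsoft", "OpenAI", "invests in"),
    ("Open AI", "OpenAI", "same as"),
]

mapping = resolve_entities(entities)
edges = merge_relationships(relationships, mapping)
```

The deduplicated entities and edges could then be passed to the existing graph-building code in place of the raw extraction output. A greedy, first-seen grouping like this is order-dependent and will miss true coreferences such as abbreviations or aliases, so it should be read as a placeholder for whichever resolution method is eventually chosen.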
Additional context
No response