[Feature Request]: Improved coreference resolution when building knowledge graph

Open fpaupier opened this issue 1 year ago • 0 comments

Do you need to file an issue?

[X] I have searched the existing issues and this feature is not already filed.
[ ] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
[X] I believe this is a legitimate feature request, not just a question. If this is a question, please use the Discussions area.

Is your feature request related to a problem? Please describe.

Duplicate or closely related entities are present in the knowledge graph after the indexation phase, decreasing the semantic and structural quality of the graph.

During indexation, the current graph extraction process prompts an LLM up to _max_gleanings times to extract entities and relationships, as defined in the GraphExtractor class. While this iterative approach works well to increase the number of entities extracted from the input document, as seen in figure 2 of the GraphRAG paper (see below), I notice in my usage that it also produces several duplicate entities that refer to the same real-world concept or entity, yielding a noisy knowledge graph that is not as actionable as it could be.

[Image: screenshot of figure 2 from the GraphRAG paper]

Describe the solution you'd like

Add coreference resolution to the graph extraction step of the indexation phase.

In the GraphExtractor class, once the _process_document loop has completed, we could add a coreference resolution step on the extracted entities before building the networkx graph.
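To make the idea concrete, here is a rough sketch of what such a step might look like. It assumes entities are plain name strings and relationships are (source, target, description) tuples, which may not match GraphRAG's internal data structures. The helpers `normalize` and `resolve_coreferences` are hypothetical names, not part of GraphRAG, and string similarity via `difflib` is only a crude stand-in for a real coreference model.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Uppercase, drop periods, and collapse whitespace (hypothetical helper)."""
    return " ".join(name.upper().replace(".", "").split())


def resolve_coreferences(entities, relationships, threshold=0.85):
    """Merge near-duplicate entity names (hypothetical, not GraphRAG's API).

    entities: list of entity name strings, in extraction order.
    relationships: list of (source, target, description) tuples.
    threshold: similarity ratio above which two names are treated as the
        same entity; 0.85 is an arbitrary assumption to be tuned.
    Returns (merged_entities, merged_relationships, mapping).
    """
    canonical = {}  # normalized key -> first-seen (canonical) name
    mapping = {}    # original name -> canonical name

    for name in entities:
        key = normalize(name)
        match = None
        for seen_key, seen_name in canonical.items():
            if SequenceMatcher(None, key, seen_key).ratio() >= threshold:
                match = seen_name
                break
        if match is None:
            canonical[key] = name
            match = name
        mapping[name] = match

    merged_entities = list(canonical.values())

    # Rewrite relationship endpoints to canonical names, then drop exact
    # duplicates while preserving order.
    rewritten = [
        (mapping.get(src, src), mapping.get(dst, dst), desc)
        for src, dst, desc in relationships
    ]
    merged_relationships = list(dict.fromkeys(rewritten))

    return merged_entities, merged_relationships, mapping


if __name__ == "__main__":
    entities = ["GraphRAG", "Graph RAG", "Microsoft", "MICROSOFT", "OpenAI"]
    relationships = [
        ("GraphRAG", "Microsoft", "developed by"),
        ("Graph RAG", "MICROSOFT", "developed by"),
    ]
    merged, rels, mapping = resolve_coreferences(entities, relationships)
    print(merged)
    print(rels)
```

The merged entities and relationships could then be passed on to the graph-building step in place of the raw extraction output. A production version would likely need something stronger than string matching, for example embedding similarity or an extra LLM pass, to catch aliases that do not look alike.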

Additional context

No response

Oct 03 '24 05:10 fpaupier