
Consider adding an entity reconciliation step that merges nodes that appear to be duplicates

Open eyast opened this issue 1 year ago • 1 comments

When running GraphRAG on a story such as the complete works of Sherlock Holmes, the generated graph contains separate nodes that should have been consolidated into one. For example, there are distinct nodes for:

  • Sherlock
  • Sherlock Holmes
  • Mr. Holmes
  • Holmes
  • etc

Other nodes in the graph seem to be sparsely connected. For example, "Baker Street" has an edge with "Mr. Holmes" but with none of the other variants. I suspect this might lead to separate clusters forming, which might affect downstream summarization. Should there be an optional step that attempts to reconcile these entities? I imagine there is no single blanket approach (I can think of many edge cases where the output above would be correct in another context), but maybe the user could be asked to choose whether she wants to 'fuse' the node 'Sherlock' with the node 'Sherlock Holmes', combining them into one?
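To make the 'fuse' idea concrete, here is a minimal sketch in plain Python. It is not GraphRAG code: the function name `fuse_nodes`, the `(source, target, weight)` edge format, and the edge weights are all assumptions made for illustration. The user supplies the alias-to-canonical mapping, and edges that end up parallel after the merge have their weights summed.

```python
from collections import defaultdict


def fuse_nodes(edges, aliases):
    """Rewrite an edge list so that every alias points at its canonical node.

    edges:   iterable of (source, target, weight) tuples (hypothetical format)
    aliases: dict mapping an alias to the canonical name chosen by the user
    """
    merged = defaultdict(float)
    for source, target, weight in edges:
        s = aliases.get(source, source)
        t = aliases.get(target, target)
        if s == t:
            continue  # drop self-loops created by the merge
        key = tuple(sorted((s, t)))  # treat the graph as undirected
        merged[key] += weight  # sum weights of edges that became parallel
    return dict(merged)


# Illustrative data only; the weights are made up.
edges = [
    ("Mr. Holmes", "Baker Street", 1.0),
    ("Sherlock", "Watson", 2.0),
    ("Sherlock Holmes", "Watson", 3.0),
    ("Holmes", "Sherlock", 1.0),
]
aliases = {
    "Sherlock": "Sherlock Holmes",
    "Mr. Holmes": "Sherlock Holmes",
    "Holmes": "Sherlock Holmes",
}
fused = fuse_nodes(edges, aliases)
```

With this mapping, "Baker Street" ends up connected to the single "Sherlock Holmes" node, and the two Watson edges collapse into one. Keeping the mapping explicit leaves the decision with the user, which matters for the edge cases where two similar names really are different entities.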

For reference, output of the artifacts folder including graphs etc: https://www.dropbox.com/scl/fi/hfo1nppit6tfczrypc7tb/sherlock-holmes-artifacts.zip?rlkey=t3sx7g3q48tw5fl2eek6tg3la&dl=0

eyast avatar Jul 06 '24 23:07 eyast

Which model have you used? GPT-4o?

The documentation mentions a "destructive entity resolution" step, which is not enabled by default. I believe it is not implemented at all in the currently released code base (or at least I could not find it):

[image]

It should be possible to manually resolve entities and update the graph, but a best-effort optional approach would be great!
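As a rough sketch of what a best-effort pass could look like (this is not the "destructive entity resolution" step from the documentation, and nothing here is a GraphRAG API), the snippet below proposes candidate duplicate pairs when one name's tokens are a subset of the other's after dropping honorifics. The `HONORIFICS` list and both function names are assumptions for illustration.

```python
from itertools import combinations

# Assumed, deliberately short list of titles to ignore when comparing names.
HONORIFICS = {"mr", "mrs", "ms", "dr", "sir"}


def tokens(name):
    """Lowercase the name, strip punctuation, and drop honorifics."""
    words = (w.strip(".,").lower() for w in name.split())
    return frozenset(w for w in words if w and w not in HONORIFICS)


def candidate_pairs(names):
    """Return pairs of names where one token set contains the other."""
    pairs = []
    for a, b in combinations(sorted(names), 2):
        ta, tb = tokens(a), tokens(b)
        if ta and tb and (ta <= tb or tb <= ta):
            pairs.append((a, b))
    return pairs


names = ["Sherlock", "Sherlock Holmes", "Mr. Holmes", "Holmes", "Baker Street"]
pairs = candidate_pairs(names)
```

The output is a list of suggestions, not merges. A heuristic like this would also pair "Holmes" with "Mycroft Holmes", so each candidate should still be confirmed by the user before the graph is changed.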

COPILOT-WDP avatar Jul 07 '24 10:07 COPILOT-WDP

Closing as a duplicate of #113, which tracks our entity resolution feature.

natoverse avatar Aug 06 '24 22:08 natoverse