Generated graph contains entities that do not appear in the original text

Open eyast opened this issue 1 year ago • 2 comments

I have been running the graphrag accelerator (with GPT-4o on AOAI) on the complete works of Sir Arthur Conan Doyle, without modifying the entity configuration.

Inspecting the generated graph in Gephi shows a cluster of interconnected nodes for entities that do not exist in the source text. Each node has a source_id pointing to a text_unit chunk, but when I retrieve those chunks from create_base_text_units.parquet, they contain no mention of these entities. The entity labels are:

  • Iran
  • Tehran
  • Emad Shargi
  • Evin prison
  • Siamak Nazari

There may be more, but these are the ones that caught my eye.
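
The check described above (look up each node's source_id chunks and see whether the entity name appears in them) can be sketched roughly as follows. This is only an illustration, not graphrag's own code: the function name `find_ungrounded_entities`, the `name` key, the comma-separated form of `source_id`, and the toy data are all assumptions made for the example.

```python
from typing import Iterable, List, Mapping


def find_ungrounded_entities(
    entities: Iterable[Mapping[str, str]],
    text_units: Mapping[str, str],
) -> List[str]:
    """Return names of entities that occur in none of their source chunks."""
    ungrounded = []
    for entity in entities:
        name = entity["name"]
        # Assumption: source_id may hold several chunk ids separated by commas.
        chunk_ids = [cid.strip() for cid in entity["source_id"].split(",") if cid.strip()]
        # Case-insensitive substring match; an unknown chunk id counts as no match.
        found = any(name.lower() in text_units.get(cid, "").lower() for cid in chunk_ids)
        if not found:
            ungrounded.append(name)
    return ungrounded


# Toy data standing in for the real graph nodes and text units (hypothetical).
sample_text_units = {
    "chunk-1": "Sherlock Holmes sat by the fire in Baker Street.",
    "chunk-2": "Dr. Watson returned from his rounds.",
}
sample_entities = [
    {"name": "SHERLOCK HOLMES", "source_id": "chunk-1"},
    {"name": "TEHRAN", "source_id": "chunk-1,chunk-2"},
]
print(find_ungrounded_entities(sample_entities, sample_text_units))  # ['TEHRAN']
```

With real output, the two inputs could presumably be built from the parquet files (for example via `pandas.read_parquet`), after adjusting the key names to whatever columns the files actually contain. A plain substring match will miss aliases and spelling variants, so a hit list from this check is a starting point for manual inspection rather than proof of hallucination.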

A further search in cache/entity_extraction shows that these bogus entities were returned by GPT. See 'sherlock4/cache/entity_extraction\\chat-00bc8a92b836ff7c6890f16e0d8a3bd5' for an example. Link to all the files that were stored on Blob: https://www.dropbox.com/scl/fi/nmx5nrdt5w6rx4t7chitp/sherlock_holmes_graphrag.zip?rlkey=tkbakww4cuogsmt8f94rduea9&dl=0

eyast • Jul 03 '24 05:07