Generated graph contains entities that do not appear in the original text
I have been running the graphrag accelerator (with GPT-4o on AOAI) on the complete works of Sir Arthur Conan Doyle, without modifying the entity configuration.
Inspecting the generated graph in Gephi shows a cluster of interconnected nodes that do not exist in the source text. Each node has a source_id pointing to a text_unit chunk, but when I retrieve those chunks from create_base_text_units.parquet, they do not contain any mention of these entities.
The entity labels are:
- Iran
- Tehran
- Emad Shargi
- Evin prison
- Siamak Nazari

There might be more, but these caught my eye.
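
For anyone who wants to reproduce the check, it could be sketched roughly as below. This is only an illustration on toy data, not the accelerator's own code: `find_unsupported_entities` is a hypothetical helper, and it assumes each node's source_id is a comma-separated string of text unit ids and that the chunk text has already been loaded into a dict (for example from create_base_text_units.parquet; the column names there may differ between graphrag versions).

```python
def find_unsupported_entities(entities, text_units):
    """Return labels of entities whose name appears in none of their source chunks.

    entities: iterable of (label, source_ids) pairs, where source_ids is a
        comma-separated string of text unit ids (assumed format).
    text_units: dict mapping text unit id -> chunk text.
    """
    unsupported = []
    for label, source_ids in entities:
        ids = [s.strip() for s in source_ids.split(",") if s.strip()]
        chunks = [text_units.get(i, "") for i in ids]
        # Case-insensitive substring match: a crude check that misses aliases
        # and spelling variants, so treat hits as candidates for manual review.
        if not any(label.lower() in chunk.lower() for chunk in chunks):
            unsupported.append(label)
    return unsupported


# Toy data standing in for the graph nodes and the text unit rows.
text_units = {
    "tu1": "Sherlock Holmes took his bottle from the corner of the mantelpiece.",
    "tu2": "Dr. Watson had returned from Afghanistan.",
}
entities = [
    ("SHERLOCK HOLMES", "tu1"),
    ("AFGHANISTAN", "tu1,tu2"),
    ("TEHRAN", "tu2"),
]
print(find_unsupported_entities(entities, text_units))  # ['TEHRAN']
```

On the real output, any label this returns would still need a manual look at the chunk, since an entity can legitimately be named differently from how it appears in the text.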
A further search in cache/entity_extraction shows that these bogus entities were returned by GPT. See 'sherlock4/cache/entity_extraction\\chat-00bc8a92b836ff7c6890f16e0d8a3bd5' for example.
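
A search like this could be done with a few lines of Python. The sketch below is an assumption about how one might scan the cache, not part of graphrag: `find_cache_hits` is a hypothetical helper that treats every cache file as plain text and reports which of the suspicious labels occur in it.

```python
from pathlib import Path


def find_cache_hits(cache_dir, labels):
    """Map each cache file name to the labels found in its raw content."""
    hits = {}
    for path in sorted(Path(cache_dir).iterdir()):
        if not path.is_file():
            continue
        # Read leniently; cache entries are assumed to be text (e.g. JSON).
        content = path.read_text(encoding="utf-8", errors="ignore").lower()
        found = [label for label in labels if label.lower() in content]
        if found:
            hits[path.name] = found
    return hits


labels = ["Iran", "Tehran", "Emad Shargi", "Evin prison", "Siamak Nazari"]

# Example call (the directory is the one from this issue):
# find_cache_hits("sherlock4/cache/entity_extraction", labels)
```

Plain substring matching can over-report for short labels such as "Iran", so the multi-word names are the more reliable signal.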
Link to all the files that were stored on Blob: https://www.dropbox.com/scl/fi/nmx5nrdt5w6rx4t7chitp/sherlock_holmes_graphrag.zip?rlkey=tkbakww4cuogsmt8f94rduea9&dl=0