[Bug]: Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id)
Do you need to file an issue?
- [X] I have searched the existing issues and this bug is not already filed.
- [x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [X] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
Describe the bug
The image above shows the portion of the community_report extraction prompt where the example input uses these headers:
Entities
id,entity,description
Relationships
id,source,target,description
The context data that the code actually generates has slightly different headers:
Entities
human_readable_id,title,description
Relationships
human_readable_id,source,target,description
The differences are:
- Entities: the generated context has human_readable_id instead of id, and title instead of entity.
- Relationships: the generated context has human_readable_id instead of id.
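To make the comparison concrete, here is a minimal sketch that diffs the two sets of headers listed above, column by column. The dictionary layout and the `header_mismatches` helper are my own illustration, not part of GraphRAG.

```python
# Headers as they appear in the prompt examples (copied from this issue).
PROMPT_HEADERS = {
    "Entities": ["id", "entity", "description"],
    "Relationships": ["id", "source", "target", "description"],
}

# Headers as they appear in the generated context (copied from this issue).
GENERATED_HEADERS = {
    "Entities": ["human_readable_id", "title", "description"],
    "Relationships": ["human_readable_id", "source", "target", "description"],
}


def header_mismatches(expected, actual):
    """Hypothetical helper: map each section to the (expected, actual)
    column pairs that differ when compared position by position."""
    mismatches = {}
    for section, expected_cols in expected.items():
        actual_cols = actual.get(section, [])
        diffs = [(e, a) for e, a in zip(expected_cols, actual_cols) if e != a]
        if diffs:
            mismatches[section] = diffs
    return mismatches


if __name__ == "__main__":
    for section, diffs in header_mismatches(PROMPT_HEADERS, GENERATED_HEADERS).items():
        for expected_col, actual_col in diffs:
            print(f"{section}: prompt has {expected_col!r}, context has {actual_col!r}")
```

Running it reports two mismatched columns for Entities and one for Relationships, which matches the list above.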
The bug can be fixed either by modifying the code or by changing the headers in the prompt to match the ones the code generates.
A code fix would need changes in the two verbs that produce these headers:
https://github.com/microsoft/graphrag/blob/c749fe2a151b9e8259bf4fef2f6c45cf82f1181e/graphrag/index/verbs/graph/report/prepare_community_reports_nodes.py#L37
and
https://github.com/microsoft/graphrag/blob/c749fe2a151b9e8259bf4fef2f6c45cf82f1181e/graphrag/index/verbs/graph/report/prepare_community_reports_edges.py#L38
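As a rough idea of what the code-side option could look like, the sketch below renames the header row of a CSV context block so that it matches the prompt. This is not the actual GraphRAG code: the `rename_csv_header` function, the `RENAMES` mapping, and the sample row are assumptions made for illustration, and the real fix might instead rename the dataframe columns inside the verbs linked above.

```python
import csv
import io

# Assumed mapping from generated column names to the names used in the prompt.
RENAMES = {"human_readable_id": "id", "title": "entity"}


def rename_csv_header(csv_text, renames):
    """Hypothetical helper: rewrite only the header row of a CSV string,
    leaving every data row unchanged."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    rows[0] = [renames.get(col, col) for col in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    # Made-up sample row, not taken from the attached cache file.
    generated = "human_readable_id,title,description\n0,ALICE,A person\n"
    print(rename_csv_header(generated, RENAMES))
```

The other option, editing the prompt examples to say human_readable_id and title, needs no code change at all, so it may be the lower-risk fix.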
Steps to reproduce
You can look into the cache files. I am attaching one here.
example_community_report_cache.txt
Expected Behavior
No response
GraphRAG Config Used
# Paste your config here
Logs and screenshots
No response
Additional Information
- GraphRAG Version:
- Operating System:
- Python Version:
- Related Issues:
Nice find and thorough reporting @ksachdeva. @AlonsoGuevara can you confirm and update the df/prompt? I'm not aware of any direct bugs in search results due to this, so I suspect the LLM is smart enough to recognize the intent of the column name, but it would be sensible to be as precise as possible.