[Bug]: Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id)

Open • ksachdeva opened this issue 1 year ago • 1 comment

Do you need to file an issue?

[X] I have searched the existing issues and this bug is not already filed.
[x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
[X] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

The image above shows the portion of the community_report extraction prompt that lists the example headers:

Entities

id,entity,description


Relationships

id,source,target,description

The context data that is actually generated, however, ends up with slightly different headers:

Entities

human_readable_id,title,description


Relationships

human_readable_id,source,target,description

The differences are:

For Entities - the generated context has human_readable_id instead of id, and title instead of entity.
For Relationships - the generated context has human_readable_id instead of id.
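To make the mismatch concrete, here is a minimal sketch that compares the two sets of headers column by column. The header strings are copied from this issue. The dictionary names and the `header_mismatches` helper are hypothetical and are not part of GraphRAG.

```python
# Headers as they appear in the prompt examples and in the generated context,
# copied from the issue text. The variable names are illustrative only.
PROMPT_HEADERS = {
    "Entities": "id,entity,description",
    "Relationships": "id,source,target,description",
}
CONTEXT_HEADERS = {
    "Entities": "human_readable_id,title,description",
    "Relationships": "human_readable_id,source,target,description",
}


def header_mismatches(prompt_header: str, context_header: str) -> list[tuple[str, str]]:
    """Return (prompt_column, context_column) pairs that differ at the same position."""
    prompt_cols = prompt_header.split(",")
    context_cols = context_header.split(",")
    return [(p, c) for p, c in zip(prompt_cols, context_cols) if p != c]


for section in PROMPT_HEADERS:
    # Print the differing column pairs for each section.
    print(section, header_mismatches(PROMPT_HEADERS[section], CONTEXT_HEADERS[section]))
```

Run against the headers above, this should report two differing columns for Entities and one for Relationships.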

The bug can be fixed either by modifying the code or by making the headers in the prompt match the ones generated by the code.

On the code side, the mismatch originates in these two verbs, which would need to be modified:

https://github.com/microsoft/graphrag/blob/c749fe2a151b9e8259bf4fef2f6c45cf82f1181e/graphrag/index/verbs/graph/report/prepare_community_reports_nodes.py#L37

and

https://github.com/microsoft/graphrag/blob/c749fe2a151b9e8259bf4fef2f6c45cf82f1181e/graphrag/index/verbs/graph/report/prepare_community_reports_edges.py#L38
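As a rough illustration of the code-side option, the sketch below renames the header row of a CSV context string so it matches the prompt examples. It uses only the Python standard library and does not reproduce the linked verbs. `COLUMN_RENAMES` and `align_headers` are hypothetical names, and the sample row is made up for the example.

```python
import csv
import io

# Hypothetical mapping from the generated column names to the names used in
# the prompt examples (taken from the headers quoted in this issue).
COLUMN_RENAMES = {"human_readable_id": "id", "title": "entity"}


def align_headers(context_csv: str, renames: dict[str, str] = COLUMN_RENAMES) -> str:
    """Return context_csv with its header row renamed; data rows are unchanged."""
    rows = list(csv.reader(io.StringIO(context_csv)))
    if not rows:
        return context_csv  # nothing to rename in an empty table
    rows[0] = [renames.get(col, col) for col in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Illustrative usage with a made-up entity row.
sample = "human_readable_id,title,description\n0,ACME,A company\n"
print(align_headers(sample))
```

The other option, editing the prompt examples to use human_readable_id and title, needs no code change at all, which may be simpler if other parts of the pipeline rely on the current column names.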

Steps to reproduce

You can look into the cache files. I am attaching one here.

example_community_report_cache.txt

Expected Behavior

No response

GraphRAG Config Used

# Paste your config here

Logs and screenshots

No response

Additional Information

GraphRAG Version:
Operating System:
Python Version:
Related Issues:

Aug 07 '24 14:08 ksachdeva

Nice find and thorough reporting @ksachdeva. @AlonsoGuevara can you confirm and update the df/prompt? I'm not aware of any direct bugs in search results due to this, so I suspect the LLM is smart enough to recognize the intent of the column name, but it would be sensible to be as precise as possible.

Aug 09 '24 00:08 natoverse