The EM and Precision are too low.
gpt-4.1-mini/quality/Dalk: Accuracy = 0.3936
gpt-4.1-mini/Popqa/RAPTOR: Acc = 0.0222, EM = 0.0000, F1 = 0.0189, P = 0.0143, R = 0.0556
gpt-4o/multihop-rag/RAPTOR: Acc = 0.5814, EM = 0.0012, F1 = 0.0263, P = 0.0146, R = 0.3602
gpt-4o/multihop-rag/Dalk: Acc = 0.6491, EM = 0.0814, F1 = 0.1258, P = 0.1080, R = 0.3347
gpt-4o/multihop-rag/HippoRAG: Acc = 0.6463, EM = 0.0000, F1 = 0.0213, P = 0.0111, R = 0.3550
gpt-4o/quality/RAPTOR: Accuracy = 0.4752
gpt-4-turbo/multihop-rag/default: Acc = 0.4730, EM = 0.1166, F1 = 0.1752, P = 0.1585, R = 0.2452
Here are some results I have. Acc and Recall are close to the table in the paper, but EM and Precision are too low. Do you have any suggestions for this issue? For example, should we limit the output length in the prompts, since the generated answers are currently quite redundant?
Thanks for your question. I think this is a common issue for all graph-based RAG methods; indeed, some methods use a specific strategy to address it. In our paper and project, we directly use the evaluation code provided by HippoRAG. You could refer to other papers to see how they evaluate these metrics, for example by truncating or further post-processing the output so that only the key information is kept. Reducing the redundancy this way yields "prettier" metrics.
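For concreteness, here is a minimal sketch of the SQuAD-style answer normalization and token-level scoring commonly used in QA evaluation (and, I believe, in HippoRAG's evaluation code). The function names are illustrative, not from this repo. It also shows why a verbose answer hurts EM and Precision while leaving Recall intact:

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize_answer(prediction) == normalize_answer(gold)

def f1_score(prediction: str, gold: str) -> tuple[float, float, float]:
    """Token-level F1, precision, recall. A long, redundant prediction
    still contains the gold tokens (high recall) but has many extra
    tokens, so precision, F1, and EM collapse."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0, 0.0, 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, precision, recall
```

For example, `f1_score("The answer to your question is Paris, the capital of France.", "Paris")` gives recall 1.0 but very low precision, which matches the pattern in your numbers.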
Thanks for your suggestions! The post-processing makes sense. However, I think limiting the length of the output is better. In your template:
"""---Role---
You are a helpful assistant responding to questions about data in the tables provided.
---Goal---
Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge.
If you don't know the answer, just say so. Do not make anything up.
Do not include information where the supporting evidence for it is not provided.
---Target response length and format---
{response_type}
---Data tables---
{context_data}
Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown.
"""
it mentions a "Target response length and format" section, but {response_type} does not actually contain any length constraint.
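As a hypothetical illustration (the constraint wording and variable names below are my own assumptions, not the project's defaults), filling {response_type} with an explicit length cap would make that section of the template do real work:

```python
# Hypothetical sketch: give {response_type} an explicit length cap so the
# "Target response length and format" section actually constrains output.
# TEMPLATE abbreviates the prompt quoted above.
TEMPLATE = """---Role---
You are a helpful assistant responding to questions about data in the tables provided.
---Target response length and format---
{response_type}
---Data tables---
{context_data}
"""

SHORT_ANSWER = (
    "A single short phrase of at most 10 words. "
    "Return only the answer itself, with no explanation or extra sections."
)

prompt = TEMPLATE.format(
    response_type=SHORT_ANSWER,
    context_data="<retrieved tables here>",
)
print(prompt)
```

Since the gold tokens would usually still appear in a short answer, this should raise EM and Precision without costing much Recall.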