MetaGPT
Feat: Add RAG Benchmark method
Features
- New MetaGPT-RAG evaluation module, covering ROUGE-L, BLEU, Recall, Hit Rate, MRR, and other evaluation metrics (a minimal metric sketch follows this list).
- Makes it easy to review how the different RAG modules perform.
- Supports custom evaluation datasets; follow the provided sample to adapt your data to the expected structure.
- Added reranker support for Cohere and FlagEmbedding.
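As a rough illustration of how retrieval metrics such as Hit Rate and MRR are typically computed (a minimal sketch with hypothetical input shapes and function names, not the actual MetaGPT-RAG benchmark API):

```python
from typing import Dict, List, Set


def hit_rate(retrieved: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 10) -> float:
    # Fraction of queries whose top-k retrieved ids contain at least one ground-truth id.
    hits = sum(
        any(doc_id in relevant[q] for doc_id in docs[:k])
        for q, docs in retrieved.items()
    )
    return hits / len(retrieved)


def mrr(retrieved: Dict[str, List[str]], relevant: Dict[str, Set[str]]) -> float:
    # Mean Reciprocal Rank: average of 1 / rank of the first relevant id (0 if none found).
    total = 0.0
    for q, docs in retrieved.items():
        for rank, doc_id in enumerate(docs, start=1):
            if doc_id in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(retrieved)


# Toy usage: q1 hits at rank 2, q2 at rank 3 -> hit_rate 1.0, MRR (0.5 + 1/3) / 2
retrieved = {"q1": ["d3", "d1"], "q2": ["d7", "d2", "d9"]}
relevant = {"q1": {"d1"}, "q2": {"d9"}}
print(hit_rate(retrieved, relevant), mrr(retrieved, relevant))
```

ROUGE-L and BLEU, by contrast, score the generated answer text against the reference answer rather than the retrieved ids.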
Codecov Report
Attention: Patch coverage is 8.41121% with 98 lines in your changes missing coverage. Please review.
Project coverage is 70.26%. Comparing base (933d6c1) to head (debe6b0). Report is 22 commits behind head on main.
| Files | Patch % | Lines |
|---|---|---|
| metagpt/rag/benchmark/base.py | 0.00% | 86 Missing :warning: |
| metagpt/rag/factories/ranker.py | 16.66% | 10 Missing :warning: |
| metagpt/rag/benchmark/__init__.py | 0.00% | 2 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #1193 +/- ##
==========================================
- Coverage 70.60% 70.26% -0.34%
==========================================
Files 314 316 +2
Lines 18714 18821 +107
==========================================
+ Hits 13213 13225 +12
- Misses 5501 5596 +95
:umbrella: View full report in Codecov by Sentry.
/review
PR Review
| Category | Feedback |
|---|---|
| ⏱️ Estimated effort to review [1-5] | 4, due to the extensive amount of new code across multiple files, involving complex functionalities such as data retrieval, ranking, and evaluation metrics. The PR integrates new features and configurations which require careful review to ensure correctness and performance. |
| 🧪 Relevant tests | No |
| 🔍 Possible issues | Possible Bug: The method … <br> Performance Concern: The extensive use of synchronous file I/O operations and potentially large data processing in loops could lead to performance bottlenecks, especially noticeable when processing large datasets or when used in a high-latency network environment (see the async write sketch after this table). |
| 🔒 Security concerns | No |
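On the performance concern above: a common mitigation is to move blocking JSON writes off the event loop, e.g. via `asyncio.to_thread`. This is only a sketch under the assumption that a blocking `write_json_file`-style helper is called from async code; the actual call sites in the PR may differ.

```python
import asyncio
import json
from pathlib import Path


def write_json_file(path: str, data, encoding: str = "utf-8") -> None:
    # Blocking JSON write, standing in for the project's helper function.
    Path(path).write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding=encoding)


async def write_json_file_async(path: str, data, encoding: str = "utf-8") -> None:
    # Run the blocking write in a worker thread so the event loop is not stalled.
    await asyncio.to_thread(write_json_file, path, data, encoding)
```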
Code feedback:
| relevant file | suggestion | relevant line |
|---|---|---|
| examples/rag_bm.py | Consider implementing more granular exception handling in the … (see the exception-handling sketch after this table). | `except Exception as e:` |
| examples/rag_bm.py | To enhance performance, consider using asynchronous file operations or a more efficient data handling mechanism to manage I/O operations, especially when loading or writing large datasets in the … | `write_json_file((EXAMPLE_BENCHMARK_PATH / dataset.name / "bm_result.json").as_posix(), results, "utf-8")` |
| metagpt/rag/benchmark/base.py | Optimize the … | `bleu_avg, bleu1, bleu2, bleu3, bleu4 = self.bleu_score(response, reference)` |
| examples/rag_bm.py | Refactor the … | `async def rag_evaluate_pipeline(self, dataset_name: list[str] = ["all"]):` |
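On the first suggestion (more granular exception handling), one possible shape for this, sketched with hypothetical helpers rather than the PR's actual code:

```python
import json
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def load_dataset(path: str) -> dict:
    # Hypothetical loader standing in for the benchmark's dataset-loading step.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def safe_load(path: str) -> Optional[dict]:
    try:
        return load_dataset(path)
    except FileNotFoundError:
        logger.error("Dataset file not found: %s", path)
    except json.JSONDecodeError as exc:
        logger.error("Malformed JSON in %s: %s", path, exc)
    except Exception:
        # Last-resort handler: log with traceback instead of silently swallowing the error.
        logger.exception("Unexpected error while loading %s", path)
    return None
```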
✨ Review tool usage guide:
Overview:
The review tool scans the PR code changes and generates a PR review that includes several types of feedback, such as possible PR issues, security threats, and relevant tests in the PR. More feedback categories can be added by configuring the tool.
The tool can be triggered automatically every time a new PR is opened, or can be invoked manually by commenting on any PR.
- When commenting, to edit configurations related to the review tool (pr_reviewer section), use the following template:
/review --pr_reviewer.some_config1=... --pr_reviewer.some_config2=...
- With a configuration file, use the following template:
[pr_reviewer]
some_config1=...
some_config2=...
See the review usage page for a comprehensive guide on using this tool.
lgtm
In the PR submitted above, there is a slight error in the MRR calculation of the benchmark metrics. I have submitted another PR to fix this bug, and all the results have been recalculated after the fix: https://github.com/geekan/MetaGPT/pull/1228
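For reference, the standard definition of MRR (the exact correction made in the follow-up PR may of course differ from this) is:

$$
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
$$

where $\mathrm{rank}_i$ is the position of the first relevant document retrieved for query $i$, and queries with no relevant document retrieved contribute $0$.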