
Feat: Add RAG Benchmark method

Open · YangQianli92 opened this issue 1 year ago · 1 comment

Features

  • New MetaGPT-RAG evaluation module, covering ROUGE-L, BLEU, Recall, Hit Rate, MRR, and other evaluation metrics (a minimal sketch of Hit Rate and MRR follows this list).
  • Makes it easy to inspect the effect of the different RAG modules.
  • Supports custom evaluation datasets; follow the provided sample to structure your dataset accordingly.
  • Added reranker support for Cohere and FlagEmbedding.
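
For reference, here is a minimal sketch of how Hit Rate and MRR can be computed over ranked retrieval results. This is illustrative only; the function names and the `retrieved_ids` / `ground_truth_ids` arguments are placeholders, not this module's actual API.

```python
from typing import Sequence


def hit_rate(retrieved_ids: Sequence[str], ground_truth_ids: Sequence[str]) -> float:
    """1.0 if any ground-truth document appears in the retrieved list, else 0.0."""
    return 1.0 if any(doc_id in ground_truth_ids for doc_id in retrieved_ids) else 0.0


def mrr(retrieved_ids: Sequence[str], ground_truth_ids: Sequence[str]) -> float:
    """Reciprocal rank of the first retrieved document that matches the ground truth."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in ground_truth_ids:
            return 1.0 / rank
    return 0.0
```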

YangQianli92 · Apr 15 '24 07:04

Codecov Report

Attention: Patch coverage is 8.41121%, with 98 lines in your changes missing coverage. Please review.

Project coverage is 70.26%. Comparing base (933d6c1) to head (debe6b0). Report is 22 commits behind head on main.

Files                               Patch %   Lines
metagpt/rag/benchmark/base.py       0.00%     86 Missing :warning:
metagpt/rag/factories/ranker.py     16.66%    10 Missing :warning:
metagpt/rag/benchmark/__init__.py   0.00%     2 Missing :warning:


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1193      +/-   ##
==========================================
- Coverage   70.60%   70.26%   -0.34%     
==========================================
  Files         314      316       +2     
  Lines       18714    18821     +107     
==========================================
+ Hits        13213    13225      +12     
- Misses       5501     5596      +95     


codecov-commenter · Apr 15 '24 07:04

/review

geekan · Apr 22 '24 07:04

PR Review

⏱️ Estimated effort to review [1-5]

4, due to the extensive amount of new code across multiple files, covering complex functionality such as data retrieval, ranking, and evaluation metrics. The PR introduces new features and configurations that require careful review to ensure correctness and performance.

🧪 Relevant tests

No

🔍 Possible issues

Possible Bug: The method rag_evaluate_single in rag_bm.py might return incorrect metrics if an exception is thrown and caught. The method catches all exceptions and returns a default metric set which might not accurately reflect the error state or provide meaningful feedback for debugging.

Performance Concern: The extensive use of synchronous file I/O operations and potentially large data processing in loops could lead to performance bottlenecks, especially noticeable when processing large datasets or when used in a high-latency network environment.

🔒 Security concerns

No

Code feedback:
relevant file: examples/rag_bm.py
suggestion:

Consider implementing more granular exception handling in the rag_evaluate_single method to differentiate between different types of errors (e.g., network issues, data format errors) and handle them appropriately. This will improve the robustness and debuggability of the module. [important]

relevant line: except Exception as e:
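
As an illustration only (the exception types, logger, and the `DEFAULT_METRICS` / `evaluate_fn` names below are assumptions, not the code in this PR), granular handling could look roughly like this:

```python
import json
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-in for the default metric set the method currently returns on failure.
DEFAULT_METRICS = {"bleu": 0.0, "rouge_l": 0.0, "recall": 0.0, "hit_rate": 0.0, "mrr": 0.0, "error": None}


def evaluate_single_safely(evaluate_fn, query: str) -> dict:
    """Run one evaluation and tag the fallback metrics with the failure category."""
    try:
        return evaluate_fn(query)
    except (ConnectionError, TimeoutError) as e:
        logger.warning("Network error while evaluating %r: %s", query, e)
        return {**DEFAULT_METRICS, "error": "network"}
    except (KeyError, json.JSONDecodeError) as e:
        logger.error("Malformed data while evaluating %r: %s", query, e)
        return {**DEFAULT_METRICS, "error": "data_format"}
```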

relevant file: examples/rag_bm.py
suggestion:

To enhance performance, consider using asynchronous file operations or a more efficient data handling mechanism to manage I/O operations, especially when loading or writing large datasets in the rag_evaluate_pipeline method. [important]

relevant line: write_json_file((EXAMPLE_BENCHMARK_PATH / dataset.name / "bm_result.json").as_posix(), results, "utf-8")
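
One low-effort option, sketched below under the assumption that `write_json_file(path, data, encoding)` stays synchronous, is to offload the blocking write to a worker thread with `asyncio.to_thread`; the `save_results` wrapper and the writer body are hypothetical stand-ins, not the project's code.

```python
import asyncio
import json
from pathlib import Path


def write_json_file(path: str, data, encoding: str = "utf-8") -> None:
    """Stand-in with the same call shape as the project's synchronous JSON writer."""
    Path(path).write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding=encoding)


async def save_results(result_path: Path, results: list[dict]) -> None:
    # Offload the blocking write so the event loop stays responsive during the benchmark run.
    await asyncio.to_thread(write_json_file, result_path.as_posix(), results, "utf-8")
```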

relevant file: metagpt/rag/benchmark/base.py
suggestion:

Optimize the compute_metric method by caching results of expensive operations like bleu_score and rougel_score if the same responses and references are being evaluated multiple times. This can significantly reduce computation time in scenarios with repetitive data. [medium]

relevant line: bleu_avg, bleu1, bleu2, bleu3, bleu4 = self.bleu_score(response, reference)
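
A sketch of the caching idea, using `functools.lru_cache` keyed on the (response, reference) string pair. The `expensive_pairwise_metric` below is a toy stand-in for scorers like `bleu_score` or `rougel_score`, not their real implementation.

```python
from functools import lru_cache


def expensive_pairwise_metric(response: str, reference: str) -> float:
    """Toy token-overlap ratio, used only to keep this sketch self-contained."""
    resp_tokens, ref_tokens = set(response.split()), set(reference.split())
    return len(resp_tokens & ref_tokens) / max(len(ref_tokens), 1)


@lru_cache(maxsize=4096)
def cached_metric(response: str, reference: str) -> float:
    """Memoize the score; lru_cache works here because both arguments are hashable strings."""
    return expensive_pairwise_metric(response, reference)
```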

relevant file: examples/rag_bm.py
suggestion:

Refactor the rag_evaluate_pipeline method to break down its functionality into smaller, more manageable functions. This improves modularity and makes the code easier to maintain and test. [medium]

relevant line: async def rag_evaluate_pipeline(self, dataset_name: list[str] = ["all"]):
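
A structural sketch of that decomposition follows; all helper names are hypothetical and the bodies are trivial placeholders, shown only to illustrate the shape of the refactor.

```python
async def load_dataset(name: str) -> list[dict]:
    """Hypothetical loader returning the evaluation samples for one dataset."""
    return []


async def evaluate_dataset(samples: list[dict]) -> dict:
    """Hypothetical per-dataset evaluation that aggregates metrics over samples."""
    return {"n_samples": len(samples)}


async def save_dataset_results(name: str, results: dict) -> None:
    """Hypothetical writer for the aggregated results of one dataset."""
    print(name, results)


async def rag_evaluate_pipeline(dataset_names: list[str]) -> None:
    """Keep the pipeline itself a thin composition of small, testable steps."""
    for name in dataset_names:
        samples = await load_dataset(name)
        results = await evaluate_dataset(samples)
        await save_dataset_results(name, results)
```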


✨ Review tool usage guide:

Overview: The review tool scans the PR code changes and generates a PR review that includes several types of feedback, such as possible PR issues, security threats, and relevant tests in the PR. More feedback can be added by configuring the tool.

The tool can be triggered automatically every time a new PR is opened, or can be invoked manually by commenting on any PR.

  • When commenting, to edit configurations related to the review tool (pr_reviewer section), use the following template:
/review --pr_reviewer.some_config1=... --pr_reviewer.some_config2=...
[pr_reviewer]
some_config1=...
some_config2=...

See the review usage page for a comprehensive guide on using this tool.

qodo-merge-pro[bot] · Apr 22 '24 07:04

lgtm

better629 · Apr 24 '24 10:04

In the PR above, there is a slight error in the MRR calculation of the benchmark metrics. I have submitted another PR to fix this bug, and all results have been recalculated after the fix: https://github.com/geekan/MetaGPT/pull/1228

YangQianli92 · Apr 26 '24 14:04