
Feat: Add RAG Benchmark method

Open · YangQianli92 opened this issue 1 year ago · 1 comment

Features

  • New MetaGPT-RAG evaluation module, covering ROUGE-L, BLEU, Recall, Hit Rate, MRR, and other evaluation metrics (a minimal sketch of Hit Rate and MRR follows this list).
  • Makes it easy to inspect the effect of the different RAG modules.
  • Supports custom evaluation datasets; follow the provided sample to structure your dataset accordingly.
  • Added reranker support for Cohere and FlagEmbedding.
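
For reference, here is a minimal sketch of how Hit Rate and MRR can be computed over ranked retrieval results. This is illustrative only; the function names and the `retrieved_ids` / `ground_truth_ids` arguments are placeholders, not this module's actual API.

```python
from typing import Sequence


def hit_rate(retrieved_ids: Sequence[str], ground_truth_ids: Sequence[str]) -> float:
    """1.0 if any ground-truth document appears in the retrieved list, else 0.0."""
    return 1.0 if any(doc_id in ground_truth_ids for doc_id in retrieved_ids) else 0.0


def mrr(retrieved_ids: Sequence[str], ground_truth_ids: Sequence[str]) -> float:
    """Reciprocal rank of the first retrieved document that matches the ground truth."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in ground_truth_ids:
            return 1.0 / rank
    return 0.0
```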

YangQianli92 · Apr 15 '24 07:04

Codecov Report

Attention: Patch coverage is 8.41121%, with 98 lines in your changes missing coverage. Please review.

Project coverage is 70.26%. Comparing base (933d6c1) to head (debe6b0). Report is 22 commits behind head on main.

Files                               Patch %   Lines
metagpt/rag/benchmark/base.py       0.00%     86 Missing :warning:
metagpt/rag/factories/ranker.py     16.66%    10 Missing :warning:
metagpt/rag/benchmark/__init__.py   0.00%     2 Missing :warning:


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1193      +/-   ##
==========================================
- Coverage   70.60%   70.26%   -0.34%     
==========================================
  Files         314      316       +2     
  Lines       18714    18821     +107     
==========================================
+ Hits        13213    13225      +12     
- Misses       5501     5596      +95     


codecov-commenter · Apr 15 '24 07:04

/review

geekan · Apr 22 '24 07:04

PR Review

⏱️ Estimated effort to review [1-5]

4, due to the extensive amount of new code across multiple files, covering complex functionality such as data retrieval, ranking, and evaluation metrics. The PR introduces new features and configurations that require careful review to ensure correctness and performance.

🧪 Relevant tests

No

🔍 Possible issues

Possible Bug: The method rag_evaluate_single in rag_bm.py might return incorrect metrics if an exception is thrown and caught. The method catches all exceptions and returns a default metric set which might not accurately reflect the error state or provide meaningful feedback for debugging.

Performance Concern: The extensive use of synchronous file I/O operations and potentially large data processing in loops could lead to performance bottlenecks, especially noticeable when processing large datasets or when used in a high-latency network environment.

🔒 Security concerns

No

Code feedback:
relevant file: examples/rag_bm.py
suggestion:

Consider implementing more granular exception handling in the rag_evaluate_single method to differentiate between different types of errors (e.g., network issues, data format errors) and handle them appropriately. This will improve the robustness and debuggability of the module. [important]

relevant line: except Exception as e:
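
As an illustration only (the exception types, logger, and the `DEFAULT_METRICS` / `evaluate_fn` names below are assumptions, not the code in this PR), granular handling could look roughly like this:

```python
import json
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-in for the default metric set the method currently returns on failure.
DEFAULT_METRICS = {"bleu": 0.0, "rouge_l": 0.0, "recall": 0.0, "hit_rate": 0.0, "mrr": 0.0, "error": None}


def evaluate_single_safely(evaluate_fn, query: str) -> dict:
    """Run one evaluation and tag the fallback metrics with the failure category."""
    try:
        return evaluate_fn(query)
    except (ConnectionError, TimeoutError) as e:
        logger.warning("Network error while evaluating %r: %s", query, e)
        return {**DEFAULT_METRICS, "error": "network"}
    except (KeyError, json.JSONDecodeError) as e:
        logger.error("Malformed data while evaluating %r: %s", query, e)
        return {**DEFAULT_METRICS, "error": "data_format"}
```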

relevant file: examples/rag_bm.py
suggestion:

To enhance performance, consider using asynchronous file operations or a more efficient data handling mechanism to manage I/O operations, especially when loading or writing large datasets in the rag_evaluate_pipeline method. [important]

relevant line: write_json_file((EXAMPLE_BENCHMARK_PATH / dataset.name / "bm_result.json").as_posix(), results, "utf-8")
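
One low-effort option, sketched below under the assumption that `write_json_file(path, data, encoding)` stays synchronous, is to offload the blocking write to a worker thread with `asyncio.to_thread`; the `save_results` wrapper and the writer body are hypothetical stand-ins, not the project's code.

```python
import asyncio
import json
from pathlib import Path


def write_json_file(path: str, data, encoding: str = "utf-8") -> None:
    """Stand-in with the same call shape as the project's synchronous JSON writer."""
    Path(path).write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding=encoding)


async def save_results(result_path: Path, results: list[dict]) -> None:
    # Offload the blocking write so the event loop stays responsive during the benchmark run.
    await asyncio.to_thread(write_json_file, result_path.as_posix(), results, "utf-8")
```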

relevant file: metagpt/rag/benchmark/base.py
suggestion:

Optimize the compute_metric method by caching results of expensive operations like bleu_score and rougel_score if the same responses and references are being evaluated multiple times. This can significantly reduce computation time in scenarios with repetitive data. [medium]

relevant line: bleu_avg, bleu1, bleu2, bleu3, bleu4 = self.bleu_score(response, reference)
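
A sketch of the caching idea, using `functools.lru_cache` keyed on the (response, reference) string pair. The `expensive_pairwise_metric` below is a toy stand-in for scorers like `bleu_score` or `rougel_score`, not their real implementation.

```python
from functools import lru_cache


def expensive_pairwise_metric(response: str, reference: str) -> float:
    """Toy token-overlap ratio, used only to keep this sketch self-contained."""
    resp_tokens, ref_tokens = set(response.split()), set(reference.split())
    return len(resp_tokens & ref_tokens) / max(len(ref_tokens), 1)


@lru_cache(maxsize=4096)
def cached_metric(response: str, reference: str) -> float:
    """Memoize the score; lru_cache works here because both arguments are hashable strings."""
    return expensive_pairwise_metric(response, reference)
```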

relevant file: examples/rag_bm.py
suggestion:

Refactor the rag_evaluate_pipeline method to break down its functionality into smaller, more manageable functions. This improves modularity and makes the code easier to maintain and test. [medium]

relevant line: async def rag_evaluate_pipeline(self, dataset_name: list[str] = ["all"]):
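
A structural sketch of that decomposition follows; all helper names are hypothetical and the bodies are trivial placeholders, shown only to illustrate the shape of the refactor.

```python
async def load_dataset(name: str) -> list[dict]:
    """Hypothetical loader returning the evaluation samples for one dataset."""
    return []


async def evaluate_dataset(samples: list[dict]) -> dict:
    """Hypothetical per-dataset evaluation that aggregates metrics over samples."""
    return {"n_samples": len(samples)}


async def save_dataset_results(name: str, results: dict) -> None:
    """Hypothetical writer for the aggregated results of one dataset."""
    print(name, results)


async def rag_evaluate_pipeline(dataset_names: list[str]) -> None:
    """Keep the pipeline itself a thin composition of small, testable steps."""
    for name in dataset_names:
        samples = await load_dataset(name)
        results = await evaluate_dataset(samples)
        await save_dataset_results(name, results)
```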


✨ Review tool usage guide:

Overview: The review tool scans the PR code changes and generates a PR review that includes several types of feedback, such as possible PR issues, security threats, and relevant tests in the PR. More feedback can be added by configuring the tool.

The tool can be triggered automatically every time a new PR is opened, or can be invoked manually by commenting on any PR.

  • When commenting, to edit configurations related to the review tool (pr_reviewer section), use the following template:
/review --pr_reviewer.some_config1=... --pr_reviewer.some_config2=...
[pr_reviewer]
some_config1=...
some_config2=...

See the review usage page for a comprehensive guide on using this tool.

qodo-merge-pro[bot] · Apr 22 '24 07:04

lgtm

better629 · Apr 24 '24 10:04

In the PR above, there is a slight error in the MRR calculation of the benchmark metrics. I have submitted another PR to fix this bug, and all results have been recalculated after the fix: https://github.com/geekan/MetaGPT/pull/1228

YangQianli92 · Apr 26 '24 14:04