[Feature Request] Refactor `BaseBenchmark`
Required prerequisites
- [X] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
- [ ] Consider asking first in a Discussion.
Motivation
As we integrate more benchmarks into CAMEL, the current `BaseBenchmark` should be refactored to support a wider variety of benchmark types. The `run` method in the current `BaseBenchmark` is one example of an interface that is too specific.
Solution
No response
Alternatives
No response
Additional context
No response
Hi @Wendong-Fan, I'm interested in working on this issue. As I’ve been contributing to the Coderag-Benchmark integration, I’m now quite familiar with the different benchmark structures in the codebase.
Could you help clarify the goal of this refactor?
- Is the goal mainly to make `BaseBenchmark.run()` more compatible with implementations like Gaia, APIBank, and RagBench? Right now the signatures differ, but everything works in practice.
- Benchmarks follow different patterns: e.g., Gaia uses `__init__()` to pass the retriever, RagBench uses `run()`, and APIBank/nexus/APIBench do not need one. Given this diversity, is the aim simply to make the base class flexible enough to support these variations? For example, would including a retriever parameter in both `__init__()` and `run()` be a reasonable direction, to align all the current benchmarks?
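To make the last question concrete, here is a minimal sketch of a class that accepts a retriever in both `__init__()` and `run()`. Everything in it is hypothetical: `RetrieverAwareBenchmark`, `resolve_retriever`, and the idea that a retriever is a plain callable are assumptions for illustration only, not the current CAMEL API.

```python
from typing import Any, Callable, List, Optional

# Assumption: a retriever is modelled as a callable from query to text.
Retriever = Callable[[str], str]


class RetrieverAwareBenchmark:
    """Hypothetical stand-in for a base class; not CAMEL's BaseBenchmark."""

    def __init__(self, retriever: Optional[Retriever] = None) -> None:
        # Gaia-style: the retriever is fixed at construction time.
        self.retriever = retriever

    def resolve_retriever(
        self, retriever: Optional[Retriever] = None
    ) -> Optional[Retriever]:
        # RagBench-style: a retriever passed to run() takes precedence
        # over the one given to __init__().
        return retriever if retriever is not None else self.retriever

    def run(
        self, queries: List[str], retriever: Optional[Retriever] = None
    ) -> List[Any]:
        active = self.resolve_retriever(retriever)
        if active is None:
            # APIBank-style: no retriever is needed at all.
            return list(queries)
        return [active(query) for query in queries]
```

With this shape, all three existing patterns would fit one signature, at the cost of carrying a parameter that some benchmarks never use.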
That would be awesome!! c.c @Wendong-Fan for commenting
Hi @boerz-coding ,
Thanks for your input, and apologies for the delay in my reply.
Regarding the BaseBenchmark refactoring:
- Scope: The goal is broader than just refactoring `.run()`; we aim to improve other base class functions too. I highlighted `.run()` as an example because its current specific signature struggles to accommodate diverse benchmark requirements.
- Objectives: Our main aims are:
  a. To make the base class significantly more flexible to support various benchmark implementations.
  b. To standardize interfaces across benchmark modules. This could involve defining a more general `EvalResult` model in `BaseBenchmark`, as discussed in [this PR comment](https://github.com/camel-ai/camel/pull/2293#discussion_r2072368289) (cc @sunchengxuanivy).
Here are some detailed suggestions for refactoring, but our implementation should not be limited to these points:
- `__init__`:
  - Make `save_to: Optional[str] = None`.
  - Update docstrings: Clarify `data_dir` is a suggested local path; subclasses might use other mechanisms (like caches).
  - Keep `self._results: List[Dict[str, Any]] = []`, but acknowledge subclasses might manage results differently.
- `download`:
  - Keep as `@abstractmethod`. Subclasses must implement their specific data acquisition logic.
- `load`:
  - Remove `@abstractmethod`. Provide a default implementation (perhaps raising `NotImplementedError` or doing nothing).
  - Remove `force_download` parameter from the base signature.
  - Remove the expectation that `load` populates `self._data` in a specific format. Subclasses should be fully responsible for their internal data loading and storage.
- Data Access Properties (`train`, `valid`, `test`):
  - Remove these properties entirely from the base class, as they are incompatible with diverse subclass data structures. Subclasses should provide their own methods for accessing data splits/items if needed.
- `run`:
  - Make it `@abstractmethod`.
  - Use a generic signature: `def run(self, *args, **kwargs) -> Any:`.
  - Update docstring: Emphasize that subclasses must define their required arguments (e.g., agent(s), retrievers, dataset specifics) and return type (e.g., `self`, metrics dictionary).
- `evaluate`:
  - Add an optional, non-abstract method: `def evaluate(self) -> Dict[str, Any]:`.
  - The default implementation should raise `NotImplementedError` to clearly signal it needs overriding by subclasses that perform evaluation.
  - Docstring should state its purpose: process results (potentially from `self._results` or elsewhere) and return calculated metrics.
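Taken together, the suggestions above could look roughly like the following. This is only a sketch under assumptions, not the final design: the `name` argument, the choice to have `load` do nothing by default, the `self`-returning methods, and the toy `EchoBenchmark` subclass are all made up for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class BaseBenchmark(ABC):
    """Sketch of the proposed base class (not the current CAMEL code)."""

    def __init__(
        self, name: str, data_dir: str, save_to: Optional[str] = None
    ) -> None:
        # `name` is an assumed argument. `data_dir` is only a suggested
        # local path; subclasses might use other mechanisms (like caches).
        self.name = name
        self.data_dir = data_dir
        self.save_to = save_to
        # Subclasses might manage results differently.
        self._results: List[Dict[str, Any]] = []

    @abstractmethod
    def download(self) -> "BaseBenchmark":
        """Acquire the benchmark data; every subclass must implement this."""

    def load(self) -> "BaseBenchmark":
        """Optional hook. This sketch picks the 'do nothing' default."""
        return self

    @abstractmethod
    def run(self, *args: Any, **kwargs: Any) -> Any:
        """Subclasses define their own arguments and return type."""

    def evaluate(self) -> Dict[str, Any]:
        """Process results and return calculated metrics."""
        raise NotImplementedError(
            f"{type(self).__name__} does not implement evaluate()."
        )


class EchoBenchmark(BaseBenchmark):
    """Hypothetical toy subclass, used only to exercise the sketch."""

    def download(self) -> "EchoBenchmark":
        return self  # nothing to fetch in this toy example

    def run(self, prompts: List[str]) -> "EchoBenchmark":
        # Count a prompt as "correct" when it is non-empty.
        self._results = [{"prompt": p, "correct": bool(p)} for p in prompts]
        return self

    def evaluate(self) -> Dict[str, Any]:
        correct = sum(r["correct"] for r in self._results)
        return {"total": len(self._results), "correct": correct}


bench = EchoBenchmark("echo", "data/echo")
metrics = bench.download().load().run(["hello", ""]).evaluate()
```

A subclass that never evaluates can simply skip `evaluate`, and callers get a clear `NotImplementedError` instead of a silent empty result.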
Thank you @Wendong-Fan for the explanation — I think this direction makes perfect sense! If @sunchengxuanivy is interested, I’d be totally happy for @sunchengxuanivy to take it forward. Otherwise, I’d be glad to take it on myself, or work on it together. @sunchengxuanivy, feel free to let me know what you think!
https://github.com/camel-ai/camel/pull/2293#discussion_r2083319389
to simplify message for html report generation.