[Feature Request] Refactor `BaseBenchmark`

Open Wendong-Fan opened this issue 1 year ago • 1 comment

Required prerequisites

  • [X] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
  • [ ] Consider asking first in a Discussion.

Motivation

As we integrate more benchmarks into camel, the current `BaseBenchmark` could be refactored to provide broader support for various types of benchmarks; the `run` method in the current `BaseBenchmark` is one example of an interface that needs to be generalized.

Solution

No response

Alternatives

No response

Additional context

No response

Wendong-Fan avatar Dec 18 '24 07:12 Wendong-Fan

Hi @Wendong-Fan, I'm interested in working on this issue. As I’ve been contributing to the Coderag-Benchmark integration, I’m now quite familiar with the different benchmark structures in the codebase.

Could you help clarify the goal of this refactor?

  1. Is the goal mainly to make BaseBenchmark.run() more compatible with implementations like Gaia, APIBank, and RagBench? Right now the signatures differ, but everything works in practice.
  2. Benchmarks follow different patterns: e.g., Gaia uses __init__() to pass the retriever, RagBench uses run(), and APIBank/nexus/APIBench do not need one. Given this diversity, is the aim simply to make the base class flexible enough to support these variations? For example, would including a retriever parameter in both __init__() and run() be a reasonable direction to align all the current benchmarks?
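
For illustration only, here is a minimal sketch of what accepting a retriever in both `__init__()` and `run()` might look like. The import path, class name, and parameter names are assumptions for the sake of the example, and the method bodies are elided:

```python
from typing import Any, Optional

from camel.benchmarks import BaseBenchmark  # assumed import path


class RetrieverAwareBenchmark(BaseBenchmark):
    """Sketch only: accepts a retriever at construction or at run time."""

    def __init__(self, *args: Any, retriever: Optional[Any] = None,
                 **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        # Gaia-style: retriever supplied when the benchmark is constructed.
        self._retriever = retriever

    def run(self, agent: Any, *args: Any,
            retriever: Optional[Any] = None, **kwargs: Any) -> Any:
        # RagBench-style: a retriever passed at run time takes precedence;
        # APIBank/nexus/APIBench-style benchmarks would simply ignore both.
        active_retriever = retriever or self._retriever
        ...
```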

boerz-coding avatar Apr 26 '25 06:04 boerz-coding

That would be awesome!! cc @Wendong-Fan for comments

lightaime avatar Apr 29 '25 02:04 lightaime

Hi @boerz-coding,

Thanks for your input, and apologies for the delay in my reply.

Regarding the BaseBenchmark refactoring:

  1. Scope: The goal is broader than just refactoring .run(); we aim to improve other base class functions too. I highlighted .run() as an example because its current specific signature struggles to accommodate diverse benchmark requirements.
  2. Objectives: Our main aims are: a. To make the base class significantly more flexible to support various benchmark implementations. b. To standardize interfaces across benchmark modules. This could involve defining a more general EvalResult model in BaseBenchmark, as discussed in [this PR comment](https://github.com/camel-ai/camel/pull/2293#discussion_r2072368289) (cc @sunchengxuanivy).
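
For illustration, one hypothetical shape such a shared EvalResult model could take (the field names and the use of pydantic here are assumptions for the sketch, not the design settled in the linked PR):

```python
from typing import Any, Dict, Optional

from pydantic import BaseModel, Field


class EvalResult(BaseModel):
    """Hypothetical shared record that each benchmark could emit per sample."""

    question_id: str
    prediction: Any
    ground_truth: Optional[Any] = None
    correct: Optional[bool] = None
    # Free-form slot for benchmark-specific fields (latency, tool calls, ...).
    metadata: Dict[str, Any] = Field(default_factory=dict)
```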

Here are some detailed suggestions for the refactoring, though our implementation should not be limited to these points (a consolidated sketch follows the list below):

  • __init__:
    • Make save_to: Optional[str] = None.
    • Update docstrings: Clarify data_dir is a suggested local path; subclasses might use other mechanisms (like caches).
    • Keep self._results: List[Dict[str, Any]] = [], but acknowledge subclasses might manage results differently.
  • download:
    • Keep as @abstractmethod. Subclasses must implement their specific data acquisition logic.
  • load:
    • Remove @abstractmethod. Provide a default implementation (perhaps raising NotImplementedError or doing nothing).
    • Remove force_download parameter from the base signature.
    • Remove the expectation that load populates self._data in a specific format. Subclasses should be fully responsible for their internal data loading and storage.
  • Data Access Properties (train, valid, test):
    • Remove these properties entirely from the base class, as they are incompatible with diverse subclass data structures. Subclasses should provide their own methods for accessing data splits/items if needed.
  • run:
    • Make it @abstractmethod.
    • Use a generic signature: def run(self, *args, **kwargs) -> Any:.
    • Update docstring: Emphasize that subclasses must define their required arguments (e.g., agent(s), retrievers, dataset specifics) and return type (e.g., self, metrics dictionary).
  • evaluate:
    • Add an optional, non-abstract method: def evaluate(self) -> Dict[str, Any]:.
    • The default implementation should raise NotImplementedError to clearly signal it needs overriding by subclasses that perform evaluation.
    • Docstring should state its purpose: process results (potentially from self._results or elsewhere) and return calculated metrics.
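
Putting these points together, a minimal sketch of what the refactored base class could look like. This is purely illustrative: the constructor parameters are kept minimal, and bodies, docstrings, and return types are open to discussion:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class BaseBenchmark(ABC):
    """Sketch of the proposed interface, not the final implementation."""

    def __init__(self, name: str, data_dir: str,
                 save_to: Optional[str] = None) -> None:
        # data_dir is a suggested local path; subclasses may rely on caches
        # or other storage mechanisms instead.
        self.name = name
        self.data_dir = data_dir
        self.save_to = save_to
        # Kept for convenience; subclasses may manage results differently.
        self._results: List[Dict[str, Any]] = []

    @abstractmethod
    def download(self) -> "BaseBenchmark":
        """Acquire benchmark data; each subclass defines its own logic."""
        ...

    def load(self) -> "BaseBenchmark":
        """Optional hook; subclasses own their data loading and storage."""
        raise NotImplementedError

    @abstractmethod
    def run(self, *args: Any, **kwargs: Any) -> Any:
        """Subclasses define their required arguments (agents, retrievers,
        dataset specifics) and return type (e.g. self or a metrics dict)."""
        ...

    def evaluate(self) -> Dict[str, Any]:
        """Process collected results and return metrics; override to use."""
        raise NotImplementedError
```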

Wendong-Fan avatar May 05 '25 08:05 Wendong-Fan

Thank you @Wendong-Fan for the explanation — I think this direction makes perfect sense! If @sunchengxuanivy is interested, I’d be totally happy for @sunchengxuanivy to take it forward. Otherwise, I’d be glad to take it on myself, or work on it together. @sunchengxuanivy, feel free to let me know what you think!

boerz-coding avatar May 06 '25 05:05 boerz-coding

https://github.com/camel-ai/camel/pull/2293#discussion_r2083319389

This is to simplify the message for HTML report generation.

sunchengxuanivy avatar May 12 '25 04:05 sunchengxuanivy