feat: Integrating ST-WebAgentBenchmark #3037

Open right1wrong opened this issue 3 months ago • 3 comments

Description

Fixes #3037

  • Implement STWebAgentBenchmark class inheriting from BaseBenchmark
  • Add STWebAgentBenchConfig for configuration management
  • Add STWebAgentTask and STWebAgentResult data models
  • Support for 6 policy dimensions: user_consent, boundary, strict_execution, hierarchy, robustness, error_handling
  • Integration with ChatAgent and Workforce
  • Proper implementation of abstract methods (download, load, run)
  • Add exports to benchmarks `__init__.py`
  • Following CAMEL coding patterns and documentation style
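
As a rough illustration of the task and result data models listed above, here is a minimal sketch. The class names `TaskSketch` and `ResultSketch` and all of their fields are hypothetical and are not the PR's actual `STWebAgentTask` / `STWebAgentResult` definitions; only the six policy dimension names come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The six policy dimensions named in the PR description.
POLICY_DIMENSIONS = (
    "user_consent",
    "boundary",
    "strict_execution",
    "hierarchy",
    "robustness",
    "error_handling",
)


@dataclass
class TaskSketch:
    """Hypothetical task record; field names are assumptions."""

    task_id: str
    intent: str
    # Maps a policy dimension to the policy texts that apply to this task.
    policies: Dict[str, List[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject dimensions outside the six listed in the description.
        unknown = set(self.policies) - set(POLICY_DIMENSIONS)
        if unknown:
            raise ValueError(f"Unknown policy dimensions: {sorted(unknown)}")


@dataclass
class ResultSketch:
    """Hypothetical per-task result; field names are assumptions."""

    task_id: str
    completed: bool
    # Maps a policy dimension to the number of violations observed.
    violations: Dict[str, int] = field(default_factory=dict)

    @property
    def completed_under_policy(self) -> bool:
        # True only if the task finished with zero recorded violations.
        return self.completed and sum(self.violations.values()) == 0
```

Validating the dimension names at construction time keeps a typo such as `user_consnt` from silently producing an unscored policy.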

The benchmark evaluates web agents on safety and trustworthiness in realistic enterprise scenarios.

Features

  • Evaluates web agents on safety and trustworthiness in realistic enterprise scenarios
  • Supports parallel execution for performance
  • Comprehensive metrics including CR (Completion Rate), CuP (Completion under Policy), and Risk Ratios
  • Compatible with existing CAMEL agent infrastructure
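
To make the metrics above concrete, the sketch below computes them from a list of per-task outcomes. It is not the PR's implementation. It assumes CR is the fraction of tasks completed, CuP is the fraction completed with zero policy violations, and the risk ratio for a dimension is the number of violations in that dimension divided by the number of policies of that dimension across the evaluated tasks. The function name `summarize` and its input shapes are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed input shape: (completed, violations per policy dimension).
Result = Tuple[bool, Dict[str, int]]


def summarize(
    results: List[Result], policy_counts: Dict[str, int]
) -> Dict[str, object]:
    """Compute CR, CuP, and per-dimension risk ratios (assumed definitions)."""
    n = len(results)
    if n == 0:
        raise ValueError("results must not be empty")

    # CR: share of tasks that were completed at all.
    cr = sum(1 for done, _ in results if done) / n
    # CuP: share of tasks completed without any policy violation.
    cup = sum(1 for done, v in results if done and not any(v.values())) / n
    # Risk ratio: violations in a dimension over policies in that dimension.
    risk = {
        dim: sum(v.get(dim, 0) for _, v in results) / count
        for dim, count in policy_counts.items()
        if count > 0
    }
    return {"CR": cr, "CuP": cup, "risk_ratio": risk}
```

Under these definitions CuP can never exceed CR, so a large gap between the two points to agents that finish tasks by breaking policies.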

Testing

  • [x] Import tests pass
  • [x] Basic benchmark creation works
  • [x] Configuration validation works
  • [x] Follows established CAMEL patterns

Notes

The actual ST-WebAgentBench environment dependencies are optional and will be installed when users need the full functionality.
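
One common way to keep such dependencies optional is to defer the import until the feature is used. The sketch below shows that pattern using only the standard library. The helper name `require_optional` is hypothetical, and this is not necessarily how the PR handles it.

```python
import importlib
import importlib.util
from types import ModuleType


def require_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import a top-level module on demand, or fail with an install hint."""
    # find_spec returns None when a top-level module is not installed.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this feature. {install_hint}"
        )
    return importlib.import_module(module_name)
```

With this approach, importing the benchmark class stays cheap and only the code path that needs the environment (for example `run`) would call the helper.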

Checklist

Go over all the following points, and put an x in all the boxes that apply.

  • [x] I have read the CONTRIBUTION guide (required)
  • [x] I have linked this PR to an issue using the Development section on the right sidebar or by adding Fixes #issue-number in the PR description (required)
  • [ ] I have checked if any dependencies need to be added or updated in pyproject.toml and uv lock
  • [x] I have updated the tests accordingly (required for a bug fix or a new feature)
  • [ ] I have updated the documentation if needed:
  • [ ] I have added examples if this is a new feature

If you are unsure about any of these, don't hesitate to ask. We are here to help!

right1wrong avatar Sep 10 '25 02:09 right1wrong

[!IMPORTANT]

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

coderabbitai[bot] avatar Sep 10 '25 02:09 coderabbitai[bot]

Hey @right1wrong, just checking in on this PR since it hasn't been updated in a while. Please let us know if there's anything we can do to help.

Wendong-Fan avatar Sep 29 '25 06:09 Wendong-Fan

Hey @right1wrong,

Hope everything is going well. Feel free to mention me in a comment to let us know when it's ready for review!

waleedalzarooni avatar Nov 14 '25 15:11 waleedalzarooni

Can you add an example? You can refer to https://github.com/camel-ai/camel/blob/master/examples/benchmarks/ragbench.py

Hi Saedbhati! Could you share where to add the example? I have changed all the docstrings to the r"""...""" format, although I'm not sure it's necessary, since r"""...""" is usually used for text containing backslashes.

right1wrong avatar Nov 16 '25 04:11 right1wrong

Hi @right1wrong, sorry for the delayed reply. You can add an example in camel/examples/benchmarks; refer to https://github.com/camel-ai/camel/blob/master/examples/benchmarks/ragbench.py

fengju0213 avatar Nov 26 '25 06:11 fengju0213