feat: Integrate ST-WebAgentBench #3037
Description
Fixes #3037
- Implement STWebAgentBenchmark class inheriting from BaseBenchmark
- Add STWebAgentBenchConfig for configuration management
- Add STWebAgentTask and STWebAgentResult data models
- Support for 6 policy dimensions: user_consent, boundary, strict_execution, hierarchy, robustness, error_handling
- Integration with ChatAgent and Workforce
- Implement the BaseBenchmark abstract methods (download, load, run); see the sketch after this description
- Add exports to the benchmarks `__init__.py`
- Follow CAMEL coding patterns and documentation style
The benchmark evaluates web agents on safety and trustworthiness in realistic enterprise scenarios.
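For reviewers, here is a minimal sketch of how the pieces above fit together. Only the class names and the six policy dimensions come from this PR; the constructor and method signatures are illustrative assumptions, not the actual CAMEL API.

```python
# Sketch only: signatures beyond the names listed above are assumptions.
from dataclasses import dataclass, field
from typing import Any, Dict, List

POLICY_DIMENSIONS = [
    "user_consent", "boundary", "strict_execution",
    "hierarchy", "robustness", "error_handling",
]

@dataclass
class STWebAgentBenchConfig:
    policy_dimensions: List[str] = field(
        default_factory=lambda: list(POLICY_DIMENSIONS)
    )
    parallel: bool = True  # parallel task execution (see Features)

class STWebAgentBenchmark:  # in the PR this inherits camel.benchmarks.BaseBenchmark
    def __init__(self, config: STWebAgentBenchConfig) -> None:
        self.config = config
        self._tasks: List[Any] = []  # populated with STWebAgentTask objects

    def download(self) -> None:
        """Fetch the ST-WebAgentBench dataset (stub)."""

    def load(self) -> None:
        """Parse the dataset into STWebAgentTask records (stub)."""

    def run(self, agent: Any) -> Dict[str, Any]:
        """Run the agent on each task, checking every policy dimension."""
        return {"results": []}  # STWebAgentResult records in the real class
```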
Features
- Supports parallel execution for performance
- Comprehensive metrics, including CR (Completion Rate), CuP (Completion under Policy), and per-dimension Risk Ratios (illustrated after this list)
- Compatible with existing CAMEL agent infrastructure
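To make the metric names concrete, here is a rough illustration of how CR, CuP, and per-dimension risk ratios could be computed from result records. The field names are hypothetical stand-ins for STWebAgentResult; the exact formulas follow the ST-WebAgentBench paper.

```python
from collections import Counter

# Toy results; field names are hypothetical stand-ins for STWebAgentResult.
results = [
    {"completed": True,  "violations": []},               # safe success
    {"completed": True,  "violations": ["user_consent"]}, # unsafe success
    {"completed": False, "violations": ["boundary"]},     # failure
]

n = len(results)
cr = sum(r["completed"] for r in results) / n       # CR: fraction of tasks completed
cup = sum(r["completed"] and not r["violations"]    # CuP: completed with
          for r in results) / n                     # zero policy violations

# Risk ratio per dimension: fraction of tasks that violated it.
counts = Counter(v for r in results for v in set(r["violations"]))
risk = {dim: c / n for dim, c in counts.items()}

print(f"CR={cr:.2f}  CuP={cup:.2f}  risk={risk}")
# CR=0.67  CuP=0.33  risk={'user_consent': 0.33, 'boundary': 0.33}
```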
Testing
- [x] Import tests pass
- [x] Basic benchmark creation works
- [x] Configuration validation works
- [x] Follows established CAMEL patterns
Notes
The ST-WebAgentBench environment dependencies are optional; they only need to be installed when users want the full evaluation functionality.
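As a sketch, one common way to keep such dependencies optional is a guarded import that fails with an actionable message. The module name below is an assumption about the environment backend, not a confirmed import in this PR.

```python
# Sketch of an optional-dependency guard; the module name is an assumption.
try:
    import browsergym  # heavy web-environment dependency
except ImportError:
    browsergym = None

def _require_environment() -> None:
    # Called at the top of run(), so missing dependencies surface early
    # with a clear message instead of deep inside task execution.
    if browsergym is None:
        raise ImportError(
            "Running ST-WebAgentBench tasks requires the optional "
            "environment dependencies; install them for full functionality."
        )
```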
Checklist
Go over all the following points, and put an x in all the boxes that apply.
- [x] I have read the CONTRIBUTION guide (required)
- [x] I have linked this PR to an issue using the Development section on the right sidebar or by adding `Fixes #issue-number` in the PR description (required)
- [ ] I have checked if any dependencies need to be added or updated in `pyproject.toml` and `uv lock`
- [x] I have updated the tests accordingly (required for a bug fix or a new feature)
- [ ] I have updated the documentation if needed:
- [ ] I have added examples if this is a new feature
If you are unsure about any of these, don't hesitate to ask. We are here to help!
Hey @right1wrong, just checking in on this PR since it hasn't been updated in a while. Please let us know if there's anything we can do to help!
Hey @right1wrong,
Hope everything is going well! Feel free to mention me in a comment to let us know when this is ready for review.
Can you add an example? You can refer to this one: https://github.com/camel-ai/camel/blob/master/examples/benchmarks/ragbench.py
Hi Saedbhati! Could you share where to add the example? I have changed all the docstrings to the r"""...""" format, although I'm not sure it's necessary, since r"""...""" is usually used for text containing backslashes.
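For reference, here is a quick illustration of why the raw form matters when a docstring does contain backslashes:

```python
def f():
    """Compute \alpha."""   # '\a' is the BEL escape: stored text is 'Compute \x07lpha.'

def g():
    r"""Compute \alpha."""  # raw string: the backslash survives as written

print(repr(f.__doc__))  # 'Compute \x07lpha.'
print(repr(g.__doc__))  # 'Compute \\alpha.'
```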
Hi @right1wrong, sorry for the delayed reply. You can add an example in camel/examples/benchmarks; refer to https://github.com/camel-ai/camel/blob/master/examples/benchmarks/ragbench.py
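Something along these lines could work, loosely following the structure of ragbench.py. The class names come from this PR's description; the file name and exact constructor/run signatures are assumptions.

```python
# examples/benchmarks/st_web_agent_bench.py (hypothetical sketch)
from camel.agents import ChatAgent
from camel.benchmarks import STWebAgentBenchConfig, STWebAgentBenchmark

config = STWebAgentBenchConfig()      # defaults to all six policy dimensions
benchmark = STWebAgentBenchmark(config)

benchmark.download()                  # fetch the dataset if not cached
benchmark.load()

agent = ChatAgent("You are a careful, policy-abiding web agent.")
results = benchmark.run(agent)
print(results)
```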