FEAT: Anthropic Model-Written Evals Dataset
Description
This PR adds support for the Anthropic model-written-evals dataset to PyRIT. The model-written-evals dataset contains 154 evaluation datasets designed to test LLM behaviors across 4 main categories: persona traits, sycophancy, advanced AI risks, and gender bias. The evaluations use language models to automatically generate test cases across multiple behavioral dimensions.
Dataset: https://github.com/anthropics/evals
Associated Paper: https://arxiv.org/abs/2212.09251
Work Completed
- Implemented the
fetch_anthropic_evals_dataset()function inanthropic_evals_dataset.py - Added unit tests in
test_anthropic_evals_dataset.py(12 test cases) - Added integration test in
test_fetch_datasets.py - Updated API documentation in
api.rst - Registered function in
pyrit/datasets/__init__.py
Related Issue
Contributes to issue #450
Thanks for adding these! made a few stylistic suggestions!
@0xm00n btw I just saw Anthropic drop a new dataset: https://www.anthropic.com/news/political-even-handedness
Any chance you're interested in contributing a similar fetcher for it as well? It has two prompts per row so it would require a tiny bit of custom handling, but otherwise (more?) straightforward than this one.
@0xm00n btw I just saw Anthropic drop a new dataset: https://www.anthropic.com/news/political-even-handedness
Any chance you're interested in contributing a similar fetcher for it as well? It has two prompts per row so it would require a tiny bit of custom handling, but otherwise (more?) straightforward than this one.
Yup, seems pretty easy. will work on it soon
@romanlutz @AdrGav941 I've refactored into the QA structure in the latest commit. let me know :)
The latest iteration looks good to me! If the added unit tests pass this should be good. Thanks for refactoring @0xm00n !
Yes, all tests pass!
@romanlutz following up here so we can close the PR
@romanlutz refactored for conflicts given the new v0.10.0! should save you some work :)