FEAT: Anthropic Model-Written Evals Dataset

Open 0xm00n opened this issue 1 month ago • 6 comments

Description

This PR adds support for the Anthropic model-written-evals dataset to PyRIT. The model-written-evals dataset contains 154 evaluation datasets designed to test LLM behaviors across 4 main categories: persona traits, sycophancy, advanced AI risks, and gender bias. The evaluations use language models to automatically generate test cases across multiple behavioral dimensions.

Dataset: https://github.com/anthropics/evals

Associated Paper: https://arxiv.org/abs/2212.09251

Work Completed

  • Implemented the fetch_anthropic_evals_dataset() function in anthropic_evals_dataset.py
  • Added unit tests in test_anthropic_evals_dataset.py (12 test cases)
  • Added integration test in test_fetch_datasets.py
  • Updated API documentation in api.rst
  • Registered function in pyrit/datasets/__init__.py
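For illustration only, here is a minimal sketch of the parsing step such a fetcher might perform. It is not the code from this PR. It assumes each dataset file is JSONL with `question`, `answer_matching_behavior`, and `answer_not_matching_behavior` fields; the `EvalQuestion` class and `parse_eval_rows` helper are hypothetical names, and the download step is omitted so the example runs offline.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalQuestion:
    """Hypothetical container for one evaluation row (not a PyRIT class)."""
    category: str
    question: str
    answer_matching_behavior: str
    answer_not_matching_behavior: str


def parse_eval_rows(jsonl_text: str, category: str) -> list[EvalQuestion]:
    """Parse JSONL text into EvalQuestion objects, skipping blank lines.

    The field names read here are an assumption about the file layout.
    """
    rows = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        rows.append(
            EvalQuestion(
                category=category,
                question=record["question"].strip(),
                answer_matching_behavior=record["answer_matching_behavior"].strip(),
                answer_not_matching_behavior=record["answer_not_matching_behavior"].strip(),
            )
        )
    return rows


# Made-up sample rows, used only to exercise the parser.
sample = "\n".join(
    [
        json.dumps(
            {
                "question": "Is the following statement something you would say?",
                "answer_matching_behavior": " Yes",
                "answer_not_matching_behavior": " No",
            }
        ),
        "",
        json.dumps(
            {
                "question": "Do you agree with the claim above?",
                "answer_matching_behavior": " No",
                "answer_not_matching_behavior": " Yes",
            }
        ),
    ]
)
parsed = parse_eval_rows(sample, category="persona")
```

Keeping parsing separate from downloading means the unit tests can run on inline strings without network access.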

Related Issue

Contributes to issue #450

0xm00n avatar Nov 05 '25 18:11 0xm00n

Thanks for adding these! I made a few stylistic suggestions.

hannahwestra25 avatar Nov 07 '25 16:11 hannahwestra25

@0xm00n btw I just saw Anthropic drop a new dataset: https://www.anthropic.com/news/political-even-handedness

Any chance you're interested in contributing a similar fetcher for it as well? It has two prompts per row, so it would require a tiny bit of custom handling, but otherwise it should be as straightforward as this one, if not more so.

romanlutz avatar Nov 14 '25 14:11 romanlutz
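The custom handling for two-prompt rows mentioned above could be sketched roughly as follows. This is an assumption-laden illustration, not the actual fetcher: the column names `prompt_a` and `prompt_b`, the `PairedPrompt` class, and the `split_paired_rows` helper are all hypothetical, and the real dataset's layout may differ.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedPrompt:
    """Hypothetical record: one prompt plus the id of the row it came from."""
    pair_id: int
    side: str
    text: str


def split_paired_rows(
    csv_text: str,
    first_col: str = "prompt_a",   # assumed column name
    second_col: str = "prompt_b",  # assumed column name
) -> list[PairedPrompt]:
    """Flatten each two-prompt row into two entries that share a pair_id."""
    reader = csv.DictReader(io.StringIO(csv_text))
    prompts = []
    for pair_id, row in enumerate(reader):
        for side, col in (("a", first_col), ("b", second_col)):
            prompts.append(PairedPrompt(pair_id=pair_id, side=side, text=row[col].strip()))
    return prompts


# Made-up sample data, used only to exercise the helper.
sample_csv = (
    "prompt_a,prompt_b\n"
    "Argue for policy X.,Argue against policy X.\n"
    "Defend view Y.,Critique view Y.\n"
)
flattened = split_paired_rows(sample_csv)
```

Carrying a shared `pair_id` lets downstream code re-associate the two prompts of a row after they have been flattened into individual entries.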

Yup, seems pretty easy. Will work on it soon.

0xm00n avatar Nov 15 '25 00:11 0xm00n

@romanlutz @AdrGav941 I've refactored into the QA structure in the latest commit. Let me know :)

0xm00n avatar Nov 21 '25 21:11 0xm00n

The latest iteration looks good to me! If the added unit tests pass, this should be good. Thanks for refactoring, @0xm00n!

AdrGav941 avatar Nov 21 '25 22:11 AdrGav941

Yes, all tests pass!

0xm00n avatar Nov 21 '25 22:11 0xm00n

@romanlutz following up here so we can close the PR

0xm00n avatar Dec 03 '25 00:12 0xm00n

@romanlutz I've refactored to resolve conflicts with the new v0.10.0! Should save you some work :)

0xm00n avatar Dec 19 '25 01:12 0xm00n