FEAT Content harm scenario
Description
Add a content harm scenario that provides a general set of attacks for each harm category. The idea is to have a quick scenario that covers a broad set of harms before drilling down into more specific ones. The scenario uses the PromptSending (baseline), RolePlay, ManyShot, and RedTeam attacks against a set of objectives (user defined or provided in the datasets/seed_prompts/harms folder) to produce this summary.
Tests and Documentation
Added a content harm notebook plus instructions for dataset naming. Added unit tests.
Overall this is good! It'll be really nice to have solid examples here :)
My biggest feedback is that I think we should define exactly what we want out of this scenario. Here is what I think it is: "Can I get a vibe of this objective_target in a couple hours based on how it does on these harm categories?"
And if we keep that strategy, we want to do the best we can to answer that question, and the strategies themselves should be baked in as much as possible. Along these lines, I'd recommend:
- Simplify the strategies. I suspect most users just want to run "all" to get a vibe check, or to run specific harm categories. And if there is a strategy they want but it takes a long time (like crescendo), maybe we should split that off into a separate longer-running scenario class.
- Choose the attacks to run for those strategies explicitly (which converters and attacks to use). E.g. we can get the objectives from memory, and then this scenario can decide how we send those. I wouldn't make this configurable, because it adds another dimension for users to reason about.
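
To make the two suggestions above concrete, here is a rough sketch of the shape I have in mind. It is plain Python, not the actual scenario API: `HarmStrategy`, `BAKED_IN_ATTACKS`, `expand`, and `plan_runs` are hypothetical names, and the `hate` and `violence` categories are placeholders. Only the four attack names come from the PR description.

```python
from enum import Enum


class HarmStrategy(Enum):
    """Hypothetical strategy enum: one member per harm category, plus ALL."""
    ALL = "all"
    HATE = "hate"          # placeholder category name
    VIOLENCE = "violence"  # placeholder category name


# Fixed by the scenario rather than configured by the user.
# Attack names are taken from the PR description; used here as labels only.
BAKED_IN_ATTACKS = ("PromptSending", "RolePlay", "ManyShot", "RedTeam")


def expand(strategies):
    """Resolve the selected strategies to a sorted list of harm category names."""
    selected = set(strategies)
    if HarmStrategy.ALL in selected:
        # ALL stands for every concrete category.
        selected = set(HarmStrategy) - {HarmStrategy.ALL}
    return sorted(s.value for s in selected)


def plan_runs(strategies):
    """Pair every selected harm category with every baked-in attack."""
    return [
        (harm, attack)
        for harm in expand(strategies)
        for attack in BAKED_IN_ATTACKS
    ]
```

With this shape the only thing a user picks is the strategy list, e.g. `plan_runs([HarmStrategy.ALL])` or `plan_runs([HarmStrategy.HATE])`, and the scenario owns the attack choice.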