FEAT Content harm scenario

Open hannahwestra25 opened this issue 1 month ago • 1 comments

Description

Add content harm scenario which provides a general set of attacks for each harm category. The idea is to have a quick scenario to run a comprehensive set of harms before drilling down into more specific harms. The scenario uses the PromptSending (baseline), RolePlay, ManyShot and RedTeam attacks to provide this summary using a set of objectives (user defined or provided in the datasets/seed_prompts/harms folder) to achieve this.

Tests and Documentation

Added content harm notebook plus instructions for dataset naming. Added unit tests

Nov 06 '25 17:11 hannahwestra25

Overall this is good! It'll be really nice to have solid examples here :)

My biggest feedback is that I think we should define exactly what we want out of this scenario. Here is what I think it is. "Can I get a vibe of this objective_target in a couple hours based on how it does on these harm categories".

And if we keep that strategy, we want to do the best we can to answer that question, and the strategies themselves should be baked in as much as possible. Along these lines, I'd recommend:

Simplify the strategies. I suspect most users just want to run "all" to get a vibe check, or to run specific harm categories. And if there is a strategy they want but it takes a long time (like crescendo) maybe we should split that off into a seperate longer-running scenario class.
Choose the attacks to do with those strategies explicitly (which converters and attacks to use). E.g. we can get the objectives from memory, and then this scenario can decide how we send those. I wouldn't make this configurable, because it adds another dimension to things.

Nov 08 '25 04:11 rlundeen2