FEAT: Add anecdoctor orchestrator to build attack prompts from real-world examples.

Open migdaepp opened this issue 8 months ago • 0 comments

Description

Adding a new Orchestrator that constructs attack prompts based on real-world examples. This orchestrator performs best for informational harms that center on a consistent narrative (i.e., one obtained by clustering a larger attack dataset). There are two options:

The baseline approach simply constructs prompts from few-shot examples.
Setting use_knowledge_graph=True uses a processing target to construct a knowledge graph from the examples. This produces attack prompts that are more robust to cultural and linguistic differences in the example data.
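
To make the difference between the two options concrete, here is a minimal, self-contained sketch. It is not the orchestrator's actual API: `build_prompt`, `Triple`, and the `extract_triples` callable (a stand-in for the processing target) are hypothetical names chosen for illustration, and the prompt wording is a placeholder.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical type: a (subject, relation, object) edge in a knowledge graph.
Triple = Tuple[str, str, str]


def build_prompt(
    examples: List[str],
    use_knowledge_graph: bool = False,
    extract_triples: Optional[Callable[[List[str]], List[Triple]]] = None,
) -> str:
    """Illustrative only: assemble a prompt from examples or from a graph of them."""
    if not examples:
        raise ValueError("at least one example is required")

    if use_knowledge_graph:
        # Assumption: a separate processing step turns raw examples into triples.
        if extract_triples is None:
            raise ValueError("extract_triples is required when use_knowledge_graph=True")
        triples = extract_triples(examples)
        header = "Knowledge graph:"
        context = "\n".join(f"({s}) -[{r}]-> ({o})" for s, r, o in triples)
    else:
        # Baseline: pass the examples through verbatim as few-shot context.
        header = "Examples:"
        context = "\n".join(
            f"Example {i}: {text.strip()}" for i, text in enumerate(examples, start=1)
        )

    return f"{header}\n{context}\n\nWrite one new item consistent with the above."


# Toy usage with placeholder data and a stubbed extraction step.
toy_examples = ["Claim A appears in a forum post.", "Claim A appears in a headline."]
baseline_prompt = build_prompt(toy_examples)
graph_prompt = build_prompt(
    toy_examples,
    use_knowledge_graph=True,
    extract_triples=lambda ex: [("Claim A", "appears_in", "forum post")],
)
```

In this sketch the graph variant drops the surface wording of the examples and keeps only their structure, which is one plausible reading of why the knowledge-graph option would be less sensitive to linguistic variation in the example data.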

This contribution follows on collaborations with @eugeniavkim.

Tests and Documentation

Tests are included for the orchestrator. We also include documentation (jupytext) with a toy dataset, as well as a link to real-world data for which the method works well. A paper validating the method is under revision and will be posted to arXiv in the near future.

May 08 '25 17:05 migdaepp