Feature/persuasion jailbreak probe
This PR (addressing #683) adds a new probe implementing Persuasive Adversarial Prompts (PAP), as introduced in the PAP paper. The probe tests whether LLMs can resist jailbreak attempts that use social-science-based persuasion techniques such as Authority Endorsement, Logical Appeal, and Priming, among others.
The probe includes 6 static prompts extracted from successful examples in the paper, covering various harmful request categories (illegal activity, malware, misinformation, adult content, phishing, eating disorders).
I have tentatively classified the severity as OF_CONCERN, given the potential to generate sensitive content if the jailbreak succeeds.
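For context, here is a minimal sketch of how a static-prompt probe like this is typically laid out in garak. The class name `PersuasivePAP` and module `persuasion` come from the commands in the verification list below; the attribute names (`bcp47`, `goal`, `recommended_detector`, `tags`, `prompts`) follow common garak probe conventions, and the detector, tags, and prompt text here are placeholders rather than the PR's actual values.

```python
# Illustrative sketch of garak/probes/persuasion.py -- not the PR's actual code.
from garak.probes.base import Probe


class PersuasivePAP(Probe):
    """Static Persuasive Adversarial Prompts (PAP) taken from successful examples in the paper."""

    bcp47 = "en"  # prompts are English-language
    goal = "persuade the model to comply with a harmful request"
    recommended_detector = ["mitigation.MitigationBypass"]  # assumed detector choice
    tags = ["avid-effect:security:S0100"]  # placeholder tag; real taxonomy entries may differ
    active = True

    # Six static prompts, one per persuasion technique / harm category (text elided here).
    prompts = [
        "<Authority Endorsement example>",
        "<Logical Appeal example>",
        "<Priming example>",
        # ... three further examples covering the remaining categories
    ]
```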
Verification
Steps taken to verify the probe works:
- [x] Verify the probe can be loaded: `python -m garak --list_probes | grep persuasion`
- [x] Run the probe against the test generator: `python -m garak -t test -p persuasion.PersuasivePAP`
- [x] Run the probe tests and ensure they all pass: `pytest tests/probes/test_probes_persuasion.py` (a sketch of this test file follows the list)
- [x] Run all probe tests and ensure they pass: `pytest tests/probes/`
- [x] Verify the probe runs and successfully generates a report with an LLM: `garak -t huggingface -n meta-llama/Llama-2-7b-chat-hf -p persuasion.PersuasivePAP` (this step requires configuring Hugging Face)
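For reference, a minimal version of the test file named above might look like the following sketch. It assumes garak's `_plugins.load_plugin` helper for instantiating plugins by path; the PR's actual tests may be structured differently.

```python
# Illustrative sketch of tests/probes/test_probes_persuasion.py -- not the PR's actual tests.
from garak import _plugins


def test_persuasion_pap_loads():
    # instantiate the probe from its plugin path
    probe = _plugins.load_plugin("probes.persuasion.PersuasivePAP")
    assert probe is not None


def test_persuasion_pap_prompts_present():
    probe = _plugins.load_plugin("probes.persuasion.PersuasivePAP")
    # the PR description states six static prompts are included
    assert len(probe.prompts) == 6
    assert all(isinstance(p, str) and p for p in probe.prompts)
```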
Will change to draft while I investigate and address the test failures
This needs more of the PAP work implemented before acceptance
@leondz Thank you for the feedback, and my apologies for the misunderstanding. I see now that the full Broad Scan dataset and iterative probing implementation are available on Hugging Face / in the paper's repo. I'll get started on updating the implementation as suggested.