
Feature/persuasion jailbreak probe

Open asaadkhaja99 opened this issue 2 months ago • 5 comments

This PR (addressing #683) adds a new probe implementing Persuasive Adversarial Prompts (PAP) from the paper "Persuasive Adversarial Prompts". The probe tests whether LLMs can resist jailbreak attempts that use persuasion techniques drawn from social science, such as Authority Endorsement, Logical Appeal, and Priming.

The probe includes 6 static prompts extracted from successful examples in the paper, covering various harmful request categories (illegal activity, malware, misinformation, adult content, phishing, eating disorders).

For now I have classified the severity as OF_CONCERN, since a successful jailbreak could produce sensitive content.
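
As a rough illustration of the shape described above (a fixed list of prompts, one per harm category, plus a severity tier), here is a minimal standalone sketch. It does not import garak and is not the code in this PR: `Tier`, `PersuasionPrompt`, `StaticPromptProbe`, and `build_placeholder_probe` are hypothetical names, and the prompt texts are placeholders rather than the examples taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Hypothetical stand-in for a severity tier; only OF_CONCERN is named in the PR."""
    OF_CONCERN = 1
    INFORMATIONAL = 2


@dataclass(frozen=True)
class PersuasionPrompt:
    """One static prompt, labelled with the harm category it targets."""
    category: str
    text: str


# The six categories listed in the PR description.
CATEGORIES = [
    "illegal activity",
    "malware",
    "misinformation",
    "adult content",
    "phishing",
    "eating disorders",
]


class StaticPromptProbe:
    """Hypothetical probe that holds a fixed prompt list and a severity tier."""

    tier = Tier.OF_CONCERN

    def __init__(self, prompts):
        self.prompts = list(prompts)

    def prompt_texts(self):
        # Return only the text that would be sent to the model under test.
        return [p.text for p in self.prompts]


def build_placeholder_probe():
    # Placeholder text only; the real prompts come from the paper's examples.
    prompts = [
        PersuasionPrompt(category=c, text=f"<PAP example for {c}>")
        for c in CATEGORIES
    ]
    return StaticPromptProbe(prompts)


probe = build_placeholder_probe()
print(len(probe.prompts), probe.tier.name)
```

Keeping the category next to each prompt is a design choice of this sketch, not something the PR states; it would make it easier to report results per category.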

Verification

List the steps needed to make sure this thing works

  • [x] Verify the probe can be loaded: python -m garak --list_probes | grep persuasion
  • [x] Run the probe against the test generator: python -m garak -t test -p persuasion.PersuasivePAP
  • [x] Run the probe tests and ensure they all pass: pytest tests/probes/test_probes_persuasion.py
  • [x] Run all probe tests and ensure they pass: pytest tests/probes/
  • [x] Verify the probe runs against an LLM and generates a report (this step requires Hugging Face to be configured): garak -t huggingface -n meta-llama/Llama-2-7b-chat-hf -p persuasion.PersuasivePAP
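
The contents of tests/probes/test_probes_persuasion.py are not shown in this thread. As a hedged sketch of the kind of sanity check such a test might make on a static prompt list, the helper below flags missing, empty, non-string, or duplicated prompts. `find_prompt_problems` is a hypothetical name and the sample prompts are placeholders.

```python
def find_prompt_problems(prompts):
    """Return a list of human-readable problems found in a static prompt list."""
    problems = []
    if not prompts:
        problems.append("no prompts defined")
    seen = set()
    for i, prompt in enumerate(prompts):
        if not isinstance(prompt, str):
            problems.append(f"prompt {i} is not a string")
            continue
        if not prompt.strip():
            problems.append(f"prompt {i} is empty")
        if prompt in seen:
            problems.append(f"prompt {i} duplicates an earlier prompt")
        seen.add(prompt)
    return problems


# Six placeholder prompts, mirroring the count given in the PR description.
sample = [f"<placeholder prompt {n}>" for n in range(6)]
assert find_prompt_problems(sample) == []
```

Under pytest, a test function could simply assert that this helper returns an empty list for the probe's prompts.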

asaadkhaja99, Nov 05 '25 17:11

DCO Assistant Lite bot All contributors have signed the DCO ✍️ ✅

github-actions[bot], Nov 05 '25 17:11

I have read the DCO Document and I hereby sign the DCO

asaadkhaja99, Nov 05 '25 17:11

recheck

asaadkhaja99, Nov 05 '25 17:11

I will change this to a draft while I investigate and address the test failures.

asaadkhaja99, Nov 07 '25 01:11

This needs more of the PAP work implemented before acceptance.

@leondz Thank you for the feedback, and my apologies for the misunderstanding. I see now that the full Broad Scan dataset and the iterative probing implementation are available on Hugging Face and in the paper's repo. I'll get started on updating the implementation as suggested.

asaadkhaja99, Nov 12 '25 14:11