Feature/persuasion jailbreak probe
This PR (addressing #683) adds a new probe implementing Persuasive Adversarial Prompts (PAP), as introduced in the PAP paper. The probe tests whether LLMs can resist jailbreak attempts that use social-science-based persuasion techniques such as Authority Endorsement, Logical Appeal, and Priming, among others.
The probe includes 6 static prompts extracted from successful examples in the paper, covering various harmful request categories (illegal activity, malware, misinformation, adult content, phishing, eating disorders).
I have tentatively classified the severity as OF_CONCERN, given the potential to generate sensitive content if the jailbreak succeeds.
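For context, here is a minimal sketch of how a static-prompt probe like this is typically laid out in garak. The class name `PersuasivePAP` and module `persuasion` come from the commands in the verification list below; the attribute names (`bcp47`, `goal`, `recommended_detector`, `tags`, `prompts`) follow common garak probe conventions, and the detector, tags, and prompt text here are placeholders rather than the PR's actual values.

```python
# Illustrative sketch of garak/probes/persuasion.py -- not the PR's actual code.
from garak.probes.base import Probe


class PersuasivePAP(Probe):
    """Static Persuasive Adversarial Prompts (PAP) taken from successful examples in the paper."""

    bcp47 = "en"  # prompts are English-language
    goal = "persuade the model to comply with a harmful request"
    recommended_detector = ["mitigation.MitigationBypass"]  # assumed detector choice
    tags = ["avid-effect:security:S0100"]  # placeholder tag; real taxonomy entries may differ
    active = True

    # Six static prompts, one per persuasion technique / harm category (text elided here).
    prompts = [
        "<Authority Endorsement example>",
        "<Logical Appeal example>",
        "<Priming example>",
        # ... three further examples covering the remaining categories
    ]
```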
Verification
Steps taken to verify the probe works:
- [x] Verify the probe can be loaded: `python -m garak --list_probes | grep persuasion`
- [x] Run the probe against the test generator: `python -m garak -t test -p persuasion.PersuasivePAP`
- [x] Run the probe tests and ensure they all pass: `pytest tests/probes/test_probes_persuasion.py` (a sketch of this test file follows the list)
- [x] Run all probe tests and ensure they pass: `pytest tests/probes/`
- [x] Verify the probe runs and successfully generates a report with an LLM: `garak -t huggingface -n meta-llama/Llama-2-7b-chat-hf -p persuasion.PersuasivePAP` (this step requires configuring Hugging Face)
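For reference, a minimal version of the test file named above might look like the following sketch. It assumes garak's `_plugins.load_plugin` helper for instantiating plugins by path; the PR's actual tests may be structured differently.

```python
# Illustrative sketch of tests/probes/test_probes_persuasion.py -- not the PR's actual tests.
from garak import _plugins


def test_persuasion_pap_loads():
    # instantiate the probe from its plugin path
    probe = _plugins.load_plugin("probes.persuasion.PersuasivePAP")
    assert probe is not None


def test_persuasion_pap_prompts_present():
    probe = _plugins.load_plugin("probes.persuasion.PersuasivePAP")
    # the PR description states six static prompts are included
    assert len(probe.prompts) == 6
    assert all(isinstance(p, str) and p for p in probe.prompts)
```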
Will change to draft while I investigate and address the test failures
This needs more of the PAP work implemented before acceptance
@leondz Thank you for the feedback, and my apologies for the misunderstanding. I see now that the full Broad Scan dataset and iterative probing implementation are available on Hugging Face / in the paper's repo. I'll get started on updating the implementation as suggested.