prompt-injection-defenses
JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks
https://arxiv.org/pdf/2312.10766
We propose JailGuard, a universal detection framework for jailbreaking and hijacking attacks across LLMs and MLLMs. JailGuard operates on the principle that attack inputs are inherently less robust than benign inputs, regardless of attack method or modality. Specifically, JailGuard mutates an untrusted input to generate variants and leverages the discrepancy among the variants' responses from the model to distinguish attack samples from benign samples.
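
The sketch below illustrates the variant-divergence idea, assuming a toy character-dropping mutator and a token-level Jaccard divergence score; the actual JailGuard system defines its own mutation operators and divergence measure, and `query_model` is a hypothetical callable standing in for whatever LLM/MLLM interface you use.

```python
import random


def mutate(prompt: str, n_variants: int = 8, drop_rate: float = 0.05) -> list[str]:
    """Generate perturbed variants of an untrusted prompt by randomly
    dropping a small fraction of characters (illustrative mutator only)."""
    variants = []
    for _ in range(n_variants):
        kept = [c for c in prompt if random.random() > drop_rate]
        variants.append("".join(kept))
    return variants


def response_divergence(responses: list[str]) -> float:
    """Average pairwise Jaccard distance between the token sets of the
    responses; a crude stand-in for a semantic divergence measure."""
    token_sets = [set(r.lower().split()) for r in responses]
    dists = []
    for i in range(len(token_sets)):
        for j in range(i + 1, len(token_sets)):
            union = token_sets[i] | token_sets[j]
            inter = token_sets[i] & token_sets[j]
            dists.append(1.0 - (len(inter) / len(union) if union else 1.0))
    return sum(dists) / len(dists) if dists else 0.0


def detect_attack(prompt: str, query_model, threshold: float = 0.5) -> bool:
    """Flag the input as an attack when its perturbed variants yield
    highly inconsistent model responses (divergence above threshold)."""
    variants = mutate(prompt)
    responses = [query_model(v) for v in variants]
    return response_divergence(responses) > threshold


# Example usage (query_model wraps your own model call):
# is_attack = detect_attack(user_input, lambda p: my_llm_generate(p))
```

The threshold and mutation rate here are arbitrary placeholders; in practice they would be tuned on benign traffic so that normal prompts, whose responses remain stable under small perturbations, fall below the flagging threshold.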