
JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks

Open · ramimac opened this issue on Jun 19, 2024 · 0 comments

https://arxiv.org/pdf/2312.10766

We propose JailGuard, a universal detection framework for jailbreaking and hijacking attacks across LLMs and MLLMs. JailGuard operates on the principle that attack inputs are inherently less robust than benign inputs, regardless of attack method or modality. Specifically, JailGuard mutates untrusted inputs to generate variants and leverages the discrepancy among the variants' responses from the model to distinguish attack samples from benign samples.
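A minimal sketch of that idea, assuming a text-only input, a single token-dropping mutator, and a simple pairwise-similarity divergence in place of the paper's richer mutator set and detection metric; `query_model`, `drop_rate`, and the detection `threshold` are illustrative placeholders, not JailGuard's actual implementation:

```python
import difflib
import random
from typing import Callable, List


def mutate_prompt(prompt: str, num_variants: int = 8,
                  drop_rate: float = 0.15, seed: int = 0) -> List[str]:
    """Generate perturbed variants of an untrusted prompt by randomly
    dropping a fraction of its tokens (one possible mutator; the paper
    describes a larger set spanning text and image modalities)."""
    rng = random.Random(seed)
    tokens = prompt.split()
    variants = []
    for _ in range(num_variants):
        kept = [t for t in tokens if rng.random() > drop_rate] or tokens
        variants.append(" ".join(kept))
    return variants


def response_divergence(responses: List[str]) -> float:
    """Average pairwise dissimilarity of the model's responses to the
    variants. The intuition: attack inputs are brittle, so small input
    perturbations flip the model between refusal and compliance and the
    responses diverge more than they do for benign inputs."""
    if len(responses) < 2:
        return 0.0
    dissims = []
    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            sim = difflib.SequenceMatcher(None, responses[i], responses[j]).ratio()
            dissims.append(1.0 - sim)
    return sum(dissims) / len(dissims)


def is_attack(prompt: str, query_model: Callable[[str], str],
              threshold: float = 0.5) -> bool:
    """Flag the prompt as an attack if its variants' responses diverge
    beyond `threshold` (a value that would need tuning per model)."""
    responses = [query_model(v) for v in mutate_prompt(prompt)]
    return response_divergence(responses) > threshold
```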
