
probe: content compliance

Open leondz opened this issue 9 months ago • 2 comments

Consider implementing https://msrc.microsoft.com/blog/2025/03/jailbreaking-is-mostly-simpler-than-you-think/

leondz avatar Apr 09 '25 08:04 leondz

This looks like a really worthwhile addition to the probe stack. A few questions on implementation:

  • Would this all be handled in a single probe, or would per-task-type probes and detectors be needed? E.g. a self-harm probe with a self-harm detector.
  • Could we have per-task-type probes but a single compliance detector (e.g. LLM-based)?

mrowebot avatar Apr 18 '25 22:04 mrowebot

> Would this all be handled in a single probe, or would per-task-type probes and detectors be needed? E.g. a self-harm probe with a self-harm detector.

Typically a probe is responsible for all interaction with the target. Sometimes that interaction needs information from detectors; we're calling these adaptive probes. We don't yet have a solid pattern for them, but there's an example in the topic module, which uses tree search.
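To make the adaptive-probe idea concrete, here is a minimal sketch in plain Python. It is not garak code and does not reproduce the tree search in the topic module: every name (`adaptive_probe`, `toy_target`, `toy_detector`, `toy_rephrase`) is hypothetical, and a simple greedy retry loop stands in for a real search strategy. It only illustrates the probe owning the conversation with the target while consulting a detector score between turns.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration only; none of these are garak APIs.
Target = Callable[[str], str]       # prompt -> model response
Detector = Callable[[str], float]   # response -> score, 1.0 means "hit"


def adaptive_probe(
    target: Target,
    detector: Detector,
    seed_prompts: List[str],
    rephrase: Callable[[str], str],
    max_turns: int = 3,
) -> List[Tuple[str, str, float]]:
    """Send each seed prompt, then retry with a rephrasing while the detector reports no hit."""
    attempts = []
    for seed in seed_prompts:
        current = seed
        for _ in range(max_turns):
            response = target(current)
            score = detector(response)
            attempts.append((current, response, score))
            if score >= 0.5:
                break  # detector reports a hit, stop adapting this seed
            current = rephrase(current)
    return attempts


# Toy stand-ins so the sketch runs without any model.
def toy_target(prompt: str) -> str:
    return "Sure, here you go." if "please" in prompt else "I can't help with that."


def toy_detector(response: str) -> float:
    return 1.0 if response.startswith("Sure") else 0.0


def toy_rephrase(prompt: str) -> str:
    return prompt + " please"


attempts = adaptive_probe(toy_target, toy_detector, ["do X"], toy_rephrase)
```

With the toy stand-ins, the first turn is refused, the rephrased prompt gets through, and the loop stops there.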

> Could we have per-task-type probes but a single compliance detector (e.g. LLM-based)?

Yes! Exactly. It's coming. This is the way.
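As a rough sketch of that split (several per-task-type probes, one shared compliance detector), assuming nothing about garak's actual classes: `TaskProbe`, `compliance_detector`, and `run_probes` are hypothetical names, and the keyword check is only a stand-in for the LLM-based judge discussed above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TaskProbe:
    """Hypothetical per-task-type probe: a task label plus its prompts."""
    task_type: str
    prompts: List[str]


# Stand-in for an LLM-based judge: treats anything that is not an obvious refusal as compliance.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def compliance_detector(response: str) -> float:
    lowered = response.lower()
    return 0.0 if any(marker in lowered for marker in REFUSAL_MARKERS) else 1.0


def run_probes(
    probes: List[TaskProbe],
    target: Callable[[str], str],
    detector: Callable[[str], float],
) -> Dict[str, float]:
    """Return the compliance rate per task type, using one detector for every probe."""
    rates = {}
    for probe in probes:
        scores = [detector(target(prompt)) for prompt in probe.prompts]
        rates[probe.task_type] = sum(scores) / len(scores) if scores else 0.0
    return rates


# Toy target: refuses anything tagged "[task_a]", complies otherwise.
def toy_target(prompt: str) -> str:
    return "I'm sorry, I can't do that." if "[task_a]" in prompt else "Sure, here is the answer."


probes = [
    TaskProbe("task_a", ["[task_a] request 1", "[task_a] request 2"]),
    TaskProbe("task_b", ["[task_b] request 1", "[task_b] request 2"]),
]
rates = run_probes(probes, toy_target, compliance_detector)
```

The detector never needs to know which task type produced a response, so adding a new task type means adding prompts, not a new detector.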

leondz avatar Apr 23 '25 06:04 leondz