garak probe: content compliance

Consider implementing https://msrc.microsoft.com/blog/2025/03/jailbreaking-is-mostly-simpler-than-you-think/

Apr 09 '25 08:04 leondz

This looks like a really worthwhile addition to the probe stack. A few questions on implementation:

Would this all be handled in a single probe, or would per-task-type probes and detectors be needed? E.g. a self-harm probe with a self-harm detector.
Could we have per-task-type probes but a single compliance detector (i.e. LLM-based)?

Apr 18 '25 22:04 mrowebot

Would this all be handled in a single probe, or would per-task-type probes and detectors be needed? E.g. a self-harm probe with a self-harm detector.

Typically a probe has responsibility for all interaction with the target. Sometimes this might require information from detectors; we're calling these adaptive probes. We don't yet have a solid pattern for these, but there's an example in the topic module, which uses tree search.
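
To make the adaptive idea concrete, here is a rough sketch of a probe that consults a detector to decide what to send next. This is not garak's API and not how the topic module is written: `adaptive_probe`, `Target`, `Detector` and `expand` are all hypothetical names, and the breadth-first walk is just one assumed way to do a tree search.

```python
from typing import Callable, List

# All names here are hypothetical; none of this is garak's API.
Target = Callable[[str], str]      # prompt -> model response
Detector = Callable[[str], float]  # response -> score in [0, 1], 1.0 = hit


def adaptive_probe(
    target: Target,
    detector: Detector,
    seed_prompts: List[str],
    expand: Callable[[str], List[str]],
    max_queries: int = 20,
) -> List[str]:
    """Walk a prompt tree breadth-first, expanding only prompts the detector flags."""
    queue = list(seed_prompts)
    seen = set(queue)
    hits: List[str] = []
    queries = 0
    while queue and queries < max_queries:
        prompt = queue.pop(0)
        response = target(prompt)  # the probe owns all interaction with the target
        queries += 1
        if detector(response) >= 0.5:
            hits.append(prompt)
            # Detector feedback decides which branches get explored next.
            for child in expand(prompt):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
    return hits
```

The point of the sketch is only the control flow: the probe still drives every call to the target, and the detector score is used as a signal for where to search, not as the final evaluation.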

Could we have per-task-type probes but a single compliance detector (i.e. LLM-based)?

Yes! Exactly. It's coming. This is the way.
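
As a rough illustration of that split, the sketch below pairs several per-task-type probes with one shared compliance detector. Everything in it is an assumption for discussion, not garak code: `TaskProbe`, `ComplianceDetector`, `keyword_judge` and `evaluate` are hypothetical names, and the keyword check merely stands in for where an LLM-based judge would go.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TaskProbe:
    """Hypothetical per-task-type probe: holds prompts for one task category."""
    task_type: str
    prompts: List[str]

    def run(self, target: Callable[[str], str]) -> List[str]:
        # Send every prompt to the target and collect the raw outputs.
        return [target(p) for p in self.prompts]


class ComplianceDetector:
    """Hypothetical shared detector: scores 1.0 when the target complied."""

    def __init__(self, judge: Callable[[str], bool]):
        self.judge = judge  # assumed to be an LLM-based judge in practice

    def detect(self, outputs: List[str]) -> List[float]:
        return [1.0 if self.judge(o) else 0.0 for o in outputs]


def keyword_judge(output: str) -> bool:
    """Stand-in for an LLM judge: anything that does not open with a refusal counts as compliance."""
    refusals = ("i can't", "i cannot", "i won't")
    return not output.lower().startswith(refusals)


def evaluate(
    probes: List[TaskProbe],
    target: Callable[[str], str],
    detector: ComplianceDetector,
) -> Dict[str, float]:
    """Return the compliance rate per task type, using one detector for all probes."""
    results: Dict[str, float] = {}
    for probe in probes:
        scores = detector.detect(probe.run(target))
        results[probe.task_type] = sum(scores) / len(scores) if scores else 0.0
    return results
```

With this shape, adding a new task type means adding prompts only; the compliance judgement stays in one place.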

Apr 23 '25 06:04 leondz