probe: content compliance
Consider implementing https://msrc.microsoft.com/blog/2025/03/jailbreaking-is-mostly-simpler-than-you-think/
This looks like a really worthwhile addition to the probe stack. A few questions on implementation:
- Would this all be handled in a single probe, or would per-task-type probes and detectors be needed? E.g. a self-harm probe with a self-harm detector
- Could we have per-task-type probes but a single compliance detector (i.e. LLM-based)?
> Would this all be handled in a single probe, or would per-task-type probes and detectors be needed? E.g. a self-harm probe with a self-harm detector
Typically a probe is responsible for all interaction with the target. Sometimes this requires information from detectors; we're calling these adaptive probes. We don't yet have a solid pattern for these, but there's an example in the topic module, which uses tree search.
> Could we have per-task-type probes but a single compliance detector (i.e. LLM-based)?
Yes! Exactly. It's coming. This is the way.
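
For discussion, here is a rough sketch of what "per-task-type probes, single compliance detector" could look like. Everything in it is hypothetical: `Attempt`, `TaskProbe`, `ComplianceDetector`, `keyword_refusal_judge`, and `toy_target` are illustrative names, not the project's actual classes, and the keyword judge is only a stand-in for the LLM-based judge mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, List

# All names here are hypothetical and for illustration only.


@dataclass
class Attempt:
    task_type: str
    prompt: str
    output: str = ""


class TaskProbe:
    """One probe per task type; it owns all interaction with the target."""

    def __init__(self, task_type: str, prompts: List[str]):
        self.task_type = task_type
        self.prompts = prompts

    def run(self, target: Callable[[str], str]) -> List[Attempt]:
        # Send every prompt to the target and record the response.
        return [Attempt(self.task_type, p, target(p)) for p in self.prompts]


class ComplianceDetector:
    """A single detector shared by every task-type probe.

    The judge is pluggable: an LLM-based judge could be passed in
    instead of the keyword stand-in used below.
    """

    def __init__(self, judge: Callable[[str, str], bool]):
        self.judge = judge

    def detect(self, attempts: List[Attempt]) -> List[float]:
        # 1.0 means the target complied with the request, 0.0 means it did not.
        return [1.0 if self.judge(a.prompt, a.output) else 0.0 for a in attempts]


def keyword_refusal_judge(prompt: str, output: str) -> bool:
    """Stand-in judge: compliant unless a refusal phrase appears in the output."""
    refusals = ("i can't", "i cannot", "i won't")
    return not any(r in output.lower() for r in refusals)


def toy_target(prompt: str) -> str:
    """Toy target that refuses only when the prompt contains 'refuse'."""
    if "refuse" in prompt:
        return "I cannot help with that."
    return "Sure, here is how..."


if __name__ == "__main__":
    probes = [
        TaskProbe("task_a", ["please refuse this", "placeholder request"]),
        TaskProbe("task_b", ["another placeholder request"]),
    ]
    detector = ComplianceDetector(keyword_refusal_judge)
    for probe in probes:
        scores = detector.detect(probe.run(toy_target))
        print(probe.task_type, scores)
```

The point of the split is that adding a new task type only means adding prompts, while the compliance judgement lives in one place and can be swapped out without touching the probes.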