probe: doctor attack

Open leondz opened this issue 8 months ago • 2 comments

Implementation of doctor / puppetry attack from https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms/

Also adds module for encoding funcs req'd by more than one plugin

NB this highlights missing intents & techniques functionality

Verification

  • [ ] garak -m test -p doctor,encoding.InjectLeet

leondz avatar Apr 25 '25 09:04 leondz

I feel like this could be included under dan. It's very DAN-like in its approach and IMO, not worthy of a distinct probe category.

erickgalinkin avatar Apr 29 '25 21:04 erickgalinkin

I think I disagree with the DAN idea. If we continue this reasoning a little, DAN really should go in jailbreaks, and jailbreaks is not a meaningful category (too vague). We know our focus is on techniques - this looks like a role-playing attack and, of those, a specific coherent subgroup.

If one wants to get reporting on all role play type attacks, we have taxonomies and grouping in report_digest to enable that.

I could get behind a rename from doctor to house.

leondz avatar Apr 30 '25 05:04 leondz

> I think I disagree with the DAN idea. If we continue this reasoning a little, DAN really should go in jailbreaks, and jailbreaks is not a meaningful category (too vague). We know our focus is on techniques - this looks like a role-playing attack and, of those, a specific coherent subgroup.
>
> If one wants to get reporting on all role play type attacks, we have taxonomies and grouping in report_digest to enable that.
>
> I could get behind a rename from doctor to house.

Hi, similar to this -- I'm interested in submitting another broadly-successful roleplay prompt, but based on the comments here, I can't tell if the project is interested in that type of PR (because of perceived noise, overlap, or something else).

What is the latest guidance on where these "jailbreak" or "role playing" prompts should live?

Intuitively, it seems like a dimension that would be reported across the model(s) scanned -- i.e., the model responded correctly to a probe under normal conditions, but responded with prohibited content only when the context was prefixed with XYZ roleplay prompt.

I thought of this as a "buff" use case, but I'm probably misunderstanding the philosophy. Thanks!

cktlco avatar May 27 '25 06:05 cktlco

@erickgalinkin

> I feel like this could be included under dan. It's very DAN-like in its approach and IMO, not worthy of a distinct probe category.

@cktlco

> What is the latest guidance on where these "jailbreak" or "role playing" prompts should live?
>
> Intuitively, it seems like a dimension that would be reported across the model(s) scanned -- i.e., the model responded correctly to a probe under normal conditions, but responded with prohibited content only when the context was prefixed with XYZ roleplay prompt.

Unpacking this:

  0. what's a "probe category"? Well...
  1. we have a code structure where modules group probe classes by "theme"
  2. there are a variety of taxonomies that group individual probe classes (e.g. OWASP Top10 LLM 2023)
  3. there is a typology of intents
  4. there are a few categorisations of strategy/technique/tactic

Thinking out "loud":

  • a. Future work is going to parameterise (3) & (4) and make using them really flexible. We almost certainly don't want to arrange our code by both of them; that has to be dynamic.
  • b. If we structure our code, (1), by any of the other dimensions, we lose a dimension of expressiveness
  • c. Reporting organised by code doesn't make a huge amount of sense to me - I think human consumers are more interested in techniques & intents, (3) and (4), than in how the code is organised
  • d. It might be intuitive to name code after technique, but this creates ties that bind us quite awkwardly, I think:
    • i. The way we conceive of technique categories has changed over time and probably will continue to do so. Making class names match the categorisation-of-the-day means churn in names & name paths, creating extra work for us (writing fixers) and making configs rot faster
    • ii. We use filesystems to store code, which generally support only tree-structured paths. However, this is too constrained for the technique description, because a probe class can use/implement >1 technique. While symlinks do us all the wonderful courtesy of simply existing, and git supports them natively, Windows also exists, and anyway, this gives us a matching & resolution problem (if a file with the same name can be accessed via two different directories/techniques, is it the same file, just implementing both techniques, or two separate files?)
  • e. We probably want to move away from using the code structure in reporting. Reporting post-Technique & Intent is, I think, going to make this really clear. Doing so reduces the impact & constraint of file organisation choices, uh, meaning fewer discussions like this, I hope
  • f. I don't have a strong concept of what justifies a distinct new file/module in probes. Implementing (e.) means we can skip addressing the task of defining this, for which I suspect there are many "OK" answers, perhaps a global optimum, but no answers that everybody loves. I prefer it slightly scrappy, so we don't debate it - and also something other than (3) or (4), so that we don't lose expressiveness.

The above is why I prefer decoupling the naming of code structure (i.e. detector & probe module and class names) from how we formally conceive of probes.
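To make the decoupling above concrete, here is a minimal sketch, assuming each probe carries a set of technique labels that is independent of its module path. Everything in it is hypothetical: `ProbeInfo`, `group_by_technique`, the `ExampleProbe` class names and the technique labels are made up for illustration and are not garak's API or its taxonomy.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeInfo:
    """Hypothetical record, not a garak class."""

    path: str  # "module.Class" name, fixed by where the code lives
    techniques: frozenset  # labels from an assumed technique taxonomy


def group_by_technique(probes):
    """Map each technique label to the sorted probe paths that use it."""
    groups = defaultdict(list)
    for probe in probes:
        for technique in probe.techniques:
            groups[technique].append(probe.path)
    return {label: sorted(paths) for label, paths in groups.items()}


# Illustrative entries only: class names and technique labels are made up.
PROBES = [
    ProbeInfo("doctor.ExampleProbe", frozenset({"roleplay", "policy-spoofing"})),
    ProbeInfo("grandma.ExampleProbe", frozenset({"roleplay"})),
    ProbeInfo("dan.ExampleProbe", frozenset({"roleplay", "persona-override"})),
]

report_groups = group_by_technique(PROBES)
print(report_groups["roleplay"])
# ['dan.ExampleProbe', 'doctor.ExampleProbe', 'grandma.ExampleProbe']
```

In this sketch a probe can sit under several techniques at once without being moved or symlinked, and changing the technique labels never touches the module path that configs refer to.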

post-script: Adding a technique taxonomy has been on the backburner for a while - this & grandma should be grouped there, without us having to move files around. grandma's naming is super similar and vaguely intuitive - otoh, a whole module for a roleplayed character feels unsustainable, agree.

Could we merge as-is and expect that what we learn from Technique & Intent lends good clarity over this, perhaps even insights into (f)?

leondz avatar Jun 27 '25 12:06 leondz

@cktlco

> I thought of this as a "buff" use case, but I'm probably misunderstanding the philosophy. Thanks!

No, that's OK, this is reasonable and a result of how probes in garak currently combine both "how we make the target fail" (technique) and "what we make the target do" (intent). The intent is to separate these two things out clearly.
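As a rough illustration of that separation, the sketch below treats a technique as a function that wraps a request and an intent as the request itself, then builds every combination. All names here (`INTENTS`, `TECHNIQUES`, `baseline`, `roleplay`, `build_prompts`) and the placeholder strings are assumptions for the example; none of them come from garak.

```python
from itertools import product

# Hypothetical intents ("what we make the target do"); placeholders only.
INTENTS = {
    "intent_a": "<request A>",
    "intent_b": "<request B>",
}


def baseline(request):
    """No technique applied: send the request as-is."""
    return request


def roleplay(request):
    """Made-up role-play framing wrapped around the request."""
    return f"Stay in character as a TV doctor for this scene. {request}"


# Hypothetical techniques ("how we make the target fail").
TECHNIQUES = {"baseline": baseline, "roleplay": roleplay}


def build_prompts(techniques, intents):
    """Return {(technique, intent): prompt} for every combination."""
    return {
        (t_name, i_name): wrap(text)
        for (t_name, wrap), (i_name, text) in product(
            techniques.items(), intents.items()
        )
    }


prompts = build_prompts(TECHNIQUES, INTENTS)
print(len(prompts))  # 4
```

Keying results by `(technique, intent)` would let a report compare the `baseline` and `roleplay` outcomes for the same intent, which is close to the per-dimension view asked about earlier in the thread.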

In the future, buffs have a good chance of mostly focusing on data augmentation, which has a bit of overlap with technique, but shouldn't do the same thing.

None of this was clear when LLM Security first started as a field, hence older structures overlapping these concepts a bit.

leondz avatar Jun 27 '25 12:06 leondz

I'm OK with merging this. We can figure out taxonomy and stuff as we move forward.

erickgalinkin avatar Jul 02 '25 15:07 erickgalinkin