feature: make llmaaj prompts switchable

Open leondz opened this issue 5 months ago • 2 comments

Existing LLMaaJ prompts can be defined in code. Currently a prompt defined in garak/resources/red_team/system_prompts.py is consumed in multiple places. Probes / detectors should have better control over the LLMaaJ prompt used. For example, the change in #1083 introduces a longer prompt; changes in prompt length affect which models are able to process it.

This behaviour is more flexible if the LLMaaJ prompt(s) are treated as data, e.g. using the payload mechanism.

Proposal:

  1. Introduce a payload of LLMaaJ prompts for rating items on a 1-5 scale (i.e. all having the same output), containing the current and prior prompt in garak.resources.red_team.system_prompts.judge_system_prompt()
  2. Alter garak.resources.red_team.system_prompts.judge_system_prompt() to take an index into that payload as a parameter, defaulting to the long version (this is a bit fragile, input welcome)
  3. Set garak.resource.tap to use the shorter, prior system prompt by default (for reproducibility)
  4. Move the other methods in garak.resources.red_team.system_prompts to use a similar paradigm for payloads
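
Steps 1 to 3 could be sketched roughly as below. This is a minimal, self-contained illustration and not garak's actual code: `JUDGE_PROMPTS`, `DEFAULT_JUDGE_PROMPT_INDEX` and `tap_judge_system_prompt` are hypothetical names, the prompt texts are placeholders, and a plain list stands in for the payload mechanism.

```python
# Hypothetical stand-in for a payload of 1-5 rating judge prompts.
# The texts are placeholders, not garak's real prompts.
JUDGE_PROMPTS = [
    "SHORT placeholder: rate the response on a scale of 1 to 5.",
    "LONG placeholder: rate the response on a scale of 1 to 5, "
    "following the detailed guidance that would appear here.",
]

# Step 2: default to the longer (current) prompt.
DEFAULT_JUDGE_PROMPT_INDEX = 1


def judge_system_prompt(index: int = DEFAULT_JUDGE_PROMPT_INDEX) -> str:
    """Return the judge prompt stored at `index` in the payload."""
    # Reject negative values explicitly so that -1 does not silently
    # select the last entry via Python's negative indexing.
    if not 0 <= index < len(JUDGE_PROMPTS):
        raise ValueError(
            f"judge prompt index {index} out of range 0..{len(JUDGE_PROMPTS) - 1}"
        )
    return JUDGE_PROMPTS[index]


def tap_judge_system_prompt() -> str:
    """Step 3: a TAP-style caller pins the shorter, prior prompt."""
    return judge_system_prompt(0)
```

The fragility mentioned in step 2 shows up here: inserting or reordering entries in the list silently changes what a stored index refers to.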

leondz, Jun 30 '25 09:06

Items to consider that may impact implementation.

The judge detector currently has undocumented support for an override of the judge system prompt by providing system_prompt_judge in configuration params. This could be revised to load the prompt from a file in the data_path as well.
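
One possible precedence for such an override is sketched below, purely as an assumption about how it might work: an inline `system_prompt_judge` value wins, then a file under the data directory, then the built-in default. The function name `resolve_judge_prompt`, the file name `judge_system_prompt.txt` and the `data_dir` argument are hypothetical and do not describe garak's existing behaviour.

```python
from pathlib import Path
from typing import Mapping


def resolve_judge_prompt(
    params: Mapping[str, str],
    data_dir: Path,
    default_prompt: str,
    filename: str = "judge_system_prompt.txt",  # hypothetical file name
) -> str:
    """Pick the judge system prompt: inline param, then file, then default."""
    # 1. An explicit value in the configuration params takes priority.
    inline = params.get("system_prompt_judge")
    if inline:
        return inline
    # 2. Otherwise fall back to a prompt file in the data directory, if present.
    candidate = data_dir / filename
    if candidate.is_file():
        return candidate.read_text(encoding="utf-8").strip()
    # 3. Finally use the built-in default.
    return default_prompt
```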

However, in this configuration it is difficult to supply a prompt tailored to the goal of the attempt, which is often context that has to come from the probe whose detections are being evaluated. The default system prompt for detectors.judge.ModelAsJudge uses format injection of a {goal}, which at this time is retrieved from a class-level default or extracted from the probe. Shifting to a goal provided by the attempt ties into other ideas around decomposing the prompts in attempts into intents & techniques for more pluggable, dynamic, and targeted prompt generation.
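
To make the {goal} injection concrete, here is a small sketch assuming a lookup order of attempt, then probe, then class-level default. The template text and the names `JUDGE_TEMPLATE`, `DEFAULT_GOAL`, `resolve_goal` and `build_judge_prompt` are hypothetical; the attempt-level goal in particular is the proposed shift, not something the detector does today.

```python
from typing import Optional

# Placeholder values; not garak's real template or default goal.
DEFAULT_GOAL = "placeholder class-level goal"
JUDGE_TEMPLATE = (
    "You are a judge. Rate from 1 to 5 how far the response "
    "achieves this goal: {goal}"
)


def resolve_goal(
    attempt_goal: Optional[str] = None,
    probe_goal: Optional[str] = None,
    default_goal: str = DEFAULT_GOAL,
) -> str:
    """Prefer the attempt's goal, then the probe's, then the default."""
    for candidate in (attempt_goal, probe_goal):
        if candidate:
            return candidate
    return default_goal


def build_judge_prompt(template: str, goal: str) -> str:
    """Inject the goal into the template's {goal} placeholder."""
    return template.format(goal=goal)
```

For example, `build_judge_prompt(JUDGE_TEMPLATE, resolve_goal(probe_goal="elicit X"))` fills the placeholder with the probe's goal when the attempt carries none. Note that `str.format` requires any other literal braces in a template to be doubled, which matters if user-supplied prompt files are allowed.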

Further, red_team.system_prompts.judge_system_prompt() would often also need support for injecting a goal.

While the original proposal adds some flexibility, index-based selection may be problematic. Also consider that this level of customization likely needs to be logged in report.jsonl to definitively document the prompt used when judging.
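
As one way to address both points, the sketch below selects prompts by name instead of by index and writes a JSON line recording which prompt was used. Everything here is an assumption for illustration: the record fields (`entry_type`, `prompt_name`, `prompt_sha256`, `prompt_text`) are not garak's report.jsonl schema, the prompt texts are placeholders, and an in-memory buffer stands in for the report file.

```python
import hashlib
import io
import json
from typing import TextIO

# Hypothetical named payload; texts are placeholders.
JUDGE_PROMPTS = {
    "short": "SHORT placeholder: rate the response from 1 to 5.",
    "long": "LONG placeholder: rate the response from 1 to 5 with guidance.",
}


def select_judge_prompt(name: str) -> str:
    """Look up a judge prompt by a stable name rather than a list index."""
    try:
        return JUDGE_PROMPTS[name]
    except KeyError:
        known = sorted(JUDGE_PROMPTS)
        raise ValueError(f"unknown judge prompt {name!r}; known: {known}") from None


def log_judge_prompt(report: TextIO, name: str) -> dict:
    """Append one JSON line documenting the judge prompt in use."""
    prompt = select_judge_prompt(name)
    record = {
        "entry_type": "judge_prompt",  # hypothetical field names throughout
        "prompt_name": name,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_text": prompt,
    }
    report.write(json.dumps(record) + "\n")
    return record


# Usage: an in-memory buffer stands in for the report file.
report = io.StringIO()
record = log_judge_prompt(report, "short")
```

Storing a hash alongside the text lets two runs be compared cheaply even if a named prompt's wording changes later.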

jmartin-tech, Jul 01 '25 20:07

This issue has been automatically marked as stale because it has not had recent activity. If you are still interested in this issue, please respond to keep it open. Thank you!

github-actions[bot], Sep 30 '25 00:09