[Proof of Concept] Introducing AttackConfigurations

Open imranbohoran opened this issue 9 months ago • 2 comments

The scope of this PR

This PR has a single commit that demonstrates how the AttackConfiguration can be introduced, with the aim of getting feedback on the general approach. No tests have been added or updated at this point, and only the CrescendoOrchestrator and PromptSendingOrchestrator have been changed for the demonstration. These 2 orchestrators were chosen to give an end-to-end slice of the different variations of the attacks. This change doesn't try to address every aspect of the proposal and only looks at the introduction of the AttackConfiguration (i.e. removing orchestrator_identifier and labels is not included).

Changes

The following is a summary of the main areas of change.

Changes to Memory models and operations

  • Memory models:
    • AttackConfigurationEntry
      • This is the new memory model for the attack configuration concept.
      • In addition to the fields proposed in the issue, the following fields were introduced:
        • attack_result: This is intended to capture the overall result of the attack. For multi-turn attacks, where there is an objective, this will indicate whether the objective was achieved. For single-turn attacks, this will hold all the score values of the scorers used. Prior to this change, the result of a multi-turn attack is not persisted anywhere; storing it as part of the AttackConfiguration lets us retrieve the result for reporting and off-line analysis.
        • start_time and end_time: These were added to record the start and end times of an attack. This information is not available right now, and the AttackConfiguration gives us a sensible place to capture it.
    • PromptMemoryEntry
      • This now has a new attack_configuration_id column with a foreign-key reference to the AttackConfigurations.
      • There is also a Mapped relationship to the AttackConfiguration. Given that this is a single object and is useful to view in the context of the PromptEntry, the mapping was created with lazy="joined", so the AttackConfiguration is loaded in the same query as the PromptEntry (via a JOIN) rather than on first access.
    • Memory Interface
      • Convenience functions to add and update the AttackConfigurations. These utilise the existing _insert_entry and _update_entries functions.
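
To make the relationship above concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than the project's actual memory classes. Only attack_result, start_time, end_time and attack_configuration_id come from the description above; the table names, the other columns (objective, role, content) and all function names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE attack_configurations (
        id            TEXT PRIMARY KEY,
        objective     TEXT,   -- assumed column
        attack_result TEXT,
        start_time    TEXT,
        end_time      TEXT
    );
    CREATE TABLE prompt_entries (
        id                      TEXT PRIMARY KEY,
        role                    TEXT,   -- assumed column
        content                 TEXT,   -- assumed column
        attack_configuration_id TEXT REFERENCES attack_configurations(id)
    );
    """
)


def add_attack_configuration(config_id, objective, start_time):
    """Hypothetical 'add' helper: insert the configuration when the attack starts."""
    conn.execute(
        "INSERT INTO attack_configurations (id, objective, start_time) VALUES (?, ?, ?)",
        (config_id, objective, start_time),
    )


def update_attack_configuration(config_id, attack_result, end_time):
    """Hypothetical 'update' helper: record the result and end time afterwards."""
    conn.execute(
        "UPDATE attack_configurations SET attack_result = ?, end_time = ? WHERE id = ?",
        (attack_result, end_time, config_id),
    )


def add_prompt_entry(entry_id, role, content, config_id):
    """Insert a prompt entry that points at its attack configuration."""
    conn.execute(
        "INSERT INTO prompt_entries (id, role, content, attack_configuration_id) "
        "VALUES (?, ?, ?, ?)",
        (entry_id, role, content, config_id),
    )


def prompts_with_configuration(config_id):
    """Fetch every prompt of one attack together with its configuration in one JOIN."""
    return conn.execute(
        """
        SELECT p.role, p.content, c.objective, c.attack_result
        FROM prompt_entries AS p
        JOIN attack_configurations AS c ON c.id = p.attack_configuration_id
        WHERE c.id = ?
        ORDER BY p.id
        """,
        (config_id,),
    ).fetchall()


add_attack_configuration("cfg-1", "example objective", "2025-03-06T19:00:00")
add_prompt_entry("p1", "system", "system prompt", "cfg-1")
add_prompt_entry("p2", "user", "objective prompt", "cfg-1")
update_attack_configuration("cfg-1", "achieved", "2025-03-06T19:05:00")
trace = prompts_with_configuration("cfg-1")
```

The JOIN in prompts_with_configuration is roughly what a joined relationship load does in the ORM: each prompt row arrives together with its configuration, without a second query.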

Model changes

  • AttackConfiguration
    • Represents the application model of the AttackConfigurationEntry
  • PromptRequestPiece
    • Includes an AttackConfiguration to reflect the model changes
  • SeedPrompt
    • Includes an AttackConfiguration that is used to pass relevant attack configuration per prompt. Given that single-turn attacks can have multiple prompts provided as part of the orchestrator, and given that these prompts can represent varying objectives, including the AttackConfiguration at the SeedPrompt level seemed a reasonable way to carry the AttackConfiguration when creating PromptEntries.
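
As a rough illustration of how the configuration could travel from a SeedPrompt to the pieces created from it, the sketch below uses plain dataclasses. These are simplified stand-ins, not the real PyRIT classes: the field names other than attack_result, start_time and end_time, and the helper pieces_from_seed_prompt, are assumptions.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class AttackConfiguration:
    objective: str  # assumed field
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    start_time: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    end_time: Optional[datetime] = None
    attack_result: Optional[str] = None


@dataclass
class SeedPrompt:
    value: str
    # Carried per prompt, because each prompt may represent a different objective.
    attack_configuration: Optional[AttackConfiguration] = None


@dataclass
class PromptRequestPiece:
    role: str
    original_value: str
    attack_configuration: Optional[AttackConfiguration] = None


def pieces_from_seed_prompt(
    seed: SeedPrompt, system_prompt: Optional[str] = None
) -> List[PromptRequestPiece]:
    """Hypothetical helper: every piece created for a seed inherits its configuration."""
    pieces = []
    if system_prompt is not None:
        pieces.append(PromptRequestPiece("system", system_prompt, seed.attack_configuration))
    pieces.append(PromptRequestPiece("user", seed.value, seed.attack_configuration))
    return pieces


seed_a = SeedPrompt("prompt A", AttackConfiguration(objective="objective A"))
seed_b = SeedPrompt("prompt B", AttackConfiguration(objective="objective B"))
pieces_a = pieces_from_seed_prompt(seed_a, system_prompt="You are a helpful assistant.")
pieces_b = pieces_from_seed_prompt(seed_b)
```

Here the system prompt and the user prompt for seed_a share one configuration, while seed_b keeps its own, which is the per-prompt separation described for single-turn attacks.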

Orchestrator changes

  • MultiTurn Orchestrator
    • The run_attacks_async function wraps the execution of run_attack_async (through the limited_run_attack function) with an AttackConfiguration creation and update step. This lets all multi-turn orchestrators create and update the AttackConfiguration without changing their underlying implementation.
    • However, this poses an issue: the run_attack_async function now expects an AttackConfiguration to be passed in. As this is an exposed API, I don't think that's something a user should have to provide. Is there a need for 2 functions to run attacks for multi-turn orchestrators, or is this an opportunity to expose only one (i.e. making run_attack_async the protected abstract function and run_attacks_async the function to run attacks)? Our observation has been that run_attack_async is only called within the MultiTurnOrchestrator, some tests and docs.
  • SingleTurn Orchestrator
    • Given that multiple prompts can be sent to this orchestrator, each prompt is considered an attack with its own objective.
    • An AttackConfiguration is created in the send_prompt_async function and is used when creating a NormalizerRequest (to include it in the SeedPrompt).
    • The AttackConfiguration is then passed through the SeedPrompt for when prompts are created (including system prompts).
    • The update to the AttackConfiguration (to add the attack_result and end_time) is done in the send_normalizer_requests_async function (which seemed to be used only in the PromptSendingOrchestrator and some docs).
    • The creation and the update happen in 2 different functions because this orchestrator can be given multiple prompts, and the prompts are passed as a list to the upstream functions.
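
The single-entry-point option raised for multi-turn orchestrators could look roughly like the sketch below. It is not the PR's code: the class names, the protected _run_attack_async signature, the string values used for attack_result and the in-memory saved_configurations list (standing in for the memory interface) are all assumptions.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class AttackConfiguration:
    objective: str  # assumed field
    start_time: datetime
    end_time: Optional[datetime] = None
    attack_result: Optional[str] = None


class MultiTurnOrchestratorSketch(ABC):
    def __init__(self, max_concurrency: int = 2) -> None:
        self._max_concurrency = max_concurrency
        # Stand-in for persisting configurations through the memory interface.
        self.saved_configurations: List[AttackConfiguration] = []

    @abstractmethod
    async def _run_attack_async(
        self, *, objective: str, attack_configuration: AttackConfiguration
    ) -> bool:
        """Protected hook implemented by subclasses; returns True if the objective was achieved."""
        ...

    async def run_attacks_async(self, *, objectives: List[str]) -> List[AttackConfiguration]:
        """Only public entry point: callers never build an AttackConfiguration themselves."""
        semaphore = asyncio.Semaphore(self._max_concurrency)

        async def limited_run_attack(objective: str) -> AttackConfiguration:
            async with semaphore:
                # Creation step, before the attack runs.
                config = AttackConfiguration(
                    objective=objective, start_time=datetime.now(timezone.utc)
                )
                self.saved_configurations.append(config)
                achieved = await self._run_attack_async(
                    objective=objective, attack_configuration=config
                )
                # Update step, after the attack finishes.
                config.attack_result = "achieved" if achieved else "not achieved"
                config.end_time = datetime.now(timezone.utc)
                return config

        return list(await asyncio.gather(*(limited_run_attack(o) for o in objectives)))


class ToyOrchestrator(MultiTurnOrchestratorSketch):
    async def _run_attack_async(self, *, objective, attack_configuration):
        await asyncio.sleep(0)  # placeholder for the real multi-turn conversation
        return "easy" in objective


results = asyncio.run(
    ToyOrchestrator().run_attacks_async(objectives=["easy objective", "hard objective"])
)
```

With this shape, subclasses only implement the protected hook, and the create and update steps live in one place in the base class.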

Scorer changes

  • Includes the AttackConfiguration in functions where prompts are created for scoring with an LLM.

Normalizer changes

  • Accepts the AttackConfiguration when creating a NormalizerRequest in order to populate the SeedPrompt.
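
Putting the single-turn and normalizer changes together, the flow could be sketched as follows. The function names send_prompt_async and send_normalizer_requests_async mirror the ones mentioned above, but their bodies, the create_normalizer_request helper, the simplified classes and the placeholder scorer name length_scorer are assumptions made purely for illustration.

```python
import asyncio
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class AttackConfiguration:
    objective: str  # assumed field
    start_time: datetime
    end_time: Optional[datetime] = None
    # For single-turn attacks: one score value per scorer used.
    attack_result: Dict[str, float] = field(default_factory=dict)


@dataclass
class SeedPrompt:
    value: str
    attack_configuration: AttackConfiguration


@dataclass
class NormalizerRequest:
    seed_prompt: SeedPrompt


def create_normalizer_request(
    prompt_text: str, attack_configuration: AttackConfiguration
) -> NormalizerRequest:
    """Hypothetical factory: accepts the configuration and places it on the SeedPrompt."""
    return NormalizerRequest(seed_prompt=SeedPrompt(prompt_text, attack_configuration))


async def send_normalizer_requests_async(requests: List[NormalizerRequest]) -> None:
    """Second phase: send each request, then update its own configuration."""
    for request in requests:
        config = request.seed_prompt.attack_configuration
        await asyncio.sleep(0)  # placeholder for sending the prompt to the target
        # Placeholder scoring; a real scorer would evaluate the target's response.
        config.attack_result = {"length_scorer": float(len(request.seed_prompt.value))}
        config.end_time = datetime.now(timezone.utc)


async def send_prompt_async(prompt_list: List[str]) -> List[AttackConfiguration]:
    """First phase: each prompt is treated as its own attack with its own configuration."""
    requests = [
        create_normalizer_request(
            text,
            AttackConfiguration(objective=text, start_time=datetime.now(timezone.utc)),
        )
        for text in prompt_list
    ]
    await send_normalizer_requests_async(requests)
    return [r.seed_prompt.attack_configuration for r in requests]


configs = asyncio.run(send_prompt_async(["first prompt", "second prompt"]))
```

Because the configuration rides on each request's SeedPrompt, the function that receives the whole list can still update the right configuration for each prompt.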

Running 2 scripts using the changes in this PR results in the following data (screenshots of the tables are provided; if useful, we can share the duckdb file as well). The examples run Crescendo with 2 objectives and a PromptSendingOrchestrator with 2 prompts.

AttackConfiguration

Screenshot 2025-03-06 at 19 49 19

PromptEntries

Screenshot 2025-03-06 at 19 49 48

imranbohoran avatar Mar 07 '25 09:03 imranbohoran

Thanks for all of your comments. We've enjoyed going through each of them (me and a couple of my colleagues have been looking at it as a collective).

To restate the use case we considered for this change: we want a way to analyse attacks after they've been executed, which means being able to access all prompts that participated in an Attack along with their objective (a trace of all prompts against all targets in the attack). This includes prompts for the objective, system prompts and prompts for scoring. With this in mind, we felt that capturing the overall result of an attack is useful, and hence captured it in the AttackConfiguration. Thinking through this more, inspired by your comments so far, it does feel like the result of an Attack should probably be stored separately. We've responded to the comments in-line with this thinking.

Considering @rlundeen2's new proposal, we are unsure how it will address the use case, given that there are many conversations that take place during an attack. We've put our thoughts on this in the relevant comments as well.

We really appreciate your engagement on this with us and appreciate that this is a relatively big task for an external contribution. Therefore we are happy to have a back-and-forth on the design and any nit-picking as well. If there's a benefit to a synchronous design discussion, we are happy to join that as well (if that's something you think is possible). Or if fleshing out the initial design of this is more suitable in Discord, we are happy to do that as well.

imranbohoran avatar Mar 10 '25 15:03 imranbohoran

Thanks @imranbohoran ! IMO it wouldn't hurt to talk this through over a call. I'll check with @rlundeen2 tomorrow on that. Feel free to email me at my GH username (at) microsoft (dot) com

romanlutz avatar Mar 10 '25 19:03 romanlutz

This was a great way to kickstart the effort! As mentioned in the corresponding issue our thinking about this has evolved a bit and @rlundeen2 and @bashirpartovi took over the larger scale refactoring along with a general orchestrator refactor. The largest bit so far is #945 . If you have any feedback please comment there. Thanks again for the great suggestions.

romanlutz avatar Jun 06 '25 23:06 romanlutz