[Proof of Concept] Introducing AttackConfigurations
The scope of this PR
This PR has a single commit that demonstrates how the AttackConfiguration can be included, in order to get some feedback on the general approach. No tests have been added or updated at this point, and only the CrescendoOrchestrator and PromptSendingOrchestrator have been changed to help with the demonstration. These 2 orchestrators were chosen to give an end-to-end slice of the different variations of the attacks. This change doesn't try to address every aspect of the proposal and only looks at the introduction of the AttackConfiguration (i.e. removing orchestrator_identifier and labels is not included).
Changes
The following is a summary of the main areas of change.
Changes to Memory models and operations
- Memory models:
  - AttackConfigurationEntry
    - This is the new memory model for the attack configuration concept.
    - In addition to the fields proposed in the issue, the following fields were introduced:
      - `attack_result`: This is intended to capture the overall result of the attack. For multi-turn attacks, where there is an objective, this indicates whether the objective was achieved. For single-turn attacks, this holds all the score values of the scorers used. Prior to this change, the result of a multi-turn attack was not persisted anywhere; storing it as part of the AttackConfiguration gives us the ability to retrieve the result for reporting and offline analysis.
      - `start_time` and `end_time`: These were added to record the start and end times of an attack. This information is not available right now, and the AttackConfiguration gives us a sensible place to capture it.
  - PromptMemoryEntry
    - This now adds a new `attack_configuration_id` column, which has a foreign-key reference to the AttackConfigurations.
    - There is also a Mapped object for the AttackConfiguration. Given that this is a single object and is useful to view in the context of the PromptEntry, the mapping was created with a lazy configuration of "joined" (i.e. it is loaded together with the PromptEntry through a JOIN rather than on first access).
- Memory Interface
  - Convenience functions to add and update the AttackConfigurations. These utilise the existing `_insert_entry` and `_update_entries` functions.
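The relationship between the two memory models can be sketched as follows. This is only an illustration using Python's built-in `sqlite3` module, not the SQLAlchemy models and DuckDB storage used in the PR. The table names, the `objective` and `prompt_text` columns, and every helper function are assumptions made for the example. It shows the foreign key from prompt entries to attack configurations, an add and an update step, and the kind of joined read that retrieves each prompt together with its attack configuration.

```python
import sqlite3
from datetime import datetime, timezone


def _now() -> str:
    """Current UTC time as an ISO-8601 string."""
    return datetime.now(timezone.utc).isoformat()


def create_schema(conn: sqlite3.Connection) -> None:
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE AttackConfigurations (
            id            TEXT PRIMARY KEY,
            objective     TEXT,           -- assumed column, for illustration
            attack_result TEXT,           -- empty until the attack finishes
            start_time    TEXT NOT NULL,
            end_time      TEXT
        );
        CREATE TABLE PromptMemoryEntries (
            id                      INTEGER PRIMARY KEY,
            prompt_text             TEXT NOT NULL,  -- assumed column
            attack_configuration_id TEXT REFERENCES AttackConfigurations(id)
        );
        """
    )


def add_attack_configuration(conn: sqlite3.Connection, config_id: str, objective: str) -> None:
    """Hypothetical 'add' helper: records the configuration and its start time."""
    conn.execute(
        "INSERT INTO AttackConfigurations (id, objective, start_time) VALUES (?, ?, ?)",
        (config_id, objective, _now()),
    )


def update_attack_configuration(conn: sqlite3.Connection, config_id: str, attack_result: str) -> None:
    """Hypothetical 'update' helper: records the result and the end time."""
    conn.execute(
        "UPDATE AttackConfigurations SET attack_result = ?, end_time = ? WHERE id = ?",
        (attack_result, _now(), config_id),
    )


def add_prompt(conn: sqlite3.Connection, prompt_text: str, config_id: str) -> None:
    conn.execute(
        "INSERT INTO PromptMemoryEntries (prompt_text, attack_configuration_id) VALUES (?, ?)",
        (prompt_text, config_id),
    )


def prompts_for_attack(conn: sqlite3.Connection, config_id: str) -> list:
    """Return every prompt of one attack together with its objective and result."""
    return conn.execute(
        """
        SELECT p.prompt_text, a.objective, a.attack_result
        FROM PromptMemoryEntries AS p
        JOIN AttackConfigurations AS a ON a.id = p.attack_configuration_id
        WHERE a.id = ?
        ORDER BY p.id
        """,
        (config_id,),
    ).fetchall()
```

With this shape, a single query by configuration id returns the trace of prompts for one attack, and inserting a prompt that points at a non-existent configuration is rejected by the foreign key.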
Model changes
- AttackConfiguration
  - Represents the application model of the AttackConfigurationEntry.
- PromptRequestPiece
  - Includes an AttackConfiguration to reflect the model changes.
- SeedPrompt
  - Includes an AttackConfiguration that is used to pass the relevant attack configuration per prompt. Single-turn attacks can have multiple prompts provided as part of the orchestrator, and these prompts can represent varying objectives, so including the AttackConfiguration at the `SeedPrompt` level seemed a reasonable way to carry it when creating PromptEntries.
Orchestrator changes
- MultiTurn Orchestrator
  - The `run_attacks_async` function wraps the execution of `run_attack_async` (through the `limited_run_attack` function) with an AttackConfiguration creation and update step. This allows all multi-turn orchestrators to create and update the AttackConfiguration without changing their underlying implementation.
  - However, this poses an issue with the `run_attack_async` function, which now expects an `AttackConfiguration` to be passed in. As an exposed API, I wouldn't think that's something a user should have to provide. Is there a need to have 2 functions to run attacks for multi-turn orchestrators, or is this an opportunity to expose only one (i.e. making `run_attack_async` the protected abstract function and making `run_attacks_async` the function to run attacks)? Our observation has been that the `run_attack_async` function is only called within the MultiTurnOrchestrator, some tests and docs.
- SingleTurn Orchestrator
  - Given that multiple prompts can be sent to this orchestrator, each prompt is considered an attack with its own objective.
  - An AttackConfiguration is created in the `send_prompt_async` function and is used when creating a `NormalizerRequest` (to include it in the SeedPrompt).
  - The AttackConfiguration is then passed through the SeedPrompt for when prompts are created (including system prompts).
  - The update to the AttackConfiguration (to add the `attack_result` and `end_time`) is done in the `send_normalizer_requests_async` function (which seemed to be only used in the PromptSendingOrchestrator and some docs).
  - The creation and update happen in 2 functions because this orchestrator can have multiple prompts provided, and the prompts are passed as a list to the upstream functions.
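To make the multi-turn question more concrete, here is a minimal sketch of the "expose only one function" option. It is not the code in this PR: the class names, the in-memory `saved` dictionary standing in for the memory interface, the `"achieved"` / `"not achieved"` result strings, the `batch_size` default and the underscore-prefixed `_run_attack_async` are all assumptions for illustration. The public `run_attacks_async` owns the creation and update of the AttackConfiguration, so subclasses only implement the attack itself.

```python
import asyncio
import uuid
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AttackConfiguration:
    """Simplified stand-in for the application model (fields are assumed)."""
    objective: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    start_time: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    end_time: Optional[datetime] = None
    attack_result: Optional[str] = None


class MultiTurnOrchestratorSketch(ABC):
    def __init__(self) -> None:
        # Stand-in for the memory interface: configuration id -> configuration.
        self.saved: dict[str, AttackConfiguration] = {}

    @abstractmethod
    async def _run_attack_async(self, *, objective: str, attack_configuration: AttackConfiguration) -> bool:
        """Run one attack and return whether the objective was achieved."""

    async def run_attacks_async(self, *, objectives: list[str], batch_size: int = 5) -> list[AttackConfiguration]:
        semaphore = asyncio.Semaphore(batch_size)  # caps concurrent attacks

        async def limited_run_attack(objective: str) -> AttackConfiguration:
            async with semaphore:
                # Creation step: happens before the attack starts.
                config = AttackConfiguration(objective=objective)
                self.saved[config.id] = config
                achieved = await self._run_attack_async(
                    objective=objective, attack_configuration=config
                )
                # Update step: record the outcome and the end time.
                config.attack_result = "achieved" if achieved else "not achieved"
                config.end_time = datetime.now(timezone.utc)
                return config

        return list(await asyncio.gather(*(limited_run_attack(o) for o in objectives)))


class EchoOrchestrator(MultiTurnOrchestratorSketch):
    """Toy subclass: 'achieves' any objective containing the word 'easy'."""

    async def _run_attack_async(self, *, objective: str, attack_configuration: AttackConfiguration) -> bool:
        await asyncio.sleep(0)  # placeholder for the real multi-turn conversation
        return "easy" in objective
```

A caller would then only ever use something like `asyncio.run(EchoOrchestrator().run_attacks_async(objectives=[...]))` and never has to construct an AttackConfiguration.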
Scorer changes
- Includes the AttackConfiguration in functions where prompts are created for scoring with an LLM.
Normalizer changes
- Accepts the AttackConfiguration when creating a `NormalizerRequest` in order to populate the `SeedPrompt`.
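The single-turn flow, where each prompt carries its own AttackConfiguration through the `NormalizerRequest` and `SeedPrompt`, could look roughly like the sketch below. The classes here are simplified stand-ins that only share names with the PyRIT classes; their fields, the `build_requests` and `record_results` helpers, and the use of the prompt text as the objective are assumptions for illustration.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AttackConfiguration:
    """Simplified stand-in for the application model (fields are assumed)."""
    objective: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    start_time: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    end_time: Optional[datetime] = None
    attack_result: Optional[list] = None  # score values for single-turn attacks


@dataclass
class SeedPrompt:
    value: str
    attack_configuration: Optional[AttackConfiguration] = None


@dataclass
class NormalizerRequest:
    seed_prompt: SeedPrompt


def build_requests(prompts: list[str]) -> list[NormalizerRequest]:
    """Creation step: one AttackConfiguration per prompt, carried on its SeedPrompt."""
    return [
        NormalizerRequest(
            seed_prompt=SeedPrompt(
                value=prompt,
                # Assumption: the prompt text doubles as the objective.
                attack_configuration=AttackConfiguration(objective=prompt),
            )
        )
        for prompt in prompts
    ]


def record_results(requests: list[NormalizerRequest], scores: dict[str, list[float]]) -> None:
    """Update step: attach the score values and end time once responses are scored."""
    for request in requests:
        config = request.seed_prompt.attack_configuration
        if config is None:
            continue  # prompt was not created as part of an attack
        config.attack_result = scores.get(request.seed_prompt.value, [])
        config.end_time = datetime.now(timezone.utc)
```

Because the requests travel as a list between the two steps, the configuration has to ride along with each prompt, which is why creation and update end up in separate functions.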
Running 2 scripts using the changes in this PR results in the following data (screenshots of the tables are provided; if useful, we can share the duckdb file as well). The examples run Crescendo with 2 objectives and a PromptSendingOrchestrator with 2 prompts.
AttackConfiguration
PromptEntries
Thanks for all of your comments. We've enjoyed going through each of them (a couple of my colleagues and I have been looking at it collectively).
To restate the use case we considered for this change: we want a way to analyse attacks after they've been executed, and so we are looking to access all prompts that participated in an Attack along with their objective (a trace of all prompts against all targets in the attack). This includes prompts for the objective, system prompts and prompts for scoring. With this in mind, we felt that capturing the overall result of an attack is useful, and hence captured it in the AttackConfiguration. Thinking through this more, inspired by your comments so far, it does feel like the result of an Attack should probably be stored separately. We've responded to the comments in-line with this thinking.
Considering @rlundeen2's new proposal, we are unsure how it will address the use case, given that there are many conversations that take place during an attack. We've put our thoughts on this in the relevant comments as well.
We really appreciate your engagement with us on this, and we recognise that this is a relatively big task for an external contribution. Therefore we are happy to have a back-and-forth on the design and any nit-picking as well. If there's benefit in a synchronous design discussion, we are happy to join that (if that's something you think is possible). Or if fleshing out the initial design is more suitable on Discord, we are happy to do that as well.
Thanks @imranbohoran ! IMO it wouldn't hurt to talk this through over a call. I'll check with @rlundeen2 tomorrow on that. Feel free to email me at my GH username (at) microsoft (dot) com
This was a great way to kickstart the effort! As mentioned in the corresponding issue, our thinking about this has evolved a bit, and @rlundeen2 and @bashirpartovi took over the larger-scale refactoring along with a general orchestrator refactor. The largest bit so far is #945. If you have any feedback, please comment there. Thanks again for the great suggestions.