garak icon indicating copy to clipboard operation
garak copied to clipboard

Changed Attempt.outputs to return all assistant outputs

Open mrowebot opened this issue 9 months ago • 8 comments

This PR fixes #1127.

Changed

  • attempts.Attempt.outputs to return all assistant turns outputs, not just the final turn output.

Added

  • last_output property to attempts.Attempt to return the last assistant turn output.
  • Unit test cases for probes.atkgen.Tox and detectors.unsafe_content.ToxicCommentModel to validate that outputs and detector_results cardinalities match.

Verification

  • [ ] Run the tests and ensure they pass python -m pytest tests/
  • [ ] python -m garak --model_type huggingface --model_name gpt2 --probes atkgen.Tox --detectors unsafe_content.ToxicCommentModel

The above CMD will write the probe attempts including detector results to the garak*.report.jsonl file and will contain entries where the "entry_type": "attempt" entry will contain outputs and detector_results with the same cardinality, should any detection results have been produced.

mrowebot avatar Apr 17 '25 21:04 mrowebot

The attempt.outputs method is currently in flux as turn an conversation support is coming soon in main. Until a week ago detectors used attempt.all_outputs() and as of land for #943 with language translation options they all now rely on attempt.outputs_for(lang) to get the list of outputs a detector should perform evaluation on, these methods are currently what are expected to have a 1:1 match in to the list of detector_results[detector_name] however I do see that Evaluator.evaluate() relies on attempt.outputs() which likely indicates #1127 is a unique issue. The rework of turn/conversation will need likely incorporate testing to validate detector result count to output expectations are enforced.

Note that the atkgen probe's custom behavior and mutation of Attempt is likely the true root of this divergence as that probe emits Attempts that to not conform to normal detector expectations.

I suspect this change is not quite what is needed however I need to do some more testing to be sure.

jmartin-tech avatar Apr 17 '25 23:04 jmartin-tech

Sounds like this PR needs to be held until attempts.output is clearer, as this could be an edge case involving atkgen

mrowebot avatar Apr 18 '25 06:04 mrowebot

General refactor is upcoming in #1089 which is planned for the coming fortnight

leondz avatar Apr 18 '25 06:04 leondz

@leondz park this then until refactor and potentially delete? Feels like it might be a redundant change.

mrowebot avatar Apr 18 '25 22:04 mrowebot

The Turn & Conversation functionality won't hit release until end May, possibly end June. Depending on review, if we can land this earlier than that, I'm happy

leondz avatar Apr 19 '25 06:04 leondz

@leondz sorry if I was unclear, the root cause of the issue seems to be in atkgen's custom mutation of it's attempts not matching the expectations of the detector the attempts are passed to.

While attempt.outputs() is only called in Evaluator.evaluate at this time, I do not believe a global changing in behavior of this method is the desired result as this impacts all evaluation processing. This PR is set as draft for now and I will take a closer look in the next few days at possible follow on impacts before shifting it back to ready for review state.

jmartin-tech avatar Apr 19 '25 18:04 jmartin-tech

@jmartin-tech let me know if any changes are needed here and happy to update the branch/PR.

mrowebot avatar Apr 19 '25 20:04 mrowebot

This highlighted an unclear behaviour in atkgen and whether use of all_outputs or outputs is preferred when doing detection on an attempt. Atkgen expects that the detector will be run over all outputs at any conversational turn.

leondz avatar Apr 23 '25 15:04 leondz

#1254 is going to alter the structure of Attempt quite a lot, as well as providing explicit abstractions for conversational turns, messages and message history within prompts, and multi-exchange conversations. The distinction emerging from here between using all vs. most-recent outputs is integrated into the spec for that broader change. Thank you very much!

leondz avatar Jun 27 '25 13:06 leondz