garak Changed Attempt.outputs to return all assistant outputs

This PR fixes #1127.

Changed

attempts.Attempt.outputs to return all assistant turns outputs, not just the final turn output.

Added

last_output property to attempts.Attempt to return the last assistant turn output.
Unit test cases for probes.atkgen.Tox and detectors.unsafe_content.ToxicCommentModel to validate that outputs and detector_results cardinalities match.

Verification

[ ] Run the tests and ensure they pass python -m pytest tests/
[ ] python -m garak --model_type huggingface --model_name gpt2 --probes atkgen.Tox --detectors unsafe_content.ToxicCommentModel

The above CMD will write the probe attempts including detector results to the garak*.report.jsonl file and will contain entries where the "entry_type": "attempt" entry will contain outputs and detector_results with the same cardinality, should any detection results have been produced.

Apr 17 '25 21:04 mrowebot

The attempt.outputs method is currently in flux as turn an conversation support is coming soon in main. Until a week ago detectors used attempt.all_outputs() and as of land for #943 with language translation options they all now rely on attempt.outputs_for(lang) to get the list of outputs a detector should perform evaluation on, these methods are currently what are expected to have a 1:1 match in to the list of detector_results[detector_name] however I do see that Evaluator.evaluate() relies on attempt.outputs() which likely indicates #1127 is a unique issue. The rework of turn/conversation will need likely incorporate testing to validate detector result count to output expectations are enforced.

Note that the atkgen probe's custom behavior and mutation of Attempt is likely the true root of this divergence as that probe emits Attempts that to not conform to normal detector expectations.

I suspect this change is not quite what is needed however I need to do some more testing to be sure.

Apr 17 '25 23:04 jmartin-tech

Sounds like this PR needs to be held until attempts.output is clearer, as this could be an edge case involving atkgen

Apr 18 '25 06:04 mrowebot

General refactor is upcoming in #1089 which is planned for the coming fortnight

Apr 18 '25 06:04 leondz

@leondz park this then until refactor and potentially delete? Feels like it might be a redundant change.

Apr 18 '25 22:04 mrowebot

The Turn & Conversation functionality won't hit release until end May, possibly end June. Depending on review, if we can land this earlier than that, I'm happy

Apr 19 '25 06:04 leondz

@leondz sorry if I was unclear, the root cause of the issue seems to be in atkgen's custom mutation of it's attempts not matching the expectations of the detector the attempts are passed to.

While attempt.outputs() is only called in Evaluator.evaluate at this time, I do not believe a global changing in behavior of this method is the desired result as this impacts all evaluation processing. This PR is set as draft for now and I will take a closer look in the next few days at possible follow on impacts before shifting it back to ready for review state.

Apr 19 '25 18:04 jmartin-tech

@jmartin-tech let me know if any changes are needed here and happy to update the branch/PR.

Apr 19 '25 20:04 mrowebot

This highlighted an unclear behaviour in atkgen and whether use of all_outputs or outputs is preferred when doing detection on an attempt. Atkgen expects that the detector will be run over all outputs at any conversational turn.

Apr 23 '25 15:04 leondz

#1254 is going to alter the structure of Attempt quite a lot, as well as providing explicit abstractions for conversational turns, messages and message history within prompts, and multi-exchange conversations. The distinction emerging from here between using all vs. most-recent outputs is integrated into the spec for that broader change. Thank you very much!

Jun 27 '25 13:06 leondz