Changed Attempt.outputs to return all assistant outputs
This PR fixes #1127.
Changed
-
attempts.Attempt.outputsto return all assistant turns outputs, not just the final turn output.
Added
-
last_outputproperty toattempts.Attemptto return the last assistant turn output. - Unit test cases for
probes.atkgen.Toxanddetectors.unsafe_content.ToxicCommentModelto validate that outputs and detector_results cardinalities match.
Verification
- [ ] Run the tests and ensure they pass
python -m pytest tests/ - [ ]
python -m garak --model_type huggingface --model_name gpt2 --probes atkgen.Tox --detectors unsafe_content.ToxicCommentModel
The above CMD will write the probe attempts including detector results to the garak*.report.jsonl file and will contain entries where the "entry_type": "attempt" entry will contain outputs and detector_results with the same cardinality, should any detection results have been produced.
The attempt.outputs method is currently in flux as turn an conversation support is coming soon in main. Until a week ago detectors used attempt.all_outputs() and as of land for #943 with language translation options they all now rely on attempt.outputs_for(lang) to get the list of outputs a detector should perform evaluation on, these methods are currently what are expected to have a 1:1 match in to the list of detector_results[detector_name] however I do see that Evaluator.evaluate() relies on attempt.outputs() which likely indicates #1127 is a unique issue. The rework of turn/conversation will need likely incorporate testing to validate detector result count to output expectations are enforced.
Note that the atkgen probe's custom behavior and mutation of Attempt is likely the true root of this divergence as that probe emits Attempts that to not conform to normal detector expectations.
I suspect this change is not quite what is needed however I need to do some more testing to be sure.
Sounds like this PR needs to be held until attempts.output is clearer, as this could be an edge case involving atkgen
General refactor is upcoming in #1089 which is planned for the coming fortnight
@leondz park this then until refactor and potentially delete? Feels like it might be a redundant change.
The Turn & Conversation functionality won't hit release until end May, possibly end June. Depending on review, if we can land this earlier than that, I'm happy
@leondz sorry if I was unclear, the root cause of the issue seems to be in atkgen's custom mutation of it's attempts not matching the expectations of the detector the attempts are passed to.
While attempt.outputs() is only called in Evaluator.evaluate at this time, I do not believe a global changing in behavior of this method is the desired result as this impacts all evaluation processing. This PR is set as draft for now and I will take a closer look in the next few days at possible follow on impacts before shifting it back to ready for review state.
@jmartin-tech let me know if any changes are needed here and happy to update the branch/PR.
This highlighted an unclear behaviour in atkgen and whether use of all_outputs or outputs is preferred when doing detection on an attempt. Atkgen expects that the detector will be run over all outputs at any conversational turn.
#1254 is going to alter the structure of Attempt quite a lot, as well as providing explicit abstractions for conversational turns, messages and message history within prompts, and multi-exchange conversations. The distinction emerging from here between using all vs. most-recent outputs is integrated into the spec for that broader change. Thank you very much!