verifiers icon indicating copy to clipboard operation
verifiers copied to clipboard

Add support for re-scoring evaluations

Open ob1-s opened this issue 2 months ago • 2 comments

Description

My attempt to implement #438. I ran into a few edge-cases I need to solve before this can be merged.

I successfully tested the re-score workflow with the following envs: gsm8k, wordle, simpleqa, continuation_quality and reverse_text.

But found some edge-cases in these:

  • tool_test: since the re-scoring logic doesn't deserialize the tool call objects from the original results, it fails to apply the rubrics;
  • acebench_agent_multistep: if i got this right, the re-scoring doesn't work here because this MultiTurnEnv relies on non-serializable state.

Any direction here would be very much appreciated.

Type of Change

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Documentation update
  • [ ] Test improvement

Testing

  • [x] All existing tests pass when running uv run pytest locally.
  • [ ] New tests have been added to cover the changes
  • [x] Relevant tests have been updated.

Checklist

  • [x] My code follows the style guidelines of this project as outlined in AGENTS.md
  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings
  • [ ] Any dependent changes have been merged and published [not relevant]

Additional Notes

ob1-s avatar Oct 18 '25 02:10 ob1-s

Is there any plan for this?

vodenkaj avatar Jul 30 '25 13:07 vodenkaj

Hey @vodenkaj! We haven't gotten to this yet, as we have other priorities at the moment.

FedericoBonel avatar Aug 01 '25 03:08 FedericoBonel