SWE-bench
logs are unusable with multiple test instances
harness/run_evaluation.py takes a --log_dir argument, but if the --predictions_path file contains multiple predictions for the same model and test instance, they all write to a single file in --log_dir and clobber each other.
Hi @JasonGross, I'm facing the same problem. Would you mind sharing your solution here? Much appreciated!
I appended a unique number to the name of the model, e.g., gpt-4-uid1, gpt-4-uid2, etc., so that the program thinks each instance was generated by a different "model".
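For anyone wanting to apply this workaround, here is a minimal sketch. It assumes the predictions are stored as a .jsonl file whose records carry the model name in a `model_name_or_path` field; adjust the key and file names to match your setup.

```python
import json

# Hypothetical file names; point these at your own predictions.
src = "predictions.jsonl"
dst = "predictions_uniquified.jsonl"

with open(src) as f_in, open(dst, "w") as f_out:
    for i, line in enumerate(f_in, start=1):
        pred = json.loads(line)
        # Append a unique id so each prediction looks like it came from a
        # different "model", e.g. gpt-4-uid1, gpt-4-uid2, ...
        pred["model_name_or_path"] = f'{pred["model_name_or_path"]}-uid{i}'
        f_out.write(json.dumps(pred) + "\n")
```

Passing the rewritten file to --predictions_path then gives each prediction its own log file.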
Thanks for pointing this out @JasonGross @skzhang1 along with the proposed solution.
I agree that this is a bit inconvenient. The solution by @JasonGross would definitely work.
For convenience, in commit ef1d5f, I have added a --log_suffix argument. You can provide a string that gets appended to the end of the log's name:
- `{instance_id}.{suffix}.log` for validation
- `{instance_id}.{model}.{suffix}.eval.log` for evaluation
Hope this helps! Happy to address any follow-ups, and feel free to close this issue if it resolves the topic.
How does --log_suffix help? It doesn't seem much different from --log_dir in terms of functionality, since --log_suffix can only be specified once per call to harness/run_evaluation.py.
The whole problem is that a single call to harness/run_evaluation.py has one log file per (instance id, model) pair, whereas we should actually get one log file per (instance id, model, test instance).
@JasonGross I understand what you are saying. I think it is fine if users would like to use the work-around you suggested, but I don't plan to support this auto-incrementing in the repository.
The evaluation was written assuming that a single evaluation run would be given a --predictions_path where there is exactly 1 prediction per task instance. Every run would then generate 1 execution log per task instance, where the log's naming follows the `<instance ID>.<model>.eval.log` convention. Calculating metrics is more straightforward this way imo.
If you want to do a "pass@k" style evaluation, where model inference is run "k" times for the same task instance, I would recommend storing the predictions from k runs across k different .jsonl files, and then running evaluation on each .jsonl file; you can then use --log_suffix to distinguish between the runs.