SWE-bench
logs are unusable with multiple test instances
harness/run_evaluation.py takes a --log_dir argument, but if the --predictions_path file contains multiple predictions for the same model and test instance, they all write to a single file in --log_dir and clobber each other.
Hi @JasonGross, I'm facing the same problem. Would you mind sharing your solution here? Much appreciated!
I appended a unique number to the name of the model, e.g., gpt-4-uid1, gpt-4-uid2, etc., so that the program thinks each instance was generated by a different "model".
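For anyone wanting to apply this workaround, here is a minimal sketch. It assumes the predictions are stored as a .jsonl file whose records carry the model name in a `model_name_or_path` field; adjust the key and file names to match your setup.

```python
import json

# Hypothetical file names; point these at your own predictions.
src = "predictions.jsonl"
dst = "predictions_uniquified.jsonl"

with open(src) as f_in, open(dst, "w") as f_out:
    for i, line in enumerate(f_in, start=1):
        pred = json.loads(line)
        # Append a unique id so each prediction looks like it came from a
        # different "model", e.g. gpt-4-uid1, gpt-4-uid2, ...
        pred["model_name_or_path"] = f'{pred["model_name_or_path"]}-uid{i}'
        f_out.write(json.dumps(pred) + "\n")
```

Passing the rewritten file to --predictions_path then gives each prediction its own log file.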
Thanks for pointing this out @JasonGross @skzhang1 along with the proposed solution.
I agree that this is a bit inconvenient. The solution by @JasonGross would definitely work.
For convenience, in commit ef1d5f, I have added a --log_suffix argument. You can provide a string that gets appended to the end of the log's name:
- `{instance_id}.{suffix}.log` for validation
- `{instance_id}.{model}.{suffix}.eval.log` for evaluation
Hope this helps! Happy to address any follow-ups, and feel free to close this issue if it resolves the topic.
How does --log_suffix help? It doesn't seem much different from --log_dir in terms of functionality, since --log_suffix can only be specified once per call to harness/run_evaluation.py.
The whole problem is that a single call to harness/run_evaluation.py has one log file per (instance id, model) pair, whereas we should actually get one log file per (instance id, model, test instance).
@JasonGross I understand what you are saying. I think it is fine if users would like to use the work-around you suggested, but I don't plan to support this auto-incrementing in the repository.
The evaluation was written assuming that a single evaluation run would be given a --predictions_path where there is exactly 1 prediction per task instance. Every run would then generate 1 execution log per task instance, where the log's naming follows the `<instance ID>.<model>.eval.log` convention. Calculating metrics is more straightforward this way imo.
If you want to do a "pass@k" style evaluation, where model inference is run "k" times for the same task instance, I would recommend storing the predictions from k runs across k different .jsonl files, and then running evaluation on each .jsonl file; you can then use --log_suffix to distinguish between the runs.