Checkpointing Evaluation Results and Enabling Resume of Evaluation
Hi, I wonder whether the harness plans to support checkpointing evaluation results during a run, so that if an error occurs while evaluating a benchmark (e.g., the GPU allocation hits its walltime limit), the evaluation can resume from (nearly) where it stopped. Thanks!
Hi! You can pass --use_cache <DIR> to cache results while evaluating and skip previously evaluated samples on resumption. Caching is rank-dependent, though, so restart with the same GPU count if interrupted!
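For reference, a minimal sketch of the same thing through the Python API (the model, task, and cache path below are placeholders, and this assumes use_cache takes a path for the sqlite response cache, as with the CLI flag):

```python
# Minimal sketch: resumable evaluation via the Python API.
# Model, task, and cache path are placeholders; adjust for your setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    # Reuse the same cache path on restart so previously answered samples are skipped.
    use_cache="/scratch/lm_eval_cache/run1",
)
```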
Thanks for the prompt response! A quick follow-up: how about evaluation via a local inference server? Can it be resumed?
Yeah, that also implements caching!
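For example, something along these lines (the model name and base_url are placeholders, this assumes an OpenAI-compatible completions server is already running locally, and additional model_args such as tokenizer settings may be needed depending on your server):

```python
# Sketch: evaluating against a local OpenAI-compatible server, with the same
# response cache used for resumption. Placeholders throughout.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args="model=my-model,base_url=http://localhost:8000/v1/completions",
    tasks=["hellaswag"],
    use_cache="/scratch/lm_eval_cache/run1",  # same caching mechanism as above
)
```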
Here is a bug: if inference has fully completed but the evaluation step after it fails (in my case, with "Resource punkt_tab not found"), then rerunning lm-eval to resume the evaluation after fixing that issue shows this error:
2024-09-03:20:51:09,639 INFO [model.py:283] Cached requests: 541, Requests remaining: 0
Traceback (most recent call last):
File ".../conda-envs/llava/bin/lm_eval", line 8, in <module>
sys.exit(cli_evaluate())
File ".../lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
results = evaluator.simple_evaluate(
File ".../lm-evaluation-harness/lm_eval/utils.py", line 398, in _wrapper
return fn(*args, **kwargs)
File ".../lm-evaluation-harness/lm_eval/evaluator.py", line 296, in simple_evaluate
results = evaluate(
File ".../lm-evaluation-harness/lm_eval/utils.py", line 398, in _wrapper
return fn(*args, **kwargs)
File ".../lm-evaluation-harness/lm_eval/evaluator.py", line 468, in evaluate
resps = getattr(lm, reqtype)(cloned_reqs)
File ".../lm-evaluation-harness/lm_eval/api/model.py", line 287, in fn
rem_res = getattr(self.lm, attr)(remaining_reqs)
File ".../lm-evaluation-harness/lm_eval/models/api_models.py", line 522, in generate_until
requests, all_gen_kwargs = zip(*(req.args for req in requests))
ValueError: not enough values to unpack (expected 2, got 0)
It seems that having 0 samples left to run causes an error, and since the cache cannot be modified manually, I have to run the evaluation all over again. Please fix this problem, thank you a lot!
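To illustrate what seems to go wrong (a hypothetical helper mirroring the failing line, not the actual api_models.py code or the eventual fix): when every request has already been answered from the cache, the remaining list is empty, so the two-way unpack of zip(...) fails, and a simple early return would avoid it:

```python
# Hypothetical illustration only; not the harness's real code.
def split_requests(requests):
    if not requests:  # all samples already answered from the cache
        return (), ()
    # With an empty list, zip(*...) yields nothing to unpack into two tuples,
    # raising "ValueError: not enough values to unpack (expected 2, got 0)".
    contexts, all_gen_kwargs = zip(*(req.args for req in requests))
    return contexts, all_gen_kwargs

print(split_requests([]))  # -> ((), ()) instead of a ValueError
```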
Hi! Are you using the latest commit? This error should be fixed after #2187.
Oh, thank you very much for the reminder! I am using an earlier version and will update to the newer commit. Sorry for the noise, and thank you again for the effort!