
Checkpointing Evaluation Results and Enabling Resumption of Evaluation

Open Zilinghan opened this issue 1 year ago • 3 comments

Hi, I wonder if the harness is planning to support checkpointing evaluation results during the evaluation process, so that if an error occurs while evaluating a benchmark (e.g., the GPU allocation walltime runs out), the evaluation can resume from (nearly) where it stopped. Thanks!

Zilinghan avatar Jul 26 '24 17:07 Zilinghan

Hi! You can pass --use_cache <DIR> to cache results while evaluating and skip previously evaluated samples on resumption. Caching is rank-dependent though, so restart with the same GPU count if the run is interrupted!

baberabb avatar Jul 26 '24 17:07 baberabb
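For reference, a minimal sketch of the equivalent Python API call, assuming the --use_cache CLI flag corresponds to the use_cache argument of simple_evaluate (the model, task, and cache path below are placeholders, not from the thread):

# Sketch only: reuse the same cache path when resuming an interrupted run.
from lm_eval.evaluator import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["hellaswag"],                             # placeholder task
    use_cache="./eval_cache",  # cache location; pass the same path on resumption
    batch_size=8,
)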

Thanks for the prompt response! A quick follow-up: what about evaluation via a local inference server? Can that be resumed as well?

Zilinghan avatar Jul 26 '24 17:07 Zilinghan

yeah that also implements caching!

baberabb avatar Jul 26 '24 17:07 baberabb
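As a sketch of the server case, assuming the OpenAI-compatible local-completions backend (the base_url, model name, and task below are placeholders, and the exact model_args keys may vary by version):

# Sketch only: evaluating against a local OpenAI-compatible server while using
# the same request cache, so an interrupted run can be resumed.
from lm_eval.evaluator import simple_evaluate

results = simple_evaluate(
    model="local-completions",  # assumed OpenAI-compatible server backend
    model_args="model=my-local-model,base_url=http://localhost:8000/v1/completions",
    tasks=["hellaswag"],        # placeholder task
    use_cache="./eval_cache",   # same caching mechanism as above
)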

Here is a bug: if inference has fully completed but the post-processing step afterwards fails (in my case, with "Resource punkt_tab not found"), then after fixing that and running lm-eval again to resume the evaluation, it shows this error:

2024-09-03:20:51:09,639 INFO     [model.py:283] Cached requests: 541, Requests remaining: 0
Traceback (most recent call last):
  File ".../conda-envs/llava/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File ".../lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
  File ".../lm-evaluation-harness/lm_eval/utils.py", line 398, in _wrapper
    return fn(*args, **kwargs)
  File ".../lm-evaluation-harness/lm_eval/evaluator.py", line 296, in simple_evaluate
    results = evaluate(
  File ".../lm-evaluation-harness/lm_eval/utils.py", line 398, in _wrapper
    return fn(*args, **kwargs)
  File ".../lm-evaluation-harness/lm_eval/evaluator.py", line 468, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File ".../lm-evaluation-harness/lm_eval/api/model.py", line 287, in fn
    rem_res = getattr(self.lm, attr)(remaining_reqs)
  File ".../lm-evaluation-harness/lm_eval/models/api_models.py", line 522, in generate_until
    requests, all_gen_kwargs = zip(*(req.args for req in requests))
ValueError: not enough values to unpack (expected 2, got 0)

It seems that having 0 remaining samples to run causes an error, and since the cache cannot be modified manually, I have to run the evaluation all over again. Please fix this problem, thank you a lot!

RiverGao avatar Sep 03 '24 12:09 RiverGao
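For context on where the ValueError comes from: zip(*(...)) over an empty request list (everything already served from the cache) produces nothing to unpack into two names. A standalone reproduction, not harness code:

# Reproduces the unpack failure in isolation.
requests = []  # all requests were already answered from the cache

try:
    args_list, all_gen_kwargs = zip(*(req.args for req in requests))
except ValueError as err:
    print(err)  # not enough values to unpack (expected 2, got 0)

# A guard like this (skipping generation when nothing remains) avoids the crash.
if not requests:
    args_list, all_gen_kwargs = (), ()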

> It seems that having 0 remaining samples to run causes an error, and since the cache cannot be modified manually, I have to run the evaluation all over again. Please fix this problem, thank you a lot!

Hi! Are you using the latest commit? This error should be fixed after #2187

baberabb avatar Sep 03 '24 13:09 baberabb

Oh, thank you very much for the reminder! I am using an earlier version and will update to the latest commit. Sorry for the trouble, and thank you again for the effort!

RiverGao avatar Sep 03 '24 13:09 RiverGao