[Feature] Support Resume Mechanism for Interrupted Inference Tasks

Open • ShikangPang opened this issue 5 months ago • 1 comment

Describe the feature

Problem Description

Currently, when a large-scale OpenCompass inference (infer) task is interrupted unexpectedly (e.g., by a resource failure or manual termination), it must be restarted from scratch and cannot resume from the point of interruption. This leads to:

Loss of completed inference results, causing resource waste through redundant computations

Poor fault tolerance for long-running tasks (e.g., full evaluation of billion-parameter models)

Feature Request

Add a resume mechanism for infer to achieve:

✅ Automatic progress recording: Periodically persist completed sample IDs/indices to a checkpoint file during runtime

✅ Intelligent resumption: Automatically detect checkpoint files upon restart and skip completed samples

✅ Progress visualization: Clearly display [Resumed] identifiers and remaining task volume in logs (e.g., Progress: 1200/5000 (resumed))

✅ Flexible control: Explicitly enable resumption via CLI parameters (e.g., --resume) to prevent accidental overwrites
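
The four behaviors listed above could be sketched roughly as follows. This is only an illustration of the request, not OpenCompass code: the JSON-lines checkpoint format and the names load_done, run_inference, predict, and ckpt_path are all hypothetical.

```python
import json
import os


def load_done(ckpt_path):
    """Return the set of sample indices already recorded in the checkpoint."""
    if not os.path.exists(ckpt_path):
        return set()
    done = set()
    with open(ckpt_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                done.add(json.loads(line)["idx"])
    return done


def run_inference(samples, predict, ckpt_path, resume=False):
    """Run predict on every sample, persisting each result as it completes.

    Hypothetical helper. With resume=True, samples found in the checkpoint
    are skipped. With resume=False, an existing checkpoint is refused so
    that earlier results are never overwritten by accident.
    """
    if resume:
        done = load_done(ckpt_path)
    elif os.path.exists(ckpt_path):
        raise FileExistsError(f"{ckpt_path} exists; pass resume=True to continue")
    else:
        done = set()

    total = len(samples)
    tag = " (resumed)" if done else ""
    # Append mode keeps the records written by the interrupted run.
    with open(ckpt_path, "a", encoding="utf-8") as f:
        for idx, sample in enumerate(samples):
            if idx in done:
                continue  # finished before the interruption
            record = {"idx": idx, "prediction": predict(sample)}
            f.write(json.dumps(record) + "\n")
            f.flush()  # push the record out before moving on
            done.add(idx)
            print(f"Progress: {len(done)}/{total}{tag}")
    return len(done)
```

A first call such as run_inference(samples, model_fn, "ckpt.jsonl") would start from scratch, and after an interruption run_inference(samples, model_fn, "ckpt.jsonl", resume=True) would only process the remaining samples. A real implementation would also need to cope with a partially written last line and with sharded or parallel tasks, which this sketch ignores.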

Will you implement it?

[ ] I would like to implement this feature and create a PR!

Jun 26 '25 02:06 ShikangPang

Thanks for your suggestion. We already support resuming: you can use -r latest to restart the inference.

Jun 26 '25 05:06 tonysy