[Feature] Support Resume Mechanism for Interrupted Inference Tasks
Describe the feature
Problem Description
Currently, when OpenCompass performs large-scale model inference (infer) and a task is interrupted unexpectedly (e.g., by a resource failure or manual termination), the task must be restarted from scratch and cannot resume from the point of interruption. This leads to:
- Loss of completed inference results, wasting resources on redundant computation
- Poor fault tolerance for long-running tasks (e.g., full evaluation of billion-parameter models)
Feature Request
Add a resume mechanism for infer to achieve:
✅ Automatic progress recording: Periodically persist completed sample IDs/indices to a checkpoint file during runtime
✅ Intelligent resumption: Automatically detect checkpoint files upon restart and skip completed samples
✅ Progress visualization: Clearly display [Resumed] identifiers and remaining task volume in logs (e.g., Progress: 1200/5000 (resumed))
✅ Flexible control: Explicitly enable resumption via CLI parameters (e.g., --resume) to prevent accidental overwrites
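To make the request concrete, here is a minimal sketch of the behaviour described above. It is not OpenCompass code: `load_checkpoint`, `save_checkpoint`, `run_inference`, the JSON checkpoint layout, and the `resume` / `save_every` parameters are all hypothetical names chosen for illustration, and `infer_fn` stands in for whatever actually calls the model.

```python
import json
import os
import tempfile
from typing import Callable, Dict, Sequence


def load_checkpoint(path: str) -> Dict[int, str]:
    """Return {sample index: prediction} from the checkpoint, or {} if none exists."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as f:
        # JSON object keys are always strings, so convert them back to ints.
        return {int(k): v for k, v in json.load(f).items()}


def save_checkpoint(path: str, results: Dict[int, str]) -> None:
    """Write to a temp file, then rename, so an interruption cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({str(k): v for k, v in results.items()}, f)
    os.replace(tmp_path, path)


def run_inference(
    samples: Sequence[str],
    infer_fn: Callable[[str], str],  # hypothetical stand-in for the model call
    ckpt_path: str,
    resume: bool = False,            # hypothetical equivalent of a --resume flag
    save_every: int = 100,           # persist after this many newly finished samples
) -> Dict[int, str]:
    # Only read the checkpoint when resumption is explicitly requested.
    results = load_checkpoint(ckpt_path) if resume else {}
    if results:
        print(f"[Resumed] Progress: {len(results)}/{len(samples)} (resumed)")

    unsaved = 0
    for idx, sample in enumerate(samples):
        if idx in results:
            continue  # already completed in a previous run
        results[idx] = infer_fn(sample)
        unsaved += 1
        if unsaved >= save_every:
            save_checkpoint(ckpt_path, results)
            unsaved = 0

    save_checkpoint(ckpt_path, results)
    return results
```

In this sketch the checkpoint stores the predictions themselves rather than only the completed indices, so a resumed run returns a full result set. A smaller `save_every` loses less work on interruption at the cost of more frequent writes.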
Will you implement it?
- [ ] I would like to implement this feature and create a PR!
Thanks for the suggestion. We already support resuming: you can use -r latest to restart the inference.