Resume previous configuration on crash/stop

Open BugsBuggy opened this issue 3 years ago • 1 comments

Hello everyone, I'm wondering whether SMAC natively supports, for HPO, continuing to evaluate a configuration that crashed in the previous run.

I found the link to restore the state helpful as a first step. However, that is not enough: I want to continue running the exact configuration that crashed in the previous run, since I can continue model training from checkpoints. It seems to me that smbo.py supports running the incumbent again, but setting the intensifier stage to IntensifierStage.PROCESS_INCUMBENT_RUN does not resolve the issue, because _get_inc_available_inst (intensification.py) returns None and SMAC samples a new configuration instead of resuming with the previous one.

Is there a fix for this? Also, an example in the documentation might be helpful. Some models are expensive to train, and stopping/resuming model training within a SMAC trial should be supported.

Background: I'm training a model and returning its performance in the tae_runner _call_ta. The model training might crash or might be stopped manually to be resumed later.

BugsBuggy avatar Oct 29 '21 15:10 BugsBuggy

Hi, sorry for the late response. For your first question, how to let SMAC continue the exact configuration after resuming: unfortunately, I have no concrete idea. This line of code: https://github.com/automl/SMAC3/blob/d4cb7ed76e0fbdd9edf6ab5360ff75de67ac2195/smac/intensification/intensification.py#L678

ensures that _get_inc_available_inst should not return None. Maybe you could check your instance set? If you don't have instances, you could consider manually adding the crashed configuration to the initial design: https://github.com/automl/SMAC3/blob/master/smac/initial_design/initial_design.py#L33

For your second question, you could have your tae create a directory for each run, mapping the current run info to a path (for instance, /cfg_id/instance_id/random_seed). If a checkpoint from the previous run is found there, your tae should be able to continue that run.
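A minimal sketch of such a directory mapping, using only the Python standard library. None of these helpers are part of SMAC: `config_key`, `checkpoint_dir`, `load_checkpoint`, `save_checkpoint`, and the `checkpoint.json` file name are hypothetical, and the configuration is assumed to be available as a plain dictionary so that it can be hashed into an identifier that stays the same across restarts.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def config_key(config: dict) -> str:
    """Derive a stable identifier from a configuration dictionary (hypothetical helper)."""
    # sort_keys makes the key independent of dictionary ordering.
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()[:12]


def checkpoint_dir(root: Path, cfg_id: str, instance_id: str, seed: int) -> Path:
    """Map one run (configuration, instance, seed) to its own directory."""
    return root / cfg_id / instance_id / str(seed)


def load_checkpoint(run_dir: Path) -> Optional[dict]:
    """Return the saved state for this run, or None if no checkpoint exists."""
    ckpt = run_dir / "checkpoint.json"
    if not ckpt.exists():
        return None
    return json.loads(ckpt.read_text())


def save_checkpoint(run_dir: Path, state: dict) -> None:
    """Write the state to a temporary file, then rename it over the checkpoint."""
    run_dir.mkdir(parents=True, exist_ok=True)
    tmp = run_dir / "checkpoint.json.tmp"
    tmp.write_text(json.dumps(state))
    # Renaming avoids leaving a half-written checkpoint if the process dies mid-write.
    tmp.replace(run_dir / "checkpoint.json")
```

Inside the tae, one would call `load_checkpoint(checkpoint_dir(...))` before training starts and `save_checkpoint(...)` at regular intervals. A real model checkpoint would replace the JSON state used here.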

dengdifan avatar Nov 11 '21 09:11 dengdifan

You can add logic to the target function that checks for existing checkpoints of your model so that training continues from there. SMAC itself is able to resume the last run.
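As a rough illustration of that checkpoint check, here is a toy training loop that picks up from a saved epoch. It does not use SMAC at all: `train_with_resume`, the JSON checkpoint format, the halving "loss", and the `crash_at` parameter (which only simulates a crash) are assumptions made for the sketch. A real target function would load and save model weights instead.

```python
import json
from pathlib import Path


def train_with_resume(ckpt_path: Path, total_epochs: int, crash_at: int = -1) -> float:
    """Toy training loop that resumes from ckpt_path if a checkpoint exists.

    crash_at simulates a crash right before the given epoch (demonstration only).
    """
    # Start from scratch unless a previous run left a checkpoint behind.
    state = {"epoch": 0, "loss": 1.0}
    if ckpt_path.exists():
        state = json.loads(ckpt_path.read_text())

    for epoch in range(state["epoch"], total_epochs):
        if epoch == crash_at:
            raise RuntimeError("simulated crash")
        # Stand-in for one epoch of real training: halve the loss.
        state = {"epoch": epoch + 1, "loss": state["loss"] * 0.5}
        # Persist progress after every epoch so a later call can continue here.
        ckpt_path.write_text(json.dumps(state))

    return state["loss"]
```

If the first call crashes partway through, calling the function again with the same `ckpt_path` continues from the last completed epoch and returns the same value as an uninterrupted run.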

alexandertornede avatar Mar 30 '23 08:03 alexandertornede