SMAC3
SMAC3 copied to clipboard
Resume previous configuration on crash/stop
Hello together, I'm wondering whether it is natively supported by SMAC for HPO to continue evaluating a configuration that crashed in the previous run.
I found the link to restore the state helpful as a first step. However that is not enough since I want to continue running the exact configuration that crashed in the previous run since I can continue model training with checkpoints. It seems to me that smbo.py supports running the incumbent again but setting the intensifier stage to IntensifierStage.PROCESS_INCUMBENT_RUN
does not resolve the issue, since _get_inc_available_inst
(intensification.py) returns None
and SMAC samples a new configuration instead resuming with the previous configuration.
Is there a fix for this? Also, an example in the documentation might be helpful. Some models are expensive to train and stopping/resuming model training within a SMAC trial should be suppoerted.
Background: I'm training a model and returning its performance in the tae_runner _call_ta
. The model training might crash or might be stopped manually to be resumed later.
Hi, sorry for the late response. For your first question: how to let SMAC continue the exact configuration after resuming. Unfortunately I have no concrete idea, this line of code: https://github.com/automl/SMAC3/blob/d4cb7ed76e0fbdd9edf6ab5360ff75de67ac2195/smac/intensification/intensification.py#L678
ensures that the return value from _get_inc_available_inst
should not return None. Maybe you could check your instances set?
If you don't have instance, you could consider manually adding that crashed configuration as initial_design:
https://github.com/automl/SMAC3/blob/master/smac/initial_design/initial_design.py#L33
For your second question, you could consider to ask tae
to create a new directory that maps your current run info to the new directory (for instance, /cfg_id/intance_id/random_seed), if the check point from previous run is found there, then your tae
should be able to continue the previous run.
You can implement a part in the target function that checks for checkpoints of your model such that you can continue training from there. SMAC itself is able to resume the last run.