Check restarting/handling of pending config when resuming a run
For potential reproducibility of the observed issue:
- Ran Random Search for 20 (`max_evaluations_total`) evaluations distributed across 4 workers.
- Midway through the run, killed a worker and restarted it soon afterwards.
- The overall run completed fine, but I noticed certain anomalies, described below.
- The process termination halted a config, for example config ID `16`.
- On restarting, the 4 workers proceeded without errors, but an extra config ID `21` was generated, while config ID `16` was never re-evaluated or completed and remains `pending` forever.
Some more observations:
- For `max_evaluations_total=20` we should have config IDs from 1-20, each with its own `result.yaml`.
- Only `config_16` does not have a `result.yaml`, whereas `config_21` does.
- If I now re-run a worker with `max_evaluations_total=21`, it satisfies the extra evaluation required by sampling a new config, `config_22`.

Should a new worker re-evaluate pending configs as a priority?

Also, under this scenario the generated config IDs range over [1, n+1] if `max_evaluations_total=n`.
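For illustration, here is a small sketch of how one could spot such a stuck config by scanning for config folders without a `result.yaml`. The directory layout assumed here (`<root_directory>/results/config_*/result.yaml`) is inferred from the observations above and may differ between NePS versions.

```python
from pathlib import Path

# Sketch only: adjust the path to your run's root directory.
root = Path("path/to/root_directory/results")

for config_dir in sorted(root.glob("config_*")):
    if not (config_dir / "result.yaml").exists():
        # A config folder without a result.yaml is what NePS treats as pending.
        print(f"{config_dir.name}: no result.yaml (still pending)")
```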
This happens when the process is force-killed during the evaluation of a config, and is reproducible with a single process.

To reproduce:
1. Choose an algorithm with very low overhead, e.g. Random Search.
2. Write a `run_pipeline(...)` function that takes a relatively long time compared to the algorithm overhead, e.g. `time.sleep(10)`.
3. Run `neps.api.run`. The other arguments don't matter; this should reproduce (see the sketch after these steps).
4. Watching the logs, terminate the process once the algorithm enters the evaluation phase, indicated by the log `Start evaluating config ...`. Otherwise, refine steps 1 and 2 to increase your chance of terminating during evaluation.
5. If after termination there is a config with a missing `result.yaml` file, you have successfully interrupted an evaluation.
6. Re-run the process to see the effect described.
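For reference, a minimal sketch of what steps 1-3 could look like. The exact `neps.run` keyword arguments, the `FloatParameter` class, and the `"random_search"` searcher string are assumptions about the NePS API at the time of this issue and may differ in other versions.

```python
import time

import neps


def run_pipeline(x: float) -> float:
    # Step 2: the evaluation is long compared to the sampler's overhead, so a
    # forced termination is very likely to land mid-evaluation.
    time.sleep(10)
    return x  # dummy loss


# Step 1: a low-overhead searcher (Random Search) over a trivial space.
pipeline_space = {"x": neps.FloatParameter(lower=0.0, upper=1.0)}

if __name__ == "__main__":
    # Step 3: run, then kill the process once "Start evaluating config ..." appears.
    neps.run(
        run_pipeline=run_pipeline,
        pipeline_space=pipeline_space,
        root_directory="results/repro_pending_config",
        max_evaluations_total=20,
        searcher="random_search",
    )
```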
Alternatively, you can skip steps 1-5 and manually delete a `result.yaml` file from any config folder to make NePS think there is a pending config that some mysterious other process is handling right now.
There have been some developments here:
- If there is a configuration (`Trial`) marked as pending, the next available worker will pick it up instead of sampling a new configuration. https://github.com/automl/neps/blob/08f30aef0a58f55339c4f5337d4ebd333206d621/neps/runtime.py#L169-L187
- Killing a worker mid-evaluation is interesting. Right now, if a configuration errors and the worker can record that the configuration evaluation crashed, it will be marked as such and there will be a `result.yaml` for it, indicating the configuration crashed. This will not be re-attempted.

However, what should happen if you `Ctrl+c` a worker that is currently evaluating a configuration? This is not a fault of the configuration, so it should probably be re-evaluated. The current behaviour is that the configuration stays in the `EVALUATING` state forever.
Fixing this is non-trivial, although there's some patchwork to make this less bad.

a) When a `ctrl+c` happens, the worker immediately kills the configuration evaluation, and its one remaining task before exiting is to tell `NePSState` that the configuration is no longer `EVALUATING` and to set it back to `PENDING`, so that the config can be picked up again. There is no chance of saving a checkpoint here and resuming from this partial state, and supporting anything like that would add a lot of complication. Maybe in the future this can be revisited.
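To make the intent of (a) concrete, here is a rough sketch of the signal-handling idea. `NePSState`, the `set_trial_state` method, and the string state names used here are illustrative stand-ins, not the actual NePS runtime API.

```python
import signal
import sys


class Worker:
    """Sketch of a worker that releases its in-flight trial on interruption."""

    def __init__(self, neps_state):
        self.neps_state = neps_state   # shared state store (illustrative)
        self.current_trial = None      # trial currently being evaluated

    def _on_interrupt(self, signum, frame):
        # On SIGINT (Ctrl+C) or SIGTERM (e.g. from SLURM): abandon the current
        # evaluation and flip the trial from EVALUATING back to PENDING so
        # another worker can pick it up. No partial checkpoint is saved.
        if self.current_trial is not None:
            self.neps_state.set_trial_state(self.current_trial, "PENDING")
        sys.exit(130 if signum == signal.SIGINT else 143)

    def install_signal_handlers(self):
        signal.signal(signal.SIGINT, self._on_interrupt)
        signal.signal(signal.SIGTERM, self._on_interrupt)
```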
I'll implement the `ctrl+c` handler and consider this issue done, as far as we can go for now.
I did an implementation in #129 which should be robust enough for common occurrences, Ctrl+C as well as SLURM, which sends process signals. The only exception is `SIGKILL`, which really is just a hard kill and there's no way around it. Most default actions, however, do not send a `SIGKILL`.