
Check restarting/handling of pending config when resuming a run

Neeratyoy opened this issue 1 year ago · 3 comments

To help reproduce the observed issue:

  • Running Random Search for 20 (max_evaluations_total) evaluations distributed across 4 workers
  • Midway through the run, I killed a worker and restarted it shortly after
  • The overall run completed fine, but I noticed certain anomalies, described below:
  1. The process termination halted the evaluation of a config, for example config ID 16
  2. On restarting, the 4 workers proceeded without errors, but an extra config ID 21 was generated, while config ID 16 was never re-evaluated or completed and remains pending forever

Some more observations:

  • For max_evaluations_total=20 we should have config IDs from 1-20, each with its own result.yaml
  • Only config_16 does not have a result.yaml, whereas config_21 does
  • If I now re-run a worker with max_evaluations_total=21, it satisfies the extra evaluation required by sampling a new config, config_22

Should a new worker re-evaluate pending configs as a priority? Also, under this scenario the generated config IDs range over [1, n+1] when max_evaluations_total=n.
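As a quick way to spot this kind of anomaly, a small script like the one below can list which config folders are missing a result.yaml. The results/config_<id> directory layout is an assumption about how NePS organizes its root directory, not something stated in the issue:

```python
from pathlib import Path

# Assumed layout: <root_directory>/results/config_<id>/result.yaml
# (exact folder names may differ between NePS versions).
root = Path("my_neps_run/results")

for config_dir in sorted(root.glob("config_*")):
    if not (config_dir / "result.yaml").exists():
        print(f"{config_dir.name}: no result.yaml -> still pending?")
```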

Neeratyoy avatar Nov 13 '23 17:11 Neeratyoy

This happens when the process is force-killed during the evaluation of a config, and is reproducible with a single process.

To reproduce:

  1. Choose an algorithm which has very low overhead: e.g. Random Search
  2. Write a run_pipeline(...) function which takes a relatively long time compared to the algorithm overhead: e.g. time.sleep(10)
  3. Run neps.api.run. The arguments don't matter; this should reproduce regardless (a minimal sketch of steps 1-3 follows this list)
  4. While watching the logs, terminate the process once the algorithm enters the evaluation phase, i.e. after the log line Start evaluating config .... Otherwise, refine steps 1 and 2 to increase your chance of terminating during evaluation.
  5. If after termination there is a config with a missing result.yaml file, you have successfully interrupted an evaluation.
  6. Re-run the process to see the effect described.
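A minimal sketch of steps 1-3, assuming the neps.run / run_pipeline API and the FloatParameter search-space class available around the time of this issue (the exact names and the expected return format of run_pipeline may differ between NePS versions):

```python
import time

import neps

# Trivial search space so the sampler overhead stays negligible
# (neps.FloatParameter is assumed for this NePS version).
pipeline_space = {
    "x": neps.FloatParameter(lower=0.0, upper=1.0),
}


def run_pipeline(x):
    # Long compared to the Random Search overhead, so there is a wide
    # window in which the process can be killed mid-evaluation.
    time.sleep(10)
    return {"loss": x}


if __name__ == "__main__":
    neps.run(
        run_pipeline=run_pipeline,
        pipeline_space=pipeline_space,
        root_directory="my_neps_run",
        max_evaluations_total=20,
        searcher="random_search",
    )
```

While this runs, kill the process (e.g. with kill -9) as soon as the logs show Start evaluating config ... to interrupt an evaluation; launching the same script from several shells reproduces the multi-worker setup from the original report.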

Alternatively, you can skip steps 1-5 and manually delete a result.yaml file from any config folder to make NePS think there is a pending config that some mysterious other process is handling right now.
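For example, something along these lines simulates the interrupted state (again assuming the config_<id>/result.yaml layout):

```python
from pathlib import Path

# Deleting a result.yaml makes the config look as if some other process is
# still evaluating it (assumed directory layout).
Path("my_neps_run/results/config_3/result.yaml").unlink(missing_ok=True)
```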

karibbov avatar Nov 21 '23 08:11 karibbov

There have been some developments here:

  1. If there is a configuration (Trial) marked as pending, the next available worker will pick it up instead of sampling a new configuration (a rough sketch of this worker behaviour follows the list below). https://github.com/automl/neps/blob/08f30aef0a58f55339c4f5337d4ebd333206d621/neps/runtime.py#L169-L187

  2. Killing a worker mid-evaluation is interesting. Right now, if a configuration errors and the worker can record that the evaluation crashed, the configuration will be marked as such and a result.yaml will be written for it, indicating that it crashed. Such a configuration will not be re-attempted.
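Roughly, the worker behaviour described in the two points above looks like the following sketch. The helper names (get_pending_trials, sample_new_trial, write_result, ...) are made up for illustration and are not the actual functions in neps/runtime.py:

```python
def worker_loop(neps_state):
    while not neps_state.budget_exhausted():
        # Point 1: prefer an existing PENDING trial over sampling a new one.
        pending = get_pending_trials(neps_state)           # hypothetical helper
        trial = pending[0] if pending else sample_new_trial(neps_state)

        trial.state = "EVALUATING"
        try:
            result = evaluate(trial.config)                # the user's run_pipeline
        except Exception as err:
            # Point 2: an error inside the evaluation is recorded as a crash;
            # a result.yaml is written and the trial is never re-attempted.
            trial.state = "CRASHED"
            neps_state.write_result(trial, error=err)      # hypothetical
        else:
            trial.state = "COMPLETED"
            neps_state.write_result(trial, result=result)  # hypothetical
        # A hard kill that lands inside evaluate(...) leaves the trial stuck
        # in EVALUATING with no result.yaml on disk.
```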

However, what should happen if you Ctrl+C a worker that is currently evaluating a configuration? This is not a fault of the configuration, so it should probably be re-evaluated. The current behaviour is that such a configuration stays in the EVALUATING state forever.

Fixing this is non-trivial, although there's some patchwork to make this less bad.

a) When a Ctrl+C happens, the worker immediately kills the configuration evaluation, and its one remaining task before exiting is to tell the NePSState that the configuration is no longer EVALUATING and set it back to PENDING, so that it can be picked up again. There is no chance of saving a checkpoint here and resuming from the partial state; supporting anything like that would also add a lot of complication. Maybe this can be revisited in the future.
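A minimal sketch of that idea, with KeyboardInterrupt standing in for the Ctrl+C path and a hypothetical set_trial_state helper for the NePSState update:

```python
def evaluate_one(neps_state, trial):
    neps_state.set_trial_state(trial, "EVALUATING")    # hypothetical helper
    try:
        result = evaluate(trial.config)
    except KeyboardInterrupt:
        # Ctrl+C is not the config's fault: put it back to PENDING so the
        # next worker can pick it up. No partial checkpoint is kept.
        neps_state.set_trial_state(trial, "PENDING")
        raise
    else:
        neps_state.set_trial_state(trial, "COMPLETED")
        neps_state.write_result(trial, result=result)  # hypothetical
```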

I'll implement the Ctrl+C handler and consider this issue done, as far as we can go for now.

eddiebergman avatar Aug 05 '24 17:08 eddiebergman

I did an implementation in #129 which should be robust to common occurrences: Ctrl+C as well as SLURM, which sends process signals. The only exception is SIGKILL, which really is just a hard kill and there's no way around it. Most default actions, however, do not send a SIGKILL.
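For reference, this is the kind of signal handling involved; a generic Python sketch, not the actual code from #129:

```python
import signal


def release_trial(signum, frame):
    # On SIGINT (Ctrl+C) or SIGTERM (what SLURM sends before a hard kill),
    # the worker gets a chance to set its current trial back to PENDING.
    ...


signal.signal(signal.SIGINT, release_trial)
signal.signal(signal.SIGTERM, release_trial)
# SIGKILL cannot be handled at all: signal.signal(signal.SIGKILL, ...) raises
# an OSError, and the process is terminated immediately regardless.
```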

eddiebergman avatar Aug 05 '24 18:08 eddiebergman