
feature request: More robust error recovery

Open hectorpal opened this issue 2 years ago • 5 comments

Hi there!

The FAQ (latest version) says:

Some runs failed. How can I rerun them?

If the failed runs were never started, for example, due to grid node failures, you can simply run the “start” experiment step again. It will skip all runs that have already been started. Afterwards, run “fetch” and make reports as usual. Lab detects which runs have already been started by checking if the driver.log file exists. So if you have failed runs that were already started, but you want to rerun them anyway, go to their run directories, remove the driver.log files and then run the “start” experiment step again as above.
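For concreteness, the manual procedure could be scripted along these lines. This is only a sketch assuming Lab's usual `<exp-dir>/runs-*/<run-id>/` layout; the helper name and run IDs are invented:

```python
from pathlib import Path

def mark_runs_for_rerun(exp_dir, run_ids):
    # Remove driver.log so that Lab's "start" step treats these runs
    # as never started and executes them again.
    for run_id in run_ids:
        for driver_log in Path(exp_dir).glob(f"runs-*/{run_id}/driver.log"):
            driver_log.unlink()
            print(f"Will rerun {driver_log.parent}")

mark_runs_for_rerun("my-exp", ["00042", "00107"])
```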

It would be nice to have an option that makes restarting an experiment idempotent, i.e., to automate things so that restarting a failed run preserves the integrity of the experiment without manual deletion of files like driver.log. That would be useful when running Lab on computing infrastructure where jobs can be preempted in favour of higher-priority tasks. (This is typical in settings where many of the other tasks are training jobs, which are themselves idempotent.)

If that is not convenient as the default behaviour, perhaps it could be enabled by an additional option.

I understand a potential issue is that some runs can just keep failing, so reaching idempotence is perhaps more subtle, but it would be a great feature.

/cc @matgreco @alvaro-torralba

hectorpal avatar Dec 16 '22 17:12 hectorpal

Hey @hectorpal, thanks for the suggestion! I'm not sure how to improve the status quo on this, however. It's easy to detect runs that have not been started. But as you say, it's tricky to check whether a run was successful. I don't see a general way of doing so. Do you? The main problem is that we need to count running out of time or memory as a successful run. Those are "expected errors", so to speak.

jendrikseipp avatar Dec 19 '22 06:12 jendrikseipp

The problem we were having occurs when a job is interrupted for external reasons (e.g., the cluster preempting the job). I was thinking that perhaps the run.py script could do something at the very end, after closing the run.log and run.err files: write some extra property, for example 'job_finished'. That way it would be easy to check which jobs terminated in the "normal" way and which were killed while run.py was running. But I'm not sure whether this works under the current way of setting the time and memory limits.

alvaro-torralba avatar Dec 19 '22 08:12 alvaro-torralba

I agree with Alvaro about this general idea: at the very end, no matter what happens with the run, idempotency might be achieved by creating a file marking that the run is done. File creation is atomic. Even if the run failed because of memory, it should be possible to create a file ('job_finished') before freeing the parallel worker. In this setting, work can be lost if a run ends before 'job_finished' is created; that's the price to pay for idempotency.
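A minimal sketch of the marker idea; the name 'job_finished' is the one suggested above, and none of this exists in Lab yet:

```python
from pathlib import Path

FINISHED_MARKER = "job_finished"  # hypothetical name, not part of Lab

def mark_run_finished(run_dir):
    # Called unconditionally at the very end of a run, whether it
    # produced a plan, hit the time/memory limit, or crashed.
    # Path.touch() boils down to a single open() call, so creating
    # the (empty) marker file is atomic on POSIX file systems.
    (Path(run_dir) / FINISHED_MARKER).touch()

def run_is_finished(run_dir):
    return (Path(run_dir) / FINISHED_MARKER).exists()
```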

How?

Idea 1

I wonder if the right place is the end of wait() here: https://github.com/aibasel/lab/blob/dfa67faa867e0835c871c082c6e0128bb976e7cf/lab/calls/call.py#L190-L213

That method waits for the Popen call to return, and it should return something no matter what happens with the process. One idea would be to create the job_finished file right before returning. Perhaps that's not enough, though, since the return value still has to be stored later. So here is another idea:

Idea 2

job_finished is created at the highest-level point, right before switching to the next run. This is compatible with the run producing a plan or failing because of bounded memory, bounded time, or even an execution failure. See the sketch below.
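In pseudo-structure, using the helpers sketched above (`process_run` and `runs` are placeholders, not Lab's actual API):

```python
def process_all_runs(runs):
    for run_dir in runs:
        if run_is_finished(run_dir):
            # Idempotent restart: a completed run is skipped, whether
            # it solved the task or ran out of time or memory.
            continue
        try:
            process_run(run_dir)
        finally:
            # Executed no matter how the run ended, so only runs
            # killed externally (e.g., preempted mid-run) lack the
            # marker and are rerun on the next "start".
            mark_run_finished(run_dir)
```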

hectorpal avatar Dec 19 '22 18:12 hectorpal

Your second proposal sounds like it could work. I'll think more about this after the break.

jendrikseipp avatar Dec 23 '22 15:12 jendrikseipp

Good!

I was wondering about race conditions when using multiple CPU cores.

I guess the iteration over runs is centralized, so there isn't much to coordinate. Otherwise, I was wondering whether a per-run lock on the directory is necessary or already in use. If that is happening, it would interact with the idea I proposed.
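If per-run locking were ever needed, atomic exclusive file creation would be enough to coordinate workers. A sketch, with an invented ".lock" file name:

```python
import os

def try_lock_run(run_dir):
    # os.O_CREAT | os.O_EXCL makes creation fail if the file already
    # exists, so at most one worker acquires the lock for this run.
    try:
        fd = os.open(os.path.join(run_dir, ".lock"),
                     os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False
```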

Happy holidays!

hectorpal avatar Dec 23 '22 17:12 hectorpal