
Create a clear failure message when the instance is aborted due to CPU inactivity

Open PGijsbers opened this issue 3 years ago • 2 comments

I might be missing something here, but sometimes EC2 instances get aborted after showing CPU inactivity for an extended duration. In my case, failures.csv reads e.g.:

tunedrandomforest:2021Q3,openml/s/269,1h8c_gp3,Yolanda,0,1475731149,Aborting instance i-0d4171c302da6bfdb for job aws.openml_s_269.1h8c_gp3.Yolanda.0.TunedRandomForest.
tunedrandomforest:2021Q3,openml/s/269,1h8c_gp3,Yolanda,0,1475731149,"No result artifacts, either the benchmark failed to start, or the instance got killed: check the local logs to understand what happened on the instance."

When we read the log, we get a clearer message: [WARNING] [amlb.runners.aws:20:14:34.188] WARN: Instance i-0d4171c302da6bfdb (aws.openml_s_269.1h8c_gp3.yolanda.0.tunedrandomforest) has no CPU activity in the last 30 minutes.

Is there a particular reason we could not have this clearer message in failures.csv directly? If not, it seems to me that the second message would be much more informative in failures.csv.

PGijsbers avatar Dec 06 '21 15:12 PGijsbers

The reason is purely technical: proper error handling is one of the most difficult things to get right the first time. In this case, the thread that monitors CPU and prints the log message is not the same as the one that runs the job and raises the error, so the reason is lost at that point (the first loss of information). On top of this, there is a bug (or two) in the fact that this is translated into an AWSError when it should probably be a JobError, although this can be discussed:

  • if translated into JobError, then no entry is created in results.csv, only an entry in failures.csv: this is usually reserved for errors that depend on instance creation issues (no instance available, …), job cancellations, or jobs that go (hopefully never) into an invalid state.
  • by translating it into an AWSError, 2 contradictory things happen: (1) we create an error entry in results.csv, (2) by not propagating the exception, we still try to download the results from S3; this fails and the original error is replaced with that one, leading to the "No result artifacts…" message.

To sum up, many issues/difficulties here:

  1. we need to decide clearly what should happen for this kind of hanging-instance error: results.csv entry or not? If there is an entry, we don't retry it, and we consider that the hanging is almost always caused by the framework (e.g. out of memory). If there is no entry, we consider that the hanging may be due to the EC2 instance; then it goes only to failures.csv and we should probably retry those at least once.
  2. once we have defined the behaviour:
  • transmit the error message from monitoring thread to job thread.
  • translate this to the correct error type.
  • if translating to AWSError then we should skip attempting to download results in this case, or do it in such a way that it can't override the original error.
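
The three steps above could be sketched roughly as follows. This is only an illustration, not the actual amlb code: AbortReason, monitor and finish_job are hypothetical names, and the JobError and AWSError classes here are stand-ins for the real ones. The idea is that the monitoring thread records why it aborted the instance, and the job thread raises that reason before it attempts any download, so a later download failure cannot replace it.

```python
import threading


class JobError(Exception):
    """Hypothetical stand-in for the benchmark's job-level error."""


class AWSError(Exception):
    """Hypothetical stand-in for the benchmark's AWS-level error."""


class AbortReason:
    """Thread-safe slot filled by the monitoring thread and read by the job thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._message = None

    def set(self, message):
        with self._lock:
            # Keep the first reason only: later messages must not override it.
            if self._message is None:
                self._message = message

    def get(self):
        with self._lock:
            return self._message


def monitor(instance_id, reason, idle_minutes=30):
    # Assumption: called from the monitoring thread once inactivity is detected.
    reason.set(
        f"Instance {instance_id} has no CPU activity "
        f"in the last {idle_minutes} minutes."
    )


def finish_job(reason, download_results, error_type=AWSError):
    """Job-thread side: raise the recorded reason and skip the download if aborted."""
    message = reason.get()
    if message is not None:
        # error_type is a parameter because the choice between AWSError and
        # JobError is still an open question in this discussion.
        raise error_type(message)
    return download_results()
```

Whichever error type is chosen, the message that reaches failures.csv would then be the CPU inactivity one, since the download is never attempted for an aborted instance.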

sebhrusen avatar Dec 07 '21 19:12 sebhrusen

I think it gets even a little more difficult, since I feel that an instance which was interrupted before the end of the time budget should maybe be treated differently from one that had already (far) exceeded the time budget. For our current setup we had already decided to retry all cases, so that's what I had already started doing. But for the future maybe we can write something to dredge through our logs and see how often (if at all) a process recovers from low CPU activity (i.e. it had 15 minutes of inactivity, but finished and/or showed activity before the 30 minute mark). That might help us make a more informed decision (since if we find experimentally that processes never recover, then I think marking it in results.csv is not wrong).
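
A first pass at such a log scan might look like the sketch below. It assumes the logs contain warnings in the same format as the 30 minute one quoted above, also at shorter inactivity durations, which I have not verified; max_inactivity and split_by_threshold are hypothetical helper names. An instance whose longest reported inactivity stays below the abort threshold is only a candidate for "recovered": you would still need to check that its job actually finished.

```python
import re
from collections import defaultdict

# Matches the warning format quoted earlier in this issue.
WARN_RE = re.compile(
    r"WARN: Instance (?P<instance>\S+) \((?P<job>[^)]+)\) "
    r"has no CPU activity in the last (?P<minutes>\d+) minutes"
)


def max_inactivity(lines):
    """Return the longest reported inactivity (in minutes) per instance id."""
    longest = defaultdict(int)
    for line in lines:
        match = WARN_RE.search(line)
        if match:
            instance = match.group("instance")
            minutes = int(match.group("minutes"))
            longest[instance] = max(longest[instance], minutes)
    return dict(longest)


def split_by_threshold(longest, abort_after=30):
    """Split instances into those below and those at/over the abort threshold."""
    below = sorted(i for i, m in longest.items() if m < abort_after)
    reached = sorted(i for i, m in longest.items() if m >= abort_after)
    return below, reached
```

Running max_inactivity over the lines of each log file and comparing the sizes of the two lists would give a rough idea of how often low CPU activity is followed by an abort.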

PGijsbers avatar Dec 08 '21 09:12 PGijsbers