Gracefully deal with SageMaker Failures

Open wistuba opened this issue 2 years ago • 4 comments

A SageMaker training job failed for an unknown reason, which seems to break the tuner:

  File "/opt/conda/lib/python3.8/site-packages/syne_tune/tuner.py", line 152, in run
    new_done_trial_statuses, new_results = self._process_new_results(
  File "/opt/conda/lib/python3.8/site-packages/syne_tune/tuner.py", line 282, in _process_new_results
    done_trials_statuses = self._update_running_trials(trial_status_dict, new_results, callbacks=self.callbacks)
  File "/opt/conda/lib/python3.8/site-packages/syne_tune/tuner.py", line 437, in _update_running_trials
    assert trial_id in self.last_seen_result_per_trial, \
AssertionError: trial 35 completed and no metrics got observed

It would be great to retry such jobs, or at least to ignore them and continue.
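
To make the "ignore and continue" idea concrete, here is a minimal sketch, assuming the tuner keeps a mapping from trial ID to the last result it saw (as the assertion above suggests). The helper `split_finished_trials` is hypothetical and not part of Syne Tune; it only illustrates replacing the hard assertion with a warning.

```python
import logging
from typing import Dict, List, Tuple

logger = logging.getLogger(__name__)


def split_finished_trials(
    finished_trial_ids: List[int],
    last_seen_result_per_trial: Dict[int, dict],
) -> Tuple[List[int], List[int]]:
    """Separate finished trials that reported metrics from those that did not.

    Hypothetical helper for illustration only, not Syne Tune's API.
    """
    with_metrics: List[int] = []
    without_metrics: List[int] = []
    for trial_id in finished_trial_ids:
        if trial_id in last_seen_result_per_trial:
            with_metrics.append(trial_id)
        else:
            # Log instead of asserting, so one bad job does not stop the tuner.
            logger.warning(
                "trial %s completed and no metrics got observed; treating it as failed",
                trial_id,
            )
            without_metrics.append(trial_id)
    return with_metrics, without_metrics
```

The second list could then be reported to the scheduler as failed trials, or resubmitted, depending on the policy chosen.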

wistuba avatar May 13 '22 21:05 wistuba

We throw an error in this case because we expect a trial to publish at least one metric when run fully.

I believe it is better to fail in this case than to keep spending resources, given that this almost certainly indicates a user error and is not an error we can easily recover from.

geoalgo avatar May 23 '22 12:05 geoalgo

Regarding retrying the job, I think it would be a waste of resources. If the trial is deterministic, the same result should happen again (the trial ran and no metric was in the output).

One thing we could do in this case, though, is show the trial log (as we do when an error occurs in the user script), so that it is clear to the user what went wrong.

geoalgo avatar May 23 '22 12:05 geoalgo

To be clear, the job did not fail because of a bug in the code but because of SageMaker. It did not even start executing the code, i.e. there are no logs in CloudWatch. At least, this is my understanding.

wistuba avatar May 24 '22 07:05 wistuba

OK, thanks, I see. Do you have any additional details on the job that did not produce logs? (For instance, was it not started because of an internal SageMaker error?)
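
One way to gather that detail, sketched under assumptions: the boto3 `describe_training_job` response includes `TrainingJobStatus` and, for failed jobs, a `FailureReason` string. The prefixes treated as infrastructure failures below are an assumption based on commonly seen SageMaker messages, and both function names are hypothetical.

```python
from typing import Any, Dict

# Assumption: failure reasons starting with these prefixes point at the
# platform rather than the user script. Adjust to what you observe.
INFRASTRUCTURE_PREFIXES = ("InternalServerError", "CapacityError")


def fetch_job_description(job_name: str) -> Dict[str, Any]:
    """Fetch the job description from SageMaker (requires boto3 and credentials)."""
    import boto3  # imported lazily so the classifier below works without boto3

    client = boto3.client("sagemaker")
    return client.describe_training_job(TrainingJobName=job_name)


def looks_like_infrastructure_failure(description: Dict[str, Any]) -> bool:
    """Return True if a failed job's reason matches an assumed platform prefix."""
    if description.get("TrainingJobStatus") != "Failed":
        return False
    reason = description.get("FailureReason", "")
    return reason.startswith(INFRASTRUCTURE_PREFIXES)
```

Only jobs for which this returns True would be candidates for a retry; a failure caused by the user script would still stop the tuner.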

geoalgo avatar May 24 '22 08:05 geoalgo

Closing, as this cannot be reproduced easily and we did not get the log.

geoalgo avatar Aug 31 '22 09:08 geoalgo