syne-tune
Gracefully deal with SageMaker Failures
A SageMaker training job failed for what looks like a random reason, and this seems to break the tuner:
File "/opt/conda/lib/python3.8/site-packages/syne_tune/tuner.py", line 152, in run
new_done_trial_statuses, new_results = self._process_new_results(
File "/opt/conda/lib/python3.8/site-packages/syne_tune/tuner.py", line 282, in _process_new_results
done_trials_statuses = self._update_running_trials(trial_status_dict, new_results, callbacks=self.callbacks)
File "/opt/conda/lib/python3.8/site-packages/syne_tune/tuner.py", line 437, in _update_running_trials
assert trial_id in self.last_seen_result_per_trial, \
AssertionError: trial 35 completed and no metrics got observed
It would be great to retry such jobs, or at least ignore them and continue somehow.
We throw an error in this case because we expect a trial that runs to completion to publish at least one metric.
I believe it is better to fail here than to keep spending resources, since this is almost certainly an indication of a user error and not an error we can easily recover from.
Regarding retrying the job, I think it would waste resources. At least if the trial is deterministic, the same result should happen again (the trial ran and no metric was in the output).
One thing we could do in this case, though, is show the trial log (as we do when an error occurs in the user script), so that what went wrong is clear to the user.
To be clear, the job did not fail because of a bug in the code but because of SageMaker. It did not even start executing the code, i.e. there were no logs in CloudWatch. Or at least that is my understanding.
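To make the distinction discussed so far concrete, here is a minimal sketch of a policy that treats "no metrics but the script ran" differently from "no metrics and no logs at all". It is only an illustration: `FinishedTrial`, `Action`, `decide` and `max_attempts` are hypothetical names, not part of syne-tune or the SageMaker SDK, and using "no logs" as a signal for an infrastructure failure is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"            # trial reported metrics, nothing to do
    RAISE = "raise"              # likely user error, stop the tuning
    RETRY = "retry"              # likely infrastructure failure, relaunch
    MARK_FAILED = "mark_failed"  # give up on this trial but keep tuning


@dataclass
class FinishedTrial:
    # Hypothetical record of a completed trial (not a syne-tune class).
    trial_id: int
    num_metrics: int   # number of metric reports received from the trial
    has_logs: bool     # whether the job produced any log output
    attempts: int = 1  # how many times the trial has been launched so far


def decide(trial: FinishedTrial, max_attempts: int = 3) -> Action:
    """Pick what to do with a trial that has stopped running."""
    if trial.num_metrics > 0:
        return Action.ACCEPT
    if trial.has_logs:
        # The script ran but never reported a metric. A deterministic
        # trial would do the same again, so retrying wastes resources.
        return Action.RAISE
    # No metrics and no logs: assume the job never started the script.
    if trial.attempts < max_attempts:
        return Action.RETRY
    return Action.MARK_FAILED
```

With this policy, a trial like the one in the traceback (no metrics, no logs, first attempt) would be relaunched, while a trial whose script ran without reporting anything would still stop the tuner.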
OK thanks, I see. Do you have any additional details on the job that did not produce logs? (For instance, was it not started because of an internal SageMaker error?)
Closing, as this cannot be reproduced easily and we did not get the log.