syne-tune
Failed trials have out of date metrics
Hi, I'm using SageMaker as a backend and remote launcher. I noticed that if a job errors out during training, its most recent metrics are not captured.
For example, in my HPO experiment on the CIFAR-10 dataset, one trial (number 8) was reported in the Syne Tune results dataframe as achieving a validation accuracy of 0.8478 at epoch 22.
However, my CloudWatch logs show that the validation accuracy actually reached 0.926 at epoch 60 before the job crashed.
Interestingly, the job shows as Stopped rather than Failed in the SageMaker console. Does Syne Tune notice an exception and stop the job before it exits with a failure?
When a trial is stopped by a scheduler such as ASHA, a stopping signal is sent to the trial. If the trial emits metrics before it sees this signal, those metrics are not communicated to the scheduler and are not shown in the summary table (however, they are taken into account when showing the best trial at the end).
The reason for not showing this result to schedulers is that most of them would not support acting on results seen after they emit a stopping decision.
One thing we could do is report results in this table even after a stopping decision, even though the scheduler would not see those results.
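To make the behaviour described above concrete, here is a minimal toy sketch. It is not Syne Tune code: the `ToyTrialLog` class, its method names, and the metric values are all hypothetical, and it only models the bookkeeping as explained in this thread (late results are hidden from the scheduler and the summary table, but still count toward the best result).

```python
from dataclasses import dataclass, field


@dataclass
class ToyTrialLog:
    """Hypothetical stand-in for a tuner's per-trial bookkeeping (not Syne Tune code)."""

    stopped: bool = False
    seen_by_scheduler: list = field(default_factory=list)  # forwarded to the scheduler
    late_results: list = field(default_factory=list)  # emitted after the stop decision

    def on_result(self, epoch: int, val_acc: float) -> None:
        result = {"epoch": epoch, "val_acc": val_acc}
        if self.stopped:
            # The scheduler has already decided: keep the result, do not forward it.
            self.late_results.append(result)
        else:
            self.seen_by_scheduler.append(result)

    def stop(self) -> None:
        self.stopped = True

    def summary_row(self):
        """Last result shown in the summary table (scheduler-visible results only)."""
        return self.seen_by_scheduler[-1] if self.seen_by_scheduler else None

    def best_result(self):
        """Best result over everything that was emitted, late results included."""
        all_results = self.seen_by_scheduler + self.late_results
        return max(all_results, key=lambda r: r["val_acc"]) if all_results else None


# Illustrative values only; they are not taken from the experiment in this issue.
log = ToyTrialLog()
log.on_result(epoch=1, val_acc=0.50)
log.on_result(epoch=2, val_acc=0.60)
log.stop()
log.on_result(epoch=3, val_acc=0.70)  # arrives after the stopping decision
```

In this toy model, `log.summary_row()` still points at epoch 2 while `log.best_result()` picks up epoch 3, which mirrors the mismatch between the summary table and the best-trial report.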
Thanks for your explanation! I think in some cases it could help to report them. For example, in my case only some profiling printouts failed at the end of training for a few trials, but only after digging deeper did I realize that those trials actually performed the best.
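One way to avoid losing a trial to a failure in non-essential code like this is to report metrics first and guard the optional step. The following is only a sketch under assumptions: `run_optional`, `flaky_profiler`, and the `reported` list are hypothetical names, and the list stands in for whatever metric-reporting call the training script actually uses.

```python
import logging

logger = logging.getLogger(__name__)


def run_optional(step, *args, **kwargs):
    """Run a non-essential step; log and swallow any exception it raises."""
    try:
        return step(*args, **kwargs)
    except Exception:
        name = getattr(step, "__name__", repr(step))
        logger.exception("Optional step %s failed; continuing", name)
        return None


def flaky_profiler():
    # Hypothetical stand-in for profiling code that crashes during training.
    raise RuntimeError("profiler output failed")


reported = []  # stands in for metrics reported to the tuner

for epoch in range(1, 4):
    val_acc = 0.5 + 0.1 * epoch  # placeholder for real training and validation
    reported.append((epoch, val_acc))  # report metrics first
    run_optional(flaky_profiler)  # then run anything non-essential
```

With this ordering, the profiler failure is logged but the loop still completes and every epoch's metric is reported.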
Hello @austinmw, can we close this issue? I think that whenever a trial script fails, Syne Tune does not consider any further results coming from it. This is probably safe behaviour, right?
Sure, that makes sense to me, thanks.