
Failed trials have out of date metrics

Open austinmw opened this issue 2 years ago • 3 comments

Hi, I'm using SageMaker as a backend and remote launcher. I noticed that if a job errors out during training, the latest performance logs will not be captured.

For example, in my HPO experiment on the CIFAR-10 dataset, one trial (number 8) was reported in the Syne Tune results dataframe as achieving a validation accuracy of 0.8478 at epoch 22:

image

However, my CloudWatch logs show that the validation accuracy actually reached 0.926 at epoch 60 before the job crashed:

image

image

Interestingly, the job shows as Stopped rather than Failed in the SageMaker console. Does Syne Tune notice an exception and stop the job before it exits with a failure?

austinmw avatar May 26 '22 00:05 austinmw

When a trial is stopped by a scheduler such as ASHA, a stopping signal is sent to the trial. If the trial emits more metrics before it sees this signal, those metrics are not communicated to the scheduler and are not shown in the summary table (they are, however, taken into account when showing the best trial at the end).

The reason for not showing these results to schedulers is that most of them do not support acting on results that arrive after they have issued a stopping decision.
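To make the behaviour described above concrete, here is a toy model in plain Python. It is only a sketch and not Syne Tune's actual code: `TrialLog` and `on_result` are hypothetical names, and the metric values are the ones from the original report. Results that arrive once a trial is marked as stopped are set aside rather than forwarded.

```python
from dataclasses import dataclass, field


@dataclass
class TrialLog:
    """Hypothetical per-trial record; not a Syne Tune class."""
    stopped: bool = False
    seen_by_scheduler: list = field(default_factory=list)
    dropped: list = field(default_factory=list)


def on_result(log: TrialLog, result: dict) -> None:
    """Forward a result only while the trial has not been marked as stopped."""
    if log.stopped:
        log.dropped.append(result)  # arrived after the stop decision
    else:
        log.seen_by_scheduler.append(result)


log = TrialLog()
on_result(log, {"epoch": 22, "val_acc": 0.8478})
log.stopped = True  # assumption: the stop decision is registered here
on_result(log, {"epoch": 60, "val_acc": 0.926})

# Best value among the results that were actually forwarded
visible_best = max(r["val_acc"] for r in log.seen_by_scheduler)
```

In this toy model the epoch 60 result ends up in `dropped`, so the forwarded history stops at epoch 22.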

geoalgo avatar May 31 '22 09:05 geoalgo

One thing we could do is report results in this table even after a stopping decision, even though the scheduler would not see those results.

geoalgo avatar May 31 '22 09:05 geoalgo

Thanks for your explanation! I think in some cases it could help to report them. For example, in my case only some profiling printouts failed at the end of training for a few trials, and only after digging deeper did I realize that those trials actually performed the best.
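One way to keep a non-essential step such as a profiling printout from failing an otherwise good trial is to guard it in the training script. The sketch below is plain Python and does not use any Syne Tune API; `run_non_critical` and `print_profile` are hypothetical names made up for illustration.

```python
import logging


def run_non_critical(step, *args, **kwargs):
    """Run an optional step; log any exception instead of letting it propagate."""
    try:
        return step(*args, **kwargs)
    except Exception:
        name = getattr(step, "__name__", repr(step))
        logging.exception("Non-critical step %s failed; continuing", name)
        return None


def print_profile(stats):
    # Hypothetical profiling printout; raises KeyError if the key is missing.
    print(f"throughput: {stats['samples_per_sec']:.1f} samples/s")


# The KeyError is logged with its traceback, and the script keeps running.
result = run_non_critical(print_profile, {})
```

The trade-off is that errors in guarded steps only show up in the logs, so this is best reserved for code whose failure does not affect the reported metrics.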

austinmw avatar May 31 '22 12:05 austinmw

Hello @austinmw, can we close this issue? I think that whenever a trial script fails, Syne Tune will not consider further results coming from it. This is probably a safe behaviour, right?

mseeger avatar Aug 25 '22 09:08 mseeger

Sure, that makes sense to me, thanks.

austinmw avatar Sep 20 '22 13:09 austinmw