How is run status handled?

Open gpascale opened this issue 1 year ago • 1 comments

❓Question

It's extremely unclear to me how run status (active, finished, failed, etc.) is determined, specifically whether a run is active. In my code, I'm calling report_successful_finish when my model has finished training and testing and I've uploaded the figures I want to, but I can't tell if this actually impacts the state. Most of my runs automatically transition to the finished state, but not always. Does this happen automatically when the process exits? When the run object is destroyed?

My dashboard is littered with week-old runs that still show as in progress. In some cases, maybe the processes crashed? I can't tell. I've tried using the CLI to "close" them with little success: usually it reports no errors, but the run still shows as in progress.

I've searched extensively through the documentation but I hardly see anything about this.

Jun 25 '24 15:06 gpascale

Hey @gpascale! Sorry for the delayed response, and thanks for the question. We try to automatically transition the run to the finished state when the process exits (even if exceptions are thrown). But there are cases where the process hangs or is killed, and in those cases we can't do much.
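
To make the "transition on exit" idea concrete, here is a minimal sketch of how a tracker could finalize a run on normal exit or on an exception. `TrackedRun` and `train` are hypothetical stand-ins written for this illustration; this is not Aim's actual implementation or API.

```python
import atexit


class TrackedRun:
    """Hypothetical stand-in for a tracked run (not Aim's Run class)."""

    def __init__(self, name):
        self.name = name
        self.status = "active"
        # Runs at normal interpreter shutdown, including after an
        # uncaught exception. It does NOT run if the process is killed
        # by an unhandled signal (e.g. SIGKILL) or calls os._exit().
        atexit.register(self.finish)

    def finish(self):
        # Idempotent: calling it twice leaves the status unchanged.
        if self.status == "active":
            self.status = "finished"

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Finalize even when the body raised, then let the error propagate.
        self.finish()
        return False


def train(run, fail=False):
    """Hypothetical training step that may raise."""
    if fail:
        raise RuntimeError("training failed")
    return "ok"
```

With a pattern like this, a crash inside `with TrackedRun("exp") as run: train(run, fail=True)` still leaves the run finished. A hard kill or a hang never reaches `finish()`, which would match the stuck "in progress" runs described in the question.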

However, we also have a background task in the aim up command as a backup plan: it checks for runs that have stayed in the active state while no other process is holding locks for that run (which is the case when the process is killed). So the only unhandled case should be when the process hangs. If you can provide more details on how specifically these cases happen, maybe I can provide more help or try to reproduce it on my end to see what's going wrong.
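
The backup check described above can be sketched roughly as follows. This only illustrates the logic, under the assumption that each run records which process holds its lock. `RunRecord`, `sweep_stale_runs`, and `is_alive` are hypothetical names, not Aim internals.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class RunRecord:
    """Hypothetical record of a run's state (not Aim's data model)."""
    run_hash: str
    status: str                     # "active" or "finished"
    lock_owner_pid: Optional[int]   # None when nothing holds the lock


def sweep_stale_runs(
    runs: Iterable[RunRecord],
    is_alive: Callable[[int], bool],
) -> List[str]:
    """Mark active runs as finished when no live process holds their lock.

    Returns the hashes of the runs that were closed by this sweep.
    """
    closed = []
    for run in runs:
        if run.status != "active":
            continue
        holder_alive = (
            run.lock_owner_pid is not None and is_alive(run.lock_owner_pid)
        )
        if not holder_alive:
            # The owning process was killed or exited without finalizing.
            run.status = "finished"
            closed.append(run.run_hash)
    return closed
```

A hung process is still alive and still holds its lock, so a sweep like this leaves its run active. That is the one case such a check cannot resolve.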

Jul 10 '24 23:07 mihran113