Some runs are marked as active while the training process has finished

Open gorarakelyan opened this issue 2 years ago • 1 comments

🐛 Bug

Issue with detecting finished runs. Some runs are marked as active/in-progress, even if the training process has finished or has been terminated.

Expected behavior

Properly detect when the training process is finished and finalize the corresponding Aim Run:

  • set run.active property to False
  • set run finalization time

Environment

  • Aim Version - Aim v3.13.2

gorarakelyan avatar Sep 16 '22 15:09 gorarakelyan

This might depend on the setup, particularly on the workflow automation tool. I have come across cases where workload managers (e.g. Slurm) simply kill the process with SIGKILL (signal 9).
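To illustrate why SIGKILL is a problem here: a process can register a handler for SIGTERM and clean up before exiting, but the operating system does not allow a handler for SIGKILL, so no finalization code gets a chance to run. The snippet below is a minimal, standard-library-only sketch of that difference on a POSIX system. The helper name can_install_handler is made up for this illustration and is not part of Aim.

```python
import signal


def can_install_handler(signum: int) -> bool:
    """Return True if this process may register a Python handler for signum.

    Hypothetical helper for illustration only; assumes a POSIX system and
    that it is called from the main thread.
    """
    def _handler(received_signum, frame):
        # Placeholder: a real training script would finalize its run here.
        pass

    try:
        previous = signal.signal(signum, _handler)
    except (OSError, ValueError):
        # The OS refuses handlers for uncatchable signals such as SIGKILL.
        return False
    # Restore the previous handler so the probe leaves no side effects.
    if previous is not None:
        signal.signal(signum, previous)
    return True


if __name__ == "__main__":
    print("SIGTERM catchable:", can_install_handler(signal.SIGTERM))
    print("SIGKILL catchable:", can_install_handler(signal.SIGKILL))
```

If the workload manager can be configured to send SIGTERM first and allow a grace period, the process at least has a chance to finalize the run itself. Whether that is possible depends on the scheduler setup.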

gorarakelyan avatar Sep 16 '22 15:09 gorarakelyan

Had the same issue here. I just load the Run manually and make it inactive by hand:

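import aim
# Re-open the stale run by its hash, pointing repo at the Aim repository.
run = aim.Run(run_hash='HASH_OF_THAT_EXPERIMENT', repo='...')
# Deleting the object lets Aim finalize the run, which marks it inactive.
del run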

we-taper avatar Oct 18 '22 14:10 we-taper

It would be cool if aim storage reindex could fix this, since this is, in essence, a stalled run, which reindex claims to detect and fix.

Pyrestone avatar Oct 20 '22 08:10 Pyrestone

@we-taper @Pyrestone the fix has been shipped with Aim v3.15. Could you please upgrade to the latest version and check if it works as expected?

gorarakelyan avatar Dec 09 '22 17:12 gorarakelyan

Closing this issue as the fix was shipped. Please feel free to reopen it if the issue reappears.

gorarakelyan avatar Feb 10 '23 12:02 gorarakelyan