aim
Some runs are marked as active while the training process has finished
🐛 Bug
Issue with detecting finished runs. Some runs are marked as active/in-progress, even if the training process has finished or has been terminated.
Expected behavior
Properly detect when the training process is finished and finalize the corresponding Aim Run:
- set the run.active property to False
- set run finalization time
Environment
- Aim Version - Aim v3.13.2
This might depend on the setup, and particularly on the workflow automation tool. I have come across cases where workload managers (e.g. Slurm) just kill the process with SIGKILL (signal 9), which gives the process no chance to clean up.
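To illustrate why SIGKILL is the hard case: a process can react to SIGTERM and finalize its run before exiting, but the OS refuses to install any handler for SIGKILL. Below is a minimal sketch, assuming a POSIX system and the Python standard library only. The names `install_sigterm_finalizer`, `can_be_handled`, and the `finalize` callback are hypothetical and not part of Aim; in a training script, `finalize` would be whatever closes the run.

```python
import signal
import sys


def install_sigterm_finalizer(finalize):
    """Run `finalize` when SIGTERM arrives, then exit with the usual 128+N code.

    `finalize` is a hypothetical zero-argument callback (e.g. one that closes the run).
    """
    def handler(signum, frame):
        finalize()
        sys.exit(128 + signum)

    # signal.signal() may only be called from the main thread.
    signal.signal(signal.SIGTERM, handler)
    return handler


def can_be_handled(signum):
    """Return True if a Python-level handler can be installed for `signum`."""
    previous = signal.getsignal(signum)
    try:
        signal.signal(signum, lambda s, f: None)
    except (OSError, ValueError):
        # SIGKILL (and SIGSTOP) cannot be caught, blocked, or ignored.
        return False
    # Restore whatever was installed before the probe.
    signal.signal(signum, previous if previous is not None else signal.SIG_DFL)
    return True
```

With this, `can_be_handled(signal.SIGTERM)` succeeds while `can_be_handled(signal.SIGKILL)` does not, so a run killed with SIGKILL has to be detected and finalized from outside the training process.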
Had the same issue here. I just load the Run manually and make it inactive by hand:
import aim
run = aim.Run(run_hash='HASH_OF_THAT_EXPERIMENT', repo='...')
del run
Would be cool if aim storage reindex could fix this, as it is, in essence, a stalled run, which reindex claims to detect and fix.
@we-taper @Pyrestone the fix has been shipped with Aim v3.15. Could you please upgrade to the latest version and check if it works as expected?
Closing this issue as the fix was shipped. Please feel free to reopen it if the issue reappears.