
Cannot delete runs

Open rdilip opened this issue 1 year ago • 6 comments

🐛 Bug

When I click on delete run from the UI, I get a popup "Error: cannot delete run."

To reproduce

First time AIM user so open to suggestions.

Expected behavior

Environment

  • Aim 3.17.5
  • Python 3.10.11
  • pip 23.1.2
  • Linux

Additional context

Restarting doesn't help; if there's some sort of telemetry log someone can point me to, that'd be quite helpful.

rdilip avatar Jul 28 '23 05:07 rdilip

Hey @rdilip!

Do you see any errors or warnings in the terminal running the aim up command? One possible reason is that the run is still in progress. A run must be properly closed before it can be deleted, to prevent possible data corruption and/or tracking process failures.

alberttorosyan avatar Jul 28 '23 07:07 alberttorosyan

Hey @alberttorosyan, thanks for the quick response. I get an error saying that the run is locked. How do I unlock it? I think it would be good to have this documented somewhere, since just skimming the docs I don't find anything.

rdilip avatar Jul 28 '23 16:07 rdilip

Hey @rdilip, within your .aim folder, you should find a locks folder that contains softlocks for each locked hash. So all you need to do is identify the hashes and remove the locks that are no longer in use.

However, I find this process rather inconvenient. When I use my debugger in VSCode and stop it (due to an error or for whatever reason), the run.close() function is not called. As a result, each of my aborted runs remains visible. These runs are unimportant for my experiments and should be easy to delete, but the locks get in the way. Even though the lock makes it look as if the run is still active, the UI indicates that it's finished (though 'aborted' would be a more accurate term here).

@alberttorosyan maybe it makes sense to periodically check whether a run is still receiving updates and auto-unlock it after some time? I did not look into the code, but I guess it's a doable feature and would greatly improve the cleanup process. I could help by coming up with a PR.
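For anyone who wants to see which hashes currently hold a lock before removing anything, here is a minimal sketch. It is read-only and rests on assumptions not confirmed in this thread: that the locks folder sits directly under .aim, and that each lock file's name starts with the run hash followed by an extension. The function name list_locked_hashes is made up for illustration and is not part of Aim.

```python
from pathlib import Path


def list_locked_hashes(repo_dir=".aim"):
    """Return the sorted run hashes that appear to have a lock file.

    Assumption: lock files live in <repo_dir>/locks and are named
    "<run_hash>.<extension>". Nothing is deleted here.
    """
    locks_dir = Path(repo_dir) / "locks"
    if not locks_dir.is_dir():
        return []
    hashes = set()
    for entry in locks_dir.iterdir():
        if not entry.is_file():
            continue
        # Keep only the part of the file name before the first dot.
        stem = entry.name.split(".", 1)[0]
        if stem:
            hashes.add(stem)
    return sorted(hashes)


if __name__ == "__main__":
    for run_hash in list_locked_hashes():
        print(run_hash)
```

Check the printed hashes against runs you know are dead before touching any lock file, since a lock may belong to a run that is still writing.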

mauricekraus avatar Aug 08 '23 06:08 mauricekraus

@mauricekraus, thanks for the detailed description and for the suggestion! We do have a mechanism to detect stalled runs, which is used to index the runs data (for performance reasons). However, we do not remove the locks automatically, since that might lead to a situation where more than one process tries to write to the same run and introduces data inconsistencies. Instead, the manual action of closing runs should be used:

    aim runs close <RUN_HASH> <RUN_HASH> ...

Once this is done, it's safe to delete or resume the run.
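If there are many hashes to close, the command can be assembled from a list rather than typed by hand. The following is a rough sketch only: build_close_command and close_runs are hypothetical helper names, and the only Aim-specific piece is the aim runs close invocation quoted above. It assumes the aim executable is on PATH and that the script is started from the directory containing the repository.

```python
import shlex
import subprocess


def build_close_command(run_hashes):
    """Build the argument list for `aim runs close <RUN_HASH> ...`."""
    hashes = [h.strip() for h in run_hashes if h and h.strip()]
    if not hashes:
        raise ValueError("at least one run hash is required")
    return ["aim", "runs", "close", *hashes]


def close_runs(run_hashes, dry_run=True):
    """Return the command as a string; execute it only when dry_run is False."""
    cmd = build_close_command(run_hashes)
    if not dry_run:
        # Assumption: `aim` is installed and on PATH in this environment.
        subprocess.run(cmd, check=True)
    return shlex.join(cmd)


if __name__ == "__main__":
    # Placeholder hashes; replace with real ones from your repository.
    print(close_runs(["<RUN_HASH_1>", "<RUN_HASH_2>"]))
```

The dry-run default prints the command without running it, so the list of hashes can be reviewed first.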

alberttorosyan avatar Aug 08 '23 07:08 alberttorosyan

Okay, I understand your point, but why shouldn't it be possible to delete runs from the UI either way? Data inconsistencies don't matter if the run is going to be deleted anyway. Currently, deleting a run is somewhat annoying to me.

I think throwing (and catching) an exception because the run is no longer present is a risk worth taking.

mauricekraus avatar Aug 08 '23 08:08 mauricekraus

@alberttorosyan Would it be possible to add something to the UI that allows us to unlock runs without having to inspect the hashes? Honestly, the biggest use case for me is that I often have a bunch of 5-second runs that immediately hit an error, or where I notice something isn't right, and a nice thing about aim is that I can do a query like run.duration < 30 and (not run.active). It would be a huge win for me personally if I could just mass-delete all of these without too much extra parsing.

rdilip avatar Aug 15 '23 18:08 rdilip