AIM Client Process Termination Leaves Run in Active State

Open zhiyxu opened this issue 1 year ago • 12 comments

🐛 Bug

When the AIM client process is killed, the corresponding run remains in the "In Progress" state indefinitely. I expect the run to transition to the "Finished" state upon client termination.

To reproduce

  1. Create a Python script test-aim.py with the following content:
import time
from aim import Run

run = Run(repo='aim://10.66.142.35:8082', experiment='default')
run['hparams'] = {
    'learning_rate': 0.001,
    'batch_size': 32,
}

for i in range(1000):
    time.sleep(1)
    print(f'{i}')
    run.track(i+2, step=i, epoch=i%2, name='metrics-1')
  2. Run the script in the background:
$ nohup python test-aim.py &
  3. Kill the process:
$ kill 365103
[1]+  Terminated              nohup python test-aim.py

Expected behavior

The run should transition to the "Finished" state after the client process is terminated.

Environment

  • Aim Version: 3.22.0
  • Python version: 3.10.9
  • pip version: 24.0
  • OS (e.g., Linux): Ubuntu 20.04.4 LTS

Additional context

Is there any workaround for this issue?
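
One possible client-side mitigation, sketched here as an assumption and not confirmed by the maintainers in this thread: a plain `kill` sends SIGTERM, and Python's default reaction to SIGTERM is to exit immediately without running any cleanup code. Installing a handler that calls `close()` on the run before exiting gives the client a chance to finalize it. `close_on_sigterm` is a hypothetical helper name, and whether `Run.close()` is enough to mark a remote-tracked run as finished is untested here. This cannot help with SIGKILL (`kill -9`) or with an out-of-memory kill, because those signals cannot be caught.

```python
import signal
import sys


def close_on_sigterm(resource):
    """Install a SIGTERM handler that closes `resource` before exiting.

    `resource` is any object with a close() method. The intended use
    (an assumption, not verified against an Aim server) is to pass the
    aim.Run instance created by the training script.
    """
    def _handler(signum, frame):
        try:
            resource.close()
        finally:
            # 128 + signal number is the conventional shell exit status
            # for a process that was terminated by a signal.
            sys.exit(128 + signum)

    signal.signal(signal.SIGTERM, _handler)
    return _handler


# Hypothetical usage inside test-aim.py, right after creating the run:
#   run = Run(repo='aim://10.66.142.35:8082', experiment='default')
#   close_on_sigterm(run)
```

The handler must be installed from the main thread, since Python only allows `signal.signal` to be called there.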

zhiyxu avatar Jun 25 '24 07:06 zhiyxu

@mihran113 Could you please give some comments, thx in advance!

zhiyxu avatar Jun 25 '24 07:06 zhiyxu

Hi. Is there any update on this issue? I have the exact same issue, which forces me to restart my remote server more often than I would like. Cheers, Diogo

diogo-sr avatar Jul 18 '24 12:07 diogo-sr

I also have this issue

Wingmore avatar Aug 06 '24 07:08 Wingmore

We use huggingface to train models and when our training jobs run out of memory the associated AIM Runs are left active. When the number of these runs becomes "too large" the AIM Dashboard stops working.

VassilisVassiliadis avatar Nov 25 '24 08:11 VassilisVassiliadis

Hey folks. Sorry for leaving this unattended for so long. Can I ask whether the runs stay in the "In Progress" state indefinitely, or whether they transition to the "Finished" state after a while?

This is how we handle things: we clean up a client's resources after 30 minutes, because there are cases where the client and server lose their connection and then re-establish it. Once the client's resources are cleaned up, a background thread of the aim up command looks for runs that are left in the active state, moves them to the finished state, and indexes them (which improves query performance). We can clean up the resources after a shorter period if that would help, but I just want to make sure that's the issue here and not something else.
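
To make the described behavior concrete, here is a minimal sketch of a timeout-based sweep. It is an illustration only, not Aim's actual implementation: the `Tracker` class, its fields, and the `heartbeat` and `sweep` methods are all hypothetical names, the two stages (resource cleanup and the background thread) are merged into one step, and indexing is left out.

```python
from dataclasses import dataclass, field

# The 30-minute grace period mentioned above, in seconds.
CLIENT_TIMEOUT_SECONDS = 30 * 60


@dataclass
class Tracker:
    # client id -> time (in seconds) of the last message from that client
    last_seen: dict = field(default_factory=dict)
    # client id -> ids of the runs that client opened
    runs_by_client: dict = field(default_factory=dict)
    # run id -> "active" or "finished"
    run_state: dict = field(default_factory=dict)

    def heartbeat(self, client_id, run_id, now):
        """Record that `client_id` sent data for `run_id` at time `now`."""
        self.last_seen[client_id] = now
        self.runs_by_client.setdefault(client_id, set()).add(run_id)
        self.run_state.setdefault(run_id, "active")

    def sweep(self, now):
        """Drop clients silent for the full timeout and finish their runs.

        Returns the sorted ids of the runs moved to "finished".
        """
        finished = []
        for client_id, seen in list(self.last_seen.items()):
            if now - seen < CLIENT_TIMEOUT_SECONDS:
                continue  # the client may still reconnect
            del self.last_seen[client_id]
            for run_id in self.runs_by_client.pop(client_id, set()):
                if self.run_state.get(run_id) == "active":
                    self.run_state[run_id] = "finished"
                    finished.append(run_id)
        return sorted(finished)
```

Under this model, a run whose client was killed stays "active" until the timeout has elapsed and the next sweep runs, so a short period of falsely active runs is expected; runs that stay active for weeks would point to something else.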

mihran113 avatar Dec 25 '24 13:12 mihran113

Hi @mihran113 I was on PTO and just got back.

We have a DB with about 22k runs, of which AIM reports roughly 1100 as still active; they're not. The phantom active ones are from November/December of last year.

That said, I did notice that our AIM server was at v3.19.3 so I updated it to 3.27.0 just now. I'll let it do its thing and check it again in 18 hours to see if there're still any active runs. Then I'll report back here.

VassilisVassiliadis avatar Jan 08 '25 11:01 VassilisVassiliadis

I updated the AIM server to v3.27.0 and redeployed it. Of the ~1100 falsely-active runs there's now just 1 left. This makes me think that the AIM server is working as you explained @mihran113 and that single experiment is just being annoying. I'll probably archive/delete it

VassilisVassiliadis avatar Jan 10 '25 14:01 VassilisVassiliadis

We're about to run a bunch more experiments and I'll be keeping an eye on their status. If I notice any such misbehaving runs, I'll report them back here.

VassilisVassiliadis avatar Jan 10 '25 14:01 VassilisVassiliadis

Hi @mihran113 we started seeing this again with aim==3.29.1. We have an AIM server with about 22k aim experiments in it and roughly 250 archived ones.

We see about 150 that are in the "active" state even though the actual jobs completed weeks ago. To make matters worse, we can still create new experiments and they eventually close, but when we try to open an individual run's page in AIM we get popup errors in the top right corner with the following messages:

  • Run not found
  • Body has already been consumed
  • Body is disturbed or locked

I noticed that we had 10 "corrupted" runs in the DB, which I deleted. Unfortunately, that didn't make a difference. Is there something else I can try?

VassilisVassiliadis avatar Jun 03 '25 14:06 VassilisVassiliadis

I am experiencing the exact same issue with 3.29.1 as well, although I have far fewer experiments (~500).

alexjwilliams avatar Jun 17 '25 20:06 alexjwilliams