AIM Client Process Termination Leaves Run in Active State
🐛 Bug
When the AIM client process is killed, the corresponding run remains in the "In Progress" state indefinitely. I expect the run to transition to the "Finished" state upon client termination.
To reproduce
- Create a Python script test-aim.py with the following content:
import time
from aim import Run
run = Run(repo='aim://10.66.142.35:8082', experiment='default')
run['hparams'] = {
    'learning_rate': 0.001,
    'batch_size': 32,
}
for i in range(1000):
    time.sleep(1)
    print(f'{i}')
    run.track(i+2, step=i, epoch=i%2, name='metrics-1')
- Run the script in the background:
$ nohup python test-aim.py &
- Kill the process:
$ kill 365103
[1]+ Terminated nohup python test-aim.py
Expected behavior
The run should transition to the "Finished" state after the client process is terminated.
Environment
- Aim Version: 3.22.0
- Python version: 3.10.9
- pip version: 24.0
- OS (e.g., Linux): Ubuntu 20.04.4 LTS
Additional context
Is there any workaround for this issue?
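One client-side option that may be worth trying for the plain `kill` case (which sends SIGTERM by default) is to turn the signal into a normal interpreter exit, so that cleanup code still gets a chance to run. The following is a minimal sketch, not tested against a remote Aim server; `exit_on_sigterm` and `run_with_cleanup` are hypothetical helper names introduced here for illustration, not part of Aim.

```python
import signal
import sys


def exit_on_sigterm(signum, frame):
    # Raising SystemExit unwinds the stack, so `finally` blocks and
    # atexit hooks run instead of the process dying immediately.
    sys.exit(128 + signum)


def run_with_cleanup(work, cleanup):
    """Call work(), then always call cleanup(), even if SIGTERM arrives.

    Hypothetical helper: `work` and `cleanup` are caller-supplied callables.
    Must be called from the main thread, since signal.signal() requires it.
    """
    signal.signal(signal.SIGTERM, exit_on_sigterm)
    try:
        work()
    finally:
        cleanup()
```

With the script above, `work` would be the tracking loop and `cleanup` would be `run.close`, assuming that closing the run is enough for the server to mark it as finished. This cannot help when the process receives SIGKILL (for example `kill -9` or the kernel OOM killer), because that signal cannot be caught.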
@mihran113 Could you please give some comments, thx in advance!
Hi. Is there any update on this issue? I have the exact same issue, which forces me to restart my remote server more often than I would like. Cheers, Diogo
I also have this issue
We use Hugging Face to train models, and when our training jobs run out of memory, the associated AIM runs are left active. When the number of these runs becomes "too large", the AIM Dashboard stops working.
Hey folks. Sorry for leaving this unattended for so long. Can I ask whether the runs stay in the "In Progress" state indefinitely, or whether they transition to the "Finished" state after a while?
This is how we handle things:
We clean up a client's resources after 30 minutes, because there are cases where the client and server lose their connection and then re-establish it. Once the client's resources are cleaned up, a background thread of the aim up command looks for runs that are left in the active state, moves them to the finished state, and also does the indexing (which improves query performance).
We can clean up the resources after a shorter period if that would help, but I just want to make sure that's the issue here and not anything else.
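To make the behavior described above concrete, here is a toy model of it. This is only an illustrative sketch and not Aim's actual server code: `RunRecord`, `sweep_stale_runs`, and `CLIENT_TIMEOUT_SECONDS` are hypothetical names, and the sketch folds the resource cleanup and the background sweep into a single step for brevity.

```python
from dataclasses import dataclass

# Assumption: the 30-minute grace period mentioned above, in seconds.
CLIENT_TIMEOUT_SECONDS = 30 * 60


@dataclass
class RunRecord:
    """Hypothetical bookkeeping entry for one run."""
    run_hash: str
    last_seen: float  # timestamp of the last message from the client
    active: bool = True


def sweep_stale_runs(runs, now, timeout=CLIENT_TIMEOUT_SECONDS):
    """Mark active runs as finished once their client has been silent
    for at least `timeout` seconds; return the hashes that were changed."""
    finished = []
    for run in runs:
        if run.active and now - run.last_seen >= timeout:
            run.active = False
            finished.append(run.run_hash)
    return finished
```

Under this model, a run whose client was killed stays "In Progress" until the timeout has elapsed and the next sweep runs, so a shorter `timeout` trades faster cleanup against tolerance for temporary disconnects.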
Hi @mihran113 I was on PTO and just got back.
We have a DB with about 22k runs, of which AIM reports roughly 1100 as still active. They're not. The phantom active ones are from November/December of last year.
That said, I did notice that our AIM server was at v3.19.3 so I updated it to 3.27.0 just now. I'll let it do its thing and check it again in 18 hours to see if there're still any active runs. Then I'll report back here.
I updated the AIM server to v3.27.0 and redeployed it. Of the ~1100 falsely active runs, there's now just 1 left. This makes me think that the AIM server is working as you explained, @mihran113, and that single experiment is just being annoying. I'll probably archive/delete it.
We're about to run a bunch more experiments, and I'll be keeping an eye on their status. If I notice any such misbehaving runs, I'll report them back here.
Hi @mihran113 we started seeing this again with aim==3.29.1. We have an AIM server with about 22k aim experiments in it and roughly 250 archived ones.
We see about 150 that are in the "active" state even though the actual jobs completed weeks ago. To make matters worse, we're also observing that we can create new experiments and they eventually close, but when we try to access the individual AIM run page, we get popup errors in the top right corner with the following messages:
- Run not found
- Body has already been consumed
- Body is disturbed or locked
I noticed that we had 10 "corrupted" runs in the DB, which I deleted. Unfortunately, that didn't make a difference. Is there something else I can try?
I am experiencing the exact same issue with 3.29.1, although I have far fewer experiments (~500).