Vassilis Vassiliadis
@hongjun0619 check this: datashim-io/datashim#60 - you may have to modify the `sed` lines a bit.
Hi @liyancn I just tested the example you provided and after making a small change (#58) it worked, thank you!
We use Hugging Face to train models, and when our training jobs run out of memory the associated AIM Runs are left active. When the number of these runs becomes "too...
Hi @mihran113 I was on PTO and just got back. We have a DB with about 22k runs, of which AIM reports roughly 1100 as still active - they're...
I updated the AIM server to v3.27.0 and redeployed it. Of the ~1100 falsely-active runs there's now just 1 left. This makes me think that the AIM server is working...
We're about to run a bunch more experiments and I'll be keeping an eye on their status. If I notice any such misbehaving runs I'll report them back here.
Hi @mihran113 we started seeing this again with aim==3.29.1. We have an AIM server with about 22k aim experiments in it and roughly 250 archived ones. We see about 150...
Hello @SGevorg I'm also facing this issue. I'd like to store all the system metrics from my multiple workers when using accelerate in a single AIM run. Instead, I observe...
@mihran113 are there any plans for aggregating the system metrics of multiple nodes under a single AIM run?
I opened a PR here: https://github.com/aimhubio/aim/pull/3284 - let me know if you're interested in it.