Aim UI is slow
🐛 Bug
- After a few weeks of production usage we've found the Aim UI to be painfully slow. Sometimes it can take over 1 minute to load 50 runs. We're getting pretty worried about what this will look like after a few months.
- We want to upgrade to 3.17.4 to see if there's an improvement, but Aim's breaking compatibility between versions means we have to ping staff and wait for them before we can update the version.
- Is there a guide on the optimal configuration? Minimum CPU, RAM, workers, etc.?
To reproduce
- Do some logging.
- Try to load the UI. Wait anywhere from 5-60 seconds; sometimes it doesn't load at all.
- See the CPU ramp-up below during run page load (capture sketch after the samples). It all seems to be in user space.
avg-cpu: %user %nice %system %iowait %steal %idle
8.04 0.00 5.03 0.00 0.00 86.93
--
avg-cpu: %user %nice %system %iowait %steal %idle
5.00 0.00 4.00 0.00 0.00 91.00
--
avg-cpu: %user %nice %system %iowait %steal %idle
3.92 0.00 4.90 0.00 0.00 91.18
--
avg-cpu: %user %nice %system %iowait %steal %idle
6.19 0.00 7.22 0.52 0.00 86.08
--
avg-cpu: %user %nice %system %iowait %steal %idle
10.42 0.00 10.42 2.08 0.00 77.08
--
avg-cpu: %user %nice %system %iowait %steal %idle
51.52 0.00 6.06 0.00 0.00 42.42
--
avg-cpu: %user %nice %system %iowait %steal %idle
51.52 0.00 4.04 0.00 0.00 44.44
--
avg-cpu: %user %nice %system %iowait %steal %idle
49.25 0.00 5.53 0.00 0.00 45.23
--
avg-cpu: %user %nice %system %iowait %steal %idle
49.75 0.00 6.53 0.00 0.00 43.72
--
avg-cpu: %user %nice %system %iowait %steal %idle
52.02 0.00 6.57 0.00 0.00 41.41
--
avg-cpu: %user %nice %system %iowait %steal %idle
51.02 0.00 4.08 0.00 0.00 44.90
--
avg-cpu: %user %nice %system %iowait %steal %idle
52.82 0.00 4.10 0.00 0.00 43.08
--
avg-cpu: %user %nice %system %iowait %steal %idle
48.74 0.00 5.03 0.00 0.00 46.23
--
avg-cpu: %user %nice %system %iowait %steal %idle
46.19 0.00 3.55 0.00 0.00 50.25
--
avg-cpu: %user %nice %system %iowait %steal %idle
10.66 0.00 7.11 2.03 0.00 80.20
--
avg-cpu: %user %nice %system %iowait %steal %idle
35.57 0.00 11.34 0.52 0.00 52.58
--
avg-cpu: %user %nice %system %iowait %steal %idle
39.90 0.00 4.15 0.00 0.00 55.96
--
avg-cpu: %user %nice %system %iowait %steal %idle
2.05 0.00 2.56 0.00 0.00 95.38
--
avg-cpu: %user %nice %system %iowait %steal %idle
2.02 0.00 3.03 0.00 0.00 94.95
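For reference, the samples above are plain iostat CPU reporting from inside the pod. A minimal capture sketch, assuming the sysstat package is available in the container and a hypothetical namespace/deployment name:

```bash
# Hypothetical namespace and deployment name; adjust to your cluster.
# iostat -c prints CPU-only utilization; the trailing 5 emits a sample every 5 seconds.
kubectl exec -n aim deploy/aim-ui -- iostat -c 5
```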
Expected behavior
- Aim UI not slow.
Environment
- Aim Version 3.17.3
- Kubernetes
- 2 CPU and 2 GB RAM
- EFS-backed storage
@dcarrion87 thanks for opening this issue. Any chance you could share some dimensions for how much you have logged? Feel free to DM me if it cannot be shared publicly.
~270 metrics per run, logged once per epoch over 20-30 epochs. The potential exception is loss metrics, which are logged more frequently.
Any insight into recommended specs, worker configurations, etc. would also help. Not sure what the optimal configuration is for the UI.
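Besides the series counts, one way to quantify "how much is logged" is the on-disk size of the Aim repository. A rough sketch, assuming a hypothetical repo path of /data/.aim (ours sits on EFS) and the usual chunked layout of Aim 3.x repos:

```bash
# Total size of the Aim repo.
du -sh /data/.aim
# Per-run breakdown of the metadata/series chunks, largest first.
# Directory layout may differ slightly between Aim versions.
du -sh /data/.aim/meta/chunks/* /data/.aim/seqs/chunks/* 2>/dev/null | sort -rh | head
```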
@dcarrion87, upgrading to Aim 3.17.4 should be OK. There are no data format changes, but you might need to stop/start tracking if you are using remote tracking.
w.r.t. the dimensions:
> Sometimes it can take over 1 minute to load 50 runs

Do you see the slowness in the Runs Explorer page or the Metrics Explorer?

> ~270 metrics per run, logged once per epoch over 20-30 epochs. The potential exception is loss metrics, which are logged more frequently.

How many runs do you have? And how many of those are marked as "active"?
@alberttorosyan
- We can't upgrade yet because it breaks remote tracking clients unless they're also on 3.17.4. We're waiting for users to upgrade their clients, which takes time. This is the only product we use where the server isn't backward compatible with clients across patch releases.
- Slowness is in both the Runs Explorer and Metrics Explorer pages.
- 50-60 runs. We archived some, but we don't want to have to chase users to clean up; by that stage it's too late, the UI is too slow for them to operate, and we end up going in manually to help them. We could quite easily see this reach hundreds or thousands of runs at some point.

What are the recommended specs? Do you have a table that maps metric and run counts to compute requirements, worker configuration, etc. needed for the UI to operate normally?
@alberttorosyan
We have runs sitting in the active state forever, by the looks of it, even archived ones. I'm not sure why they never get marked as closed. From an operational perspective it's quite frustrating.
@dcarrion87, actually the requirements to run Aim UI are quite modest. Even with the specs you've mentioned (2 CPU, 2 GB RAM) it should handle queries normally, especially with such a small number of runs. The problem is at the storage level; the locking issue prevents Aim from optimizing the storage and indexing run data. Given that migrating all the clients to the latest version isn't possible right now, I can suggest the following workaround (sketched in commands after the list):
- Create an isolated Python environment (virtualenv, etc.) and install the latest Aim.
- Run Aim UI from the "new" environment.
- Leave the tracking server and clients running on Aim 3.17.3.
- As there's no data format change, this should work.
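A command-level sketch of that workaround, with assumed paths (/opt/aim-ui-venv, /data/.aim) and port; adjust to your deployment:

```bash
# Isolated environment used only for serving the UI.
python3 -m venv /opt/aim-ui-venv
/opt/aim-ui-venv/bin/pip install "aim>=3.17.4"

# Serve the UI from the new environment, pointing at the existing repo.
# Tracking server and clients stay on 3.17.3; the on-disk format is unchanged.
/opt/aim-ui-venv/bin/aim up --repo /data/.aim --host 0.0.0.0 --port 43800
```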
> We have runs sitting in the active state forever, by the looks of it, even archived ones. I'm not sure why they never get marked as closed. From an operational perspective it's quite frustrating.

Sorry to hear that; the Aim 3.17.4 patch release targets these specific issues. Given the blocker you have for upgrading, the workaround above seems to be the best option for you at this moment.
It's not urgent; users are dealing with it at the moment. We have a hard cutoff on Monday to upgrade to 3.17.4 whether users have updated their clients or not. We'll see how it performs then. We may do it earlier.
Just an update: it's now usable but still a bit slow to load. Teams have paused using it for now and reverted to their old experiment tracking tools due to multi-gRPC issues whose root cause we haven't identified yet.
We have been experiencing similar issues on 3.17.5 as well, deployed on Kubernetes, with a few hundred runs and around 3k metrics tracked overall. UI pages (Runs most affected) take forever to load or report an error.
In addition, our Kubernetes deployment spec does not set a CPU limit, but Aim still ends up using only a single CPU, which might be causing (additional) issues.
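On the single-CPU observation: Aim's web API is served by uvicorn, and a single worker process will only saturate one core. Whether the worker count can be raised from the CLI depends on the Aim version, so treat the flag below as an assumption and check `aim up --help` first:

```bash
# Verify the flag exists in your Aim build before relying on it.
aim up --help | grep -i worker
# If supported, run several API workers so requests can spread across cores.
aim up --repo /data/.aim --host 0.0.0.0 --port 43800 --workers 4
```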