
Aim UI is slow

Open · dcarrion87 opened this issue 1 year ago · 9 comments

🐛 Bug

  • After a few weeks of production usage we've found the Aim UI to be painfully slow. Sometimes it can take over 1 minute to load 50 runs. We're getting pretty worried about what this is going to look like after a few months.

  • We want to upgrade to 3.17.4 to see if there's an improvement, but Aim's breaking compatibility between versions means we have to ping staff and wait for them before we can update the version.

  • Is there a guide on the optimal configuration? Minimum CPU, RAM, workers, etc.?

To reproduce

  • Do some logging.
  • Try to load the UI. Wait anywhere from 5 to 60 seconds; sometimes it doesn't load at all.
  • See the CPU ramp-up below during the run page load; it all seems to be in user space. A sketch of how such samples can be captured follows the output.
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.04    0.00    5.03    0.00    0.00   86.93
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.00    0.00    4.00    0.00    0.00   91.00
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.92    0.00    4.90    0.00    0.00   91.18
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.19    0.00    7.22    0.52    0.00   86.08
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.42    0.00   10.42    2.08    0.00   77.08
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          51.52    0.00    6.06    0.00    0.00   42.42
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          51.52    0.00    4.04    0.00    0.00   44.44
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          49.25    0.00    5.53    0.00    0.00   45.23
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          49.75    0.00    6.53    0.00    0.00   43.72
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          52.02    0.00    6.57    0.00    0.00   41.41
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          51.02    0.00    4.08    0.00    0.00   44.90
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          52.82    0.00    4.10    0.00    0.00   43.08
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          48.74    0.00    5.03    0.00    0.00   46.23
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          46.19    0.00    3.55    0.00    0.00   50.25
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.66    0.00    7.11    2.03    0.00   80.20
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          35.57    0.00   11.34    0.52    0.00   52.58
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          39.90    0.00    4.15    0.00    0.00   55.96
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.05    0.00    2.56    0.00    0.00   95.38
--
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.02    0.00    3.03    0.00    0.00   94.95
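The capture command isn't shown above; the following is a minimal sketch, assuming iostat from the sysstat package and that the "--" separators come from grep, of how similar per-interval CPU samples can be collected:

  # Print a CPU-only utilization report every 5 seconds while the run page loads,
  # keeping just the avg-cpu header and value lines (grep -A prints the "--"
  # separators between matches, as seen in the samples above).
  iostat -c 5 | grep -A 1 avg-cpu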

Expected behavior

  • Aim UI not slow.

Environment

  • Aim version 3.17.3
  • Kubernetes
  • 2 CPU and 2 GB RAM
  • EFS-backed storage

dcarrion87 avatar May 17 '23 02:05 dcarrion87

@dcarrion87 thanks for opening this issue. Any chance you could share some dimensions for how much you have logged? Feel free to DM me if it cannot be shared publicly.

SGevorg avatar May 17 '23 06:05 SGevorg

> @dcarrion87 thanks for opening this issue. Any chance you could share some dimensions for how much you have logged? Feel free to DM me if it cannot be shared publicly.

~270 metrics per run, logged once per epoch over 20-30 epochs. The potential exception is loss metrics, which are logged more frequently.

Any insight into recommended specs, worker configurations, etc. would also help. We're not sure what the optimal configuration is for the UI.

dcarrion87 avatar May 17 '23 07:05 dcarrion87

@dcarrion87, upgrading to Aim 3.17.4 should be OK. There are no data format changes, but you might need to stop/start tracking if you are using remote tracking.
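For reference, a minimal sketch of such an in-place upgrade, assuming a pip-based install; the repo path and the presence of a remote tracking server are assumptions:

  # Upgrade the package in the environment that serves the UI / tracking server
  pip install --upgrade "aim==3.17.4"

  # Restart the long-running processes so they pick up the new version
  aim up --repo /path/to/aim-repo        # web UI
  aim server --repo /path/to/aim-repo    # remote tracking server, if used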

W.r.t. dimensions:

> Sometimes it can take over 1 minute to load 50 runs.

Do you see the slowness on the Runs Explorer page or the Metrics Explorer?

> ~270 metrics per run, logged once per epoch over 20-30 epochs. The potential exception is loss metrics, which are logged more frequently.

How many runs do you have? And how many of those are marked as "active"?

alberttorosyan avatar May 17 '23 07:05 alberttorosyan

@alberttorosyan

  • We can't upgrade yet because it breaks remote tracking clients unless they're also on 3.17.4. We're waiting for users to upgrade their clients, which can be a teething process. This is the only product we use where the server is not backward compatible with clients on patch releases.

  • Slowness on both the Runs Explorer and Metrics Explorer pages.

  • 50-60 runs. We archived some, but we don't want to have to chase users to clean up; by that stage it's too late, the UI is too slow for them to operate, and we're going in manually to help them. We could quite easily see this reach hundreds to thousands of runs at some point.

What are the recommended specs? Do you have a table that details metric counts, run numbers, etc. against the compute requirements and worker configuration needed for the UI to operate normally?

dcarrion87 avatar May 17 '23 07:05 dcarrion87

@alberttorosyan

We have runs sitting in the active state forever by the looks of it, even archived ones. I'm not sure why they are never marked as closed. From an operational perspective it's quite frustrating.

dcarrion87 avatar May 17 '23 07:05 dcarrion87

@dcarrion87, actually the requirements to run the Aim UI are quite modest. Even with the specs you've mentioned (2 CPU, 2 GB RAM) it should handle queries normally, especially with such a small number of runs. The problem is at the storage level: the locking issue prevents optimizing the storage and indexing run data. Given that migrating all the clients to the latest version is not possible right now, I can suggest the following workaround:

  • Create an isolated Python environment (virtualenv, etc.) and install the latest Aim.
  • Run the Aim UI from the "new" environment.
  • Leave the tracking server and clients running on Aim 3.17.3.
  • As there's no data format change, this should work; a rough sketch of these steps is below.
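A rough sketch of that workaround, assuming a pip-based install; the environment location, repo path, host and port are placeholders:

  # 1. Isolated environment with the latest Aim, used only for serving the UI
  python3 -m venv /opt/aim-ui-env
  /opt/aim-ui-env/bin/pip install --upgrade aim

  # 2. Serve the UI from the new environment, pointing at the existing repo
  /opt/aim-ui-env/bin/aim up --repo /data/aim-repo --host 0.0.0.0 --port 43800

  # 3. The existing tracking server and clients stay untouched on Aim 3.17.3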

> We have runs sitting in the active state forever by the looks of it, even archived ones. I'm not sure why they are never marked as closed. From an operational perspective it's quite frustrating.

Sorry to hear that; the Aim 3.17.4 patch release targets these specific issues. Given the blocker you have for upgrading, the workaround above seems to be the best option for you at the moment.

alberttorosyan avatar May 17 '23 07:05 alberttorosyan

It's not urgent; users are dealing with it at the moment. We have a hard cut-off on Monday to upgrade to 3.17.4 whether users have updated their clients or not. We'll see how it performs then. We may do it earlier.

dcarrion87 avatar May 17 '23 07:05 dcarrion87

Just an update: it's now usable but still a bit slow to load. Teams have paused using it for the moment and reverted to their old experiment tracking tools due to issues with multi gRPC that we haven't found the root cause of yet.

dcarrion87 avatar May 30 '23 04:05 dcarrion87

We have been experiencing similar issues on 3.17.5 as well, deployed on Kubernetes, with a few hundred runs and around 3k metrics monitored overall. UI pages (Runs being the most affected) take forever to load or report an error.

In addition, the Kubernetes deployment specs do not set a CPU limit, but Aim still ends up using only a single CPU, which might be causing (additional) issues.
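A couple of hedged checks that can help confirm this, assuming the UI runs in a Deployment named aim in namespace aim with an app=aim label (all three names are placeholders):

  # Inspect the CPU requests/limits actually applied to the container spec
  kubectl -n aim get deployment aim -o jsonpath='{.spec.template.spec.containers[0].resources}'

  # Watch live CPU usage of the pod (requires metrics-server)
  kubectl -n aim top pod -l app=aim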

hstojic avatar Sep 27 '23 09:09 hstojic