
Some macos agents are slow. More than 2x slower

Open andreineculau opened this issue 3 years ago • 14 comments

Description

Similar to https://github.com/actions/virtual-environments/issues/2707 , I am noticing slower macOS agents. It happens at random, but often: 10-25% of the runs are slow.

A good run can "uninstall homebrew" (i.e. mostly disk i/o) in 184 seconds https://github.com/rokmoln/support-firecloud/runs/3295490478?check_suite_focus=true#step:3:190

A slow run can "uninstall homebrew" about 2.6 times slower, in 483 seconds https://github.com/rokmoln/support-firecloud/runs/3295490501?check_suite_focus=true#step:3:190

Similarly, reinstalling homebrew (disk, network and cpu bound) happens 1.8 times slower, in 593 seconds https://github.com/rokmoln/support-firecloud/runs/3295490501?check_suite_focus=true#step:3:447 instead of 333 seconds https://github.com/rokmoln/support-firecloud/runs/3295490478?check_suite_focus=true#step:3:456

Overall, my builds time out even with a timeout of roughly twice the normal duration (normally 23 minutes, timeout 45 minutes).
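
For reference, the slowdown factors implied by the timings above can be recomputed directly. This is only a minimal sketch: the step labels and the `slowdown` helper are illustrative names, not part of the original workflow.

```python
# Timings in seconds as quoted in the report above: (good run, slow run).
# The step labels are illustrative, not taken from the workflow file.
timings = {
    "uninstall homebrew": (184, 483),
    "reinstall homebrew": (333, 593),
}


def slowdown(good: float, slow: float) -> float:
    """Return how many times longer the slow run took than the good run."""
    return slow / good


for step, (good, slow) in timings.items():
    print(f"{step}: {slowdown(good, slow):.1f}x slower")
```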

As requested by @miketimofeev , I have a repro workflow here https://github.com/andreineculau/actions-ve-repro-2707 , the same as @smorimoto used in #2707.

You can see a good run ending after 1m30s and running brew install systembench in 29s https://github.com/andreineculau/actions-ve-repro-2707/runs/3298362909?check_suite_focus=true as opposed to a slow run ending after 3m09s and running brew install systembench in 2m13s https://github.com/andreineculau/actions-ve-repro-2707/runs/3298362887?check_suite_focus=true .

Virtual environments affected

  • [ ] Ubuntu 16.04
  • [ ] Ubuntu 18.04
  • [ ] Ubuntu 20.04
  • [X] macOS 10.15
  • [X] macOS 11
  • [ ] Windows Server 2016
  • [ ] Windows Server 2019

Image version and build link

20210801.1

Is it regression?

No response

Expected behavior

Consistent times. I guess a <25% deviation is expected, but not >100%.
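
One way to make this expectation concrete is to compare each run against a baseline and flag anything beyond a tolerance. A rough sketch, assuming the baseline is the median of recent run durations (my assumption, not something the runner provides). `classify_run` is a hypothetical helper, the thresholds come from the 25% and 100% figures above, and all history values except 184 are made up for illustration.

```python
from statistics import median


def classify_run(duration_s: float, history_s: list[float]) -> str:
    """Classify a run by its relative deviation from the median of past runs."""
    baseline = median(history_s)
    deviation = (duration_s - baseline) / baseline
    if deviation <= 0.25:   # within the deviation considered acceptable above
        return "expected"
    if deviation <= 1.0:    # slower than hoped, but less than 2x the baseline
        return "slow"
    return "anomalous"      # more than 2x the baseline


# Hypothetical history; only 184 s is a figure from the report.
history = [184.0, 190.0, 178.0, 200.0]
print(classify_run(483.0, history))
```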

Actual behavior

More than 2x slower run times.

Repro steps

https://github.com/andreineculau/actions-ve-repro-2707/blob/85c642afe6c8d19f3e30bfa51de7d2075d6a5414/.github/workflows/workflow.yml

andreineculau avatar Aug 11 '21 09:08 andreineculau

Hey, @andreineculau I'll take a look and will return with new information soon :) Thanks for reporting!

LeonidLapshin avatar Aug 11 '21 09:08 LeonidLapshin

I modified my workflow a bit in order to better separate network from disk I/O.

brew update in 2m31s https://github.com/andreineculau/actions-ve-repro-2707/runs/3299170641?check_suite_focus=true vs 41s https://github.com/andreineculau/actions-ve-repro-2707/runs/3299170575?check_suite_focus=true

You can see, though, how the numbers point towards disk I/O: e.g. brew update in 1m12s but brew install sysbench (with no network I/O!) in 57s https://github.com/andreineculau/actions-ve-repro-2707/runs/3299207551?check_suite_focus=true vs 28s and 19s in https://github.com/andreineculau/actions-ve-repro-2707/runs/3299207510?check_suite_focus=true

andreineculau avatar Aug 11 '21 09:08 andreineculau

One last comment before I put this to rest: I wanted to print some system info, and hit a very big difference: 1m26s vs 4m43s

https://github.com/andreineculau/actions-ve-repro-2707/runs/3299223287?check_suite_focus=true vs https://github.com/andreineculau/actions-ve-repro-2707/runs/3299223358?check_suite_focus=true

A command like system_profiler SPSoftwareDataType SPDeveloperToolsDataType ran in 2s or 1m9s !!!
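
To collect this kind of per-command timing without reading it off the Actions log, a step could wrap each command in a small timer. A minimal sketch using only the Python standard library. `time_command` is a hypothetical helper, and the `system_profiler` invocation in the comment is the one mentioned above (macOS only), so a trivial portable command is timed here instead.

```python
import subprocess
import sys
import time


def time_command(argv: list[str]) -> float:
    """Run a command, discard its output, and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(
        argv,
        check=True,  # raise if the command exits with a non-zero status
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


# On a macOS runner one could pass, for example:
#   ["system_profiler", "SPSoftwareDataType", "SPDeveloperToolsDataType"]
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"elapsed: {elapsed:.2f}s")
```

Printing the elapsed time for the same command on every run makes slow agents easy to spot when comparing jobs.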

andreineculau avatar Aug 11 '21 10:08 andreineculau

Hey, @andreineculau We have investigated this issue and created some internal tasks to fix it. We'll update this ticket when we get new information :) Thanks!

LeonidLapshin avatar Aug 13 '21 09:08 LeonidLapshin

I haven't noticed flaky performance for a while, but I'm wondering whether a cause has been found and a permanent fix pushed, so that this issue could be closed. Thanks.

andreineculau avatar Sep 04 '21 12:09 andreineculau

@andreineculau some environments were fixed and some are still in progress, so the performance can be flaky.

miketimofeev avatar Sep 06 '21 10:09 miketimofeev

@miketimofeev thanks for the update! Just to be sure, there isn't anything one can do as part of the job configuration as a local fix, right?

andreineculau avatar Sep 06 '21 11:09 andreineculau

@andreineculau unfortunately, yes. It's all about underlying infrastructure at the moment.

miketimofeev avatar Sep 06 '21 11:09 miketimofeev

Performance is so slow that any CPU-bound task, like caching, is rendered totally useless. Unpacking a 120 MB file takes longer than actually running all the different npm i that create the node_modules directories (4x as slow).

JJ avatar Jul 20 '22 07:07 JJ

"@andreineculau unfortunately, yes. It's all about underlying infrastructure at the moment." (quoting @miketimofeev)

Is it possible that the runners using the C# runtime is what makes them perform poorly in a Mac environment?

JJ avatar Jul 20 '22 07:07 JJ

@JJ That's unlikely. The runner only executes system commands, so the overhead in that part is too small for a human to notice, and the cause is clear from my past investigations. I'm a little confused as to why this problem still exists. See https://github.com/actions/virtual-environments/issues/2707

smorimoto avatar Jul 20 '22 14:07 smorimoto

What a rabbit hole of closed issues. Related issues:

https://github.com/actions/runner-images/issues/2707
https://github.com/actions/runner-images/issues/1336
https://github.com/actions/runner-images/issues/6547

Control (ubuntu-22.04): [two images]

macOS 12: [two images]

macOS 11: [one image]

Anyone got an update on the state of this work?

sidekick-eimantas avatar Nov 10 '22 19:11 sidekick-eimantas

On top of the observed fluctuations in performance (sometimes a Homebrew installation takes 10 seconds, sometimes the same installation takes 5 minutes), I also observed file I/O to be very slow recently and the Actions UI itself being unresponsive:

The runtime counter just stops, no visible log output is displayed and it can take minutes until any output is shown - I do however experience the same issue with other runners and the UI on the whole: It gets "stuck" showing a single job as active even though it has finished and doesn't update the state of other jobs.

PatTheMav avatar Nov 15 '22 11:11 PatTheMav

We are on Azure and have been having the same issues recently. The performance of the macOS VMs has never been great, but recently (the last 1-2 weeks) it has dropped to a new low. We have runners that run into the 60-minute timeout, runners that no longer update the UI (even after a refresh etc.), and jobs that never get a runner (these time out too, even though runners are available). I/O performance is at a new low as well: build tasks for an app that used to take around 10 min now take around 15-20 min, if they don't just die in the middle of the task.

Please MS/GH fix your macos setups, this is not usable anymore.

dockay avatar Nov 21 '22 15:11 dockay

Random ~100x regression on CPU heavy tasks. Here's an example, where this only happened for one Mac job (the others have normal performance): https://github.com/connorjclark/ZeldaClassic/actions/runs/3690258663/jobs/6247167952 Hope this information helps.

connorjclark avatar Dec 14 '22 00:12 connorjclark

@miketimofeev - is there any update on this? I use macos-latest, and downloading a (quite small) 26 MB file sometimes takes >9 minutes. This network problem is causing some builds to be much slower than others.
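
To put that figure in perspective, the implied transfer rate can be worked out directly. A back-of-the-envelope sketch, assuming the file size is in decimal megabytes (1 MB = 1000 kB) and taking 9 minutes as a lower bound on the duration. `throughput_kb_per_s` is an illustrative helper name.

```python
def throughput_kb_per_s(size_mb: float, seconds: float) -> float:
    """Average transfer rate in kB/s, assuming 1 MB = 1000 kB."""
    return size_mb * 1000 / seconds


# 26 MB in 9 minutes (540 s); longer downloads mean an even lower rate.
rate = throughput_kb_per_s(26, 9 * 60)
print(f"at most about {rate:.0f} kB/s")
```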

niall-shaw avatar Jan 11 '23 08:01 niall-shaw

@miketimofeev Any updates on this?

smorimoto avatar Aug 24 '23 18:08 smorimoto

We are still experiencing performance issues with Xcode builds on GitHub runners.

jozefizso avatar Sep 14 '23 14:09 jozefizso

Any updates on this issue? It is still a problem, and it seems to get worse if you use Xcode 15 for testing and compiling.

We have seen cache download tasks that take more than double the time it would take to compile every dependency. We have also seen the overall agent time fluctuate by as much as 2x for the same job.

christiangal-indi avatar Sep 21 '23 09:09 christiangal-indi

Hey all! We have done a lot of work on image optimisation and have received many positive reports about image speed. The most significant improvement can now be seen on the macOS 13 runners (both Intel and M1), and macOS 12 has also been sped up thanks to some hardware modifications. We hope most customers will find the current situation suitable for their needs. I'm going to close this ticket now, but we are always glad to hear feedback.

mikhailkoliada avatar Dec 08 '23 18:12 mikhailkoliada

Separately, I should note that the performance of Xcode 15 on macOS-13 is tracked in another issue, https://github.com/actions/runner-images/issues/7971 , and is unrelated to this problem or to hardware in general.

erik-bershel avatar Dec 08 '23 18:12 erik-bershel

Nothing seems to be fixed. For many things that need I/O performance, macOS Runner still has the worst performance.

smorimoto avatar Dec 09 '23 14:12 smorimoto