Checkout suddenly much slower on Windows environment
Moved from https://github.com/actions/runner-images/issues/7180
Description
For the past few days, the duration of the actions/checkout@v3 step on windows-2019 has increased dramatically.
This behavior is seen on all my repos (all private). Below is a table showing an example of before/after.
| Checkout step | Feb 17th | Feb 21st |
|---|---|---|
| 1 | 13s | 1m35s |
| 2 | 8s | 47s |
The result is a huge increase in build (and billable) time.


The GitHub status page does show some issues around this time frame, but these were all marked as resolved:

Platforms affected
- [ ] Azure DevOps
- [X] GitHub Actions - Standard Runners
- [ ] GitHub Actions - Larger Runners
Runner images affected
- [ ] Ubuntu 18.04
- [ ] Ubuntu 20.04
- [ ] Ubuntu 22.04
- [ ] macOS 11
- [ ] macOS 12
- [X] Windows Server 2019
- [ ] Windows Server 2022
Image version and build link
Current runner version: '2.301.1'
Operating System
Microsoft Windows Server 2019
10.0.17763
Private repo
Is it regression?
Yes (sorry, private repos)
Expected behavior
The build times should be fairly constant.
Actual behavior
Build times explode, burning through our build minutes too fast.
Repro steps
Compare build times on any Windows environment from before Feb 18th with today.
I have observed this too. Example job: https://github.com/Tyrrrz/CliWrap/actions/runs/4271628026
- Windows – checkout took 4m9s:

- Ubuntu – checkout took <1s:

- macOS – checkout took 3s:

Note that this behavior is fairly inconsistent, and checkout usually completes quickly on Windows. But when it is slow, it's only ever slow on Windows.
I've encountered this and debugged it a bit by heavily instrumenting the transpiled code in `dist/index.js`. It turns out that there are multiple PowerShell calls that each stall for dozens of seconds. Here is the example run I analyzed:
- here, 1 minute and 14 seconds are spent,
- here, 11 seconds are spent,
- here, 12 seconds are spent,
- here, 11 seconds are spent,
all on this PowerShell call:
`(Get-CimInstance -ClassName Win32_OperatingSystem).caption`
totaling a whopping 1 minute and 48 seconds just to determine a Windows release that the actions/checkout action has no need to know.
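If you want to observe that stall directly on an affected runner, a minimal sketch of a workflow step that times the very same query (the step itself is mine, not part of actions/checkout):

```yaml
# Hypothetical diagnostic step (not part of actions/checkout): time the CIM
# query that windows-release runs when the User-Agent string is built.
- name: Time the Win32_OperatingSystem query
  shell: pwsh
  run: |
    Measure-Command {
      (Get-CimInstance -ClassName Win32_OperatingSystem).Caption
    } | Select-Object TotalSeconds
```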
To make things even worse, the same PowerShell invocation also happens multiple times during the post-action:
I do not really understand what changed between Feb 17th and 21st that would explain this slowdown. There has not been a new actions/checkout release since Jan 5th...
I have briefly seen success trying to hard-code the fix from https://github.com/sindresorhus/windows-release/pull/18 into `dist/index.js`, but it does not seem to work once I undelete all the core logic. Sigh.
At this point, I have to tend to other things, but I thought I'd leave my current findings here in case somebody else can take over (and try things like overriding os.release() to always return 2022 or something like that, before letting @octokit/endpoint initialize its User-Agent string).
I made an attempt to avoid calling into windows-releases from the checkout action in #1246.
If you want to try you can use
- uses: BrettDong/checkout@octokit
and see if there is any improvement in time stalled between starting the checkout action and performing actual git operations.
@BrettDong excellent! I tested this and the times are back to decent levels: the entire checkout step takes 11 seconds with your PR. Thank you so much!
The fix in #1246 reduces stalled time down to 3 seconds.

During the 3 seconds the workflow is stalled on loading the node.exe from disk to memory for execution. I don't think there is anything I can do to get rid of it.
I have had a similar issue with large runners with slow checkout and cleanup that I reported to GitHub Support. They concluded that it is related to this issue, even though I am not completely convinced.
The screenshot from @Tyrrrz earlier in this issue also shows a slow post checkout (cleanup).
Workflow:
To reproduce this with a minimal setup, I created a new repository with only a single workflow file, spinning up three jobs on default and large runners:
```yaml
name: Demo Slow Checkout

on:
  workflow_dispatch:

permissions:
  contents: read

jobs:
  doSomething:
    name: My Job
    strategy:
      matrix:
        os: [windows-latest, windows-latest-8core, windows-latest-16core]
    runs-on: ${{ matrix.os }}
    steps:
      - name: Checkout
        uses: actions/checkout@8f4b7f84864484a7bf31766abe9204da3cbe65b3 # v3.5.0
        # https://github.com/actions/checkout/releases
        with:
          # Fetch full depth
          fetch-depth: 0
      - name: Work for 15 seconds
        run: Start-Sleep -Seconds 15
```
In addition to checkout, the job has only one other step: sleeping for 15 seconds.
Results:
The jobs were executed 10 times, and the results show that:
- Checkout
  - Normal time is 19 seconds
  - Large runners take more than twice as long (between 47 and 51 seconds)
- Post Checkout
  - Normal time is 3 seconds
  - Large runners take about 15 times longer (between 46 and 49 seconds)
ℹ️ The last row is the median, not the average, so that no single slow run skews the result.
Findings:
Finding 1:
Every single checkout on the large runners is at least twice as slow as on a regular runner, and all of the extra time passes before the actual checkout starts:

Finding 2:
The post checkout (cleanup) is on average 15 times slower than on a regular runner, and again all of the time passes before any cleanup starts:

Finding 3:
The simple sleep task on the regular runner takes twice the sleep interval. How is it even possible that sleeping for 15 seconds takes almost double the time? This was done with a simple `run: Start-Sleep -Seconds 15`.
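One way to tell whether that extra time is spent inside the step or in runner overhead around it is to measure the sleep from within the step and compare it to the step duration the runner reports. A minimal sketch (mine, not part of the workflow above):

```yaml
# If TotalSeconds is ~15 while the runner reports ~30s for the step, the
# difference is runner overhead around the step, not the sleep itself.
- name: Timed sleep
  shell: pwsh
  run: Measure-Command { Start-Sleep -Seconds 15 } | Select-Object TotalSeconds
```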

You can build something like https://github.com/BrettDong/Cataclysm-DDA/blob/etw/.github/workflows/etw.yml to collect ETW traces on the runner and diagnose what is happening and where the time is spent during checkout.
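For reference, a trimmed sketch of what such a trace-collection job could look like; the WPR profiles and the artifact upload below are my assumptions, not a copy of the linked etw.yml:

```yaml
# Hypothetical ETW collection around checkout using Windows Performance Recorder.
- name: Start ETW trace
  shell: pwsh
  run: wpr -start CPU -start FileIO

- name: Checkout
  uses: actions/checkout@v3

- name: Stop ETW trace
  if: always()
  shell: pwsh
  run: wpr -stop trace.etl

- name: Upload trace
  if: always()
  uses: actions/upload-artifact@v3
  with:
    name: etw-trace
    path: trace.etl
```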
@BrettDong
During the 3 seconds the workflow is stalled on loading the node.exe from disk to memory for execution. I don't think there is anything I can do to get rid of it.
This seems like an interesting point in and of itself, because in https://github.com/actions/runner-images/issues/7320, while we report actions/checkout issues, we see much bigger problems in workflows that rely heavily on disk access (e.g. cache hits during test runs).
So if I understand your conclusion correctly, narrowing this down to a disk-access issue (or something that limits the effective disk-access rate) is consistent with what we are seeing.
I have had a similar issue with large runners with slow checkout and cleanup that I reported to GitHub Support. They concluded that it is related this issue, even though I am not completely convinced.
Same here - we've noticed that large GitHub-managed Windows runners are substantially slower during checkout. This is not a recent regression for us though - they've been (nearly unusably) slow for months.
| Runner | Checkout time |
|---|---|
| Self hosted Linux | 40s |
| small GitHub-managed Windows (windows-2022) | 2m30s |
| large GitHub-managed Windows | 12m |
We also have a ticket with GitHub Support, and I've been running experiments for our repo / workflows at https://github.com/openxla/iree/pull/12051.
Great news - build times on the Windows environment are back to normal! This was fixed in actions/checkout@v3.5.1.
Thanks @BrettDong and @fhammerl

FYI the issue that @ScottTodd and I are seeing on the large Windows managed runners was not fixed by this update. We tested it prior to release at the direction of GitHub support:
- yaml file (same location): https://github.com/ScottTodd/iree/blob/ci-windows-checkout-debug/.github/workflows/ci.yml
- logs: https://github.com/openxla/iree/actions/runs/4610834430
Seems like it may be a separate issue, but just wanted to call it out since these issues seem like they were maybe merged. Seems like this is also something that @Gakk is hitting. My understanding from support is that they're still investigating this other problem. It may be worth opening a separate issue for this or leaving this open.
https://github.com/actions/checkout/pull/1246#issuecomment-1499276093
try: https://github.com/actions/checkout/issues/1186#issuecomment-1484896561
Seems like this is also something that @Gakk is hitting.
@GMNGeoffrey, I have done extensive testing and confirmed that my issues were resolved by actions/checkout version 3.5.1.
Yep, I'm still seeing substantially slower checkouts on large runners (could break that out into a different issue, and we have a support ticket for it). Latest experiments on https://github.com/openxla/iree/pull/12051, logs at https://github.com/openxla/iree/actions/runs/4748455722/jobs/8434667258. Our repo depends on https://github.com/llvm/llvm-project/ (very large) and a few other submodules, and just the checkout takes ~1 minute on small Windows runners but 7+ minutes on large Windows runners. We've tried all sorts of ways to change the git commands used (sparse checkouts, shallow clones, caching of git files, etc.) but can't get past whatever the differences are between the runners themselves.
As mentioned in a comment above, can you try this? https://github.com/BrettDong/Cataclysm-DDA/blob/etw/.github/workflows/etw.yml
I don't have access to larger runners currently to test myself.
I'm still seeing substantially slower checkouts on large runners
Just a wild guess: could it be that large runners have slower D: drives than smaller runners? IIRC the hosted runners specifically have very fast D: drives.
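If anyone wants to sanity-check that guess, a rough step like the one below times a sequential write to the workspace drive (the D: drive letter and file size are assumptions, and OS caching makes this a smoke test rather than a real benchmark):

```yaml
# Hypothetical quick check: write 256 MB to D: and report how long it took.
- name: Rough D drive write check
  shell: pwsh
  run: |
    $data = New-Object byte[] 256MB
    Measure-Command { [IO.File]::WriteAllBytes('D:\bench.tmp', $data) } |
      Select-Object TotalSeconds
    Remove-Item 'D:\bench.tmp'
```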
I've just verified that the issue is not with the runners themselves, but rather with actions/checkout. Using just normal git commands in bash, I did a full checkout with submodules in 1m30s, compared to almost 10m previously. By dropping actions/checkout and actions/cache and rolling our own fairly unsophisticated replacements, I've been able to drop a maximally cached build from 20m (and that's when fetching the cache doesn't just fail entirely) to 9m. That is just really pretty sad.
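For anyone who wants to try the same thing, checking out with plain git commands in a run step looks roughly like the sketch below; the token handling and depth settings are simplified assumptions, not our exact workflow:

```yaml
# Rough, hand-rolled equivalent of actions/checkout.
- name: Manual checkout
  shell: bash
  run: |
    git init .
    git remote add origin "https://x-access-token:${{ github.token }}@github.com/${{ github.repository }}.git"
    git fetch --depth=1 origin "${{ github.sha }}"
    git checkout --quiet FETCH_HEAD
    # Public submodules only; private ones need extra credential setup.
    git submodule update --init --depth=1 --jobs 4
```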
You can collect ETW traces to help diagnose what's happening and where the time goes during the checkout action.
You can collect ETW traces to help diagnose what's happening and where the time goes during the checkout action.
As mentioned in a comment above, can you try this? https://github.com/BrettDong/Cataclysm-DDA/blob/etw/.github/workflows/etw.yml
I don't have access to larger runners currently to test myself.
Yeah, but presumably so can the GitHub engineers who support says are working to fix this. Like, IDK, it kind of seems to me that the people who wrote this code, control these VMs, and whom we are paying for this service, could maybe take a look at the issues with it.
@GMNGeoffrey I would like to encourage you to consider the current macroeconomic climate, and also just how large the roadmap is. And please also note how @BrettDong's effort was rewarded: I am sure that you, too, will get what you want much quicker if you dig a little deeper with those ETW traces. I would even consider helping, but I do not have access to those large runners; you do, though.
~~👋 hey sorry for not seeing this one, we are tracking/providing commentary on our investigation on the Windows larger runner checkout here: oops that's the internal issue and no good to anyone~~
Ok that first update was all kinds of wrong let me try again!
Sorry we still haven't commented on this ticket; we are tracking it internally. We have made some changes to the Windows VM image, but only recently, and they don't appear to have helped. With everything else going on we have had to put this one aside for the last couple of weeks, but we are committed to fixing it. I will re-open this ticket, as it is linked in the issue we are tracking internally :)
Thanks Ben. Further runs suggest that my switch to use git commands directly instead of actions/checkout was just lucky the first few times (or the computer learned what I was trying to do and put a stop to it :stuck_out_tongue:). Subsequent runs have had similar latency to before the switch, I think (I started hacking together a script to collect statistics for jobs over time, but got side-tracked, so pure anecdata right now). So I'm back to thinking it's the VM+git itself and not the action. I am sort of considering getting tarballs for all of the submodules instead of using git... I'll update if that seems to be faster somehow (which would suggest to me something git-specific and not just IO or network issues)
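The tarball idea would look roughly like this per submodule (the repository path and pin are placeholders, and this drops the .git metadata, so it only works for read-only checkouts):

```yaml
# Hypothetical: fetch a submodule snapshot as a tarball instead of cloning it.
- name: Fetch llvm-project snapshot as a tarball
  shell: bash
  run: |
    mkdir -p third_party/llvm-project
    curl -sSL "https://codeload.github.com/llvm/llvm-project/tar.gz/<pinned-commit>" \
      | tar -xzf - --strip-components=1 -C third_party/llvm-project
```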
Thanks @GMNGeoffrey for having a go! (and sorry that the computers are learning 😆 ) Let me know if it's faster and we will hopefully have our focus back on this in the next week or so as things settle (also turns out I am not a maintainer on this repo, I will find someone to re-open this for me :D)
Just a reminder: given the evidence in https://github.com/actions/runner-images/issues/7320, there is almost certainly an underlying issue that is not specific to git at all, but rather affects all operations that place high demands on disk (and CPU).
Re-opening at @nebuk89's request, so we can track our efforts externally as we investigate further. Some valuable context built up in this thread 😄 .
I'm not using these runners, but if the OS is showing high CPU on disk access, perhaps it's due to a host disk caching setting applied by Azure to the VM disk (see https://learn.microsoft.com/en-us/azure/virtual-machines/disks-performance): while host disk caching benefits some modes, it can also add a penalty.
Not sure if this helps, since we are running self-hosted GitLab, but I started looking for a solution because our Windows runners are incredibly slow. Simple build jobs (for instance, just running MSBuild on a solution) that finish in less than 1 minute when run manually on the same machine take over an hour when run as a gitlab-runner job. The very same script is executed, with no manual deviation between the two procedures. Further potentially helpful details:
- We have runners on virtual machines and on physical ones - makes no difference, both are concerned
- Runners use Windows 10
- There is no load to be seen - not on CPU, memory, or disk; everything is silent and idle
- So far I have failed to pinpoint the delays to any particular step. E.g. even the output of MSBuild is tremendously slowed down: lines that normally appear within fractions of a second trickle in only every other minute
It almost sounds like the runner is slow at (or blocking on) reading stderr/stdout from the console windows of the processes it has launched, which in turn blocks those processes from advancing.
We are discontinuing our use of GH managed windows runners. The costs were already beyond premium/sustainable, and the performance is so poor that the issue compounds out of control. I don't consider this a viable way to run CI for any business.
I can tolerate a lot but not at massively inflated prices.