Extremely slow cache on self-hosted runners from time to time

Open levonet opened this issue 2 years ago • 43 comments

Loading the cache usually takes a few seconds, but from time to time it takes a very long time. I suspect this has something to do with slow GitHub API responses, but I can't prove it yet.

I would like some transparency about what is happening when the action reports something like Received 0 of XXX (0.0%), 0.0 MBs/sec. That would let us isolate and fully solve the problem, or at least find a workaround.

levonet avatar Jan 27 '22 09:01 levonet

I have been facing a similar situation.

akushneruk avatar Jan 29 '22 21:01 akushneruk

Seeing the same behaviour on self-hosted (GCP, europe-west3) and hosted runners. Unfortunately, it's not consistent and debugging the hosts when it's occurring doesn't point to anything on the boxes themselves.

es avatar Jan 30 '22 03:01 es

In our case the severity rose to critical because it affected the operation of our CD pipeline. Our self-hosted runners run in GCP southamerica-east1.

levonet avatar Jan 31 '22 09:01 levonet

Experiencing the same issue on GCP. I had to implement a timeout on the whole job because the cache action routinely takes 10+ minutes for 30 MB.

SophisticaSean avatar Feb 02 '22 02:02 SophisticaSean

Our workaround

    - name: Cache mvn repository
      uses: actions/cache@v2
      timeout-minutes: 5
      continue-on-error: true
      with:
        path: ~/.m2/repository
        key: cache-m2-${{ hashFiles('pom.xml') }}
        restore-keys: |
          cache-m2-
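
The idea is that if the restore does not finish within five minutes the step is cancelled, and continue-on-error lets the job proceed with a cold cache instead of hanging.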

levonet avatar Feb 04 '22 22:02 levonet

@levonet @akushneruk @SophisticaSean @es Are you still seeing the same behaviour on self-hosted runners? If yes, can one of you please enable debugging and share the log file here? Thank you.

vsvipul avatar May 11 '22 06:05 vsvipul

There have been no problems in the last month.

levonet avatar May 12 '22 12:05 levonet

Hey, we are seeing this happening fairly regularly with our customers at BuildJet for GitHub Actions.

Most of the reports in this thread are from people self-hosting, but the issue doesn't seem to be unique to self-hosting; the hosted runners have the same problem. I don't know whether you have more insight into workflow runs that use native runners, but here is one example from Google Chrome Lighthouse where the actions/cache@v3 step fails exactly as described in this thread.

As we have a lot of traffic running on our infrastructure, we really notice when bugs happen in commonly used actions, and we'd be more than happy to provide you with insight into how users' runs are failing. You can reach me at [email protected]

@vsvipul

adamshiervani avatar May 12 '22 16:05 adamshiervani

@adamshiervani The workflow you pointed to ran 18 days ago. Can you share anything that happened in the last week? A single instance of slowness might be an outlier, in which case an upgrade or downtime might have occurred. If this happens every few days, I'd like to know more about the frequency.

vsvipul avatar May 16 '22 17:05 vsvipul

We've been experiencing this intermittently on hosted builds (not self-hosted). When it happens, the cache takes 12+ hours to retrieve. It would be nice if it were possible to set a timeout to deal with this.

txtsd avatar May 17 '22 11:05 txtsd

@vsvipul - we are heavy users of BuildJet and this action, and this is probably the most common issue we face.

Here are some runs for you: https://github.com/WaterfoxCo/Waterfox/runs/7544425566?check_suite_focus=true#step:5:83 https://github.com/WaterfoxCo/Waterfox/runs/7543708504?check_suite_focus=true#step:5:436

We get this issue every week, as we push out a lot of builds. There's no rhyme or reason to when it occurs.

A lot are missing because I ended up deleting the workflow logs - there were just too many.

MrAlex94 avatar Jul 27 '22 16:07 MrAlex94

we are heavy users of BuildJet and this action, and this is probably the most common issue we face.

We're also users of BuildJet, bless their offerings, and had to disable caching in some of our jobs because of this.

mfn avatar Sep 27 '22 08:09 mfn

This has the consequence of incurring additional runner costs if not limited by a timeout.

There is a newer issue open, https://github.com/actions/cache/issues/810, and an environment variable SEGMENT_DOWNLOAD_TIMEOUT_MIN was introduced (https://github.com/actions/cache/issues/810#issuecomment-1211766031), which was later reported as not working (https://github.com/actions/cache/issues/810#issuecomment-1258700696).

This just happened on a GitHub-hosted runner, which failed to download 3 GB within 1 hour (I was not aware of the issue, so I didn't have any remediation settings in place); the download kept slowing down more and more: https://github.com/marcindulak/jhalfs-ci/actions/runs/3210575787/jobs/5249037125

...
Received 2143289343 of 3843618074 (55.8%), 0.6 MBs/sec
Warning: Failed to restore: Aborting cache download as the download time exceeded the timeout.
Cache not found for input keys: something

"Re-run failed" executed several minutes after the original failure, proceeded without the cache download slowdown, and finished the cache download in 2 minutes.

...
Received 3843618074 of 3843618074 (100.0%), 103.4 MBs/sec
Cache Size: ~3666 MB (3843618074 B)

marcindulak avatar Oct 08 '22 18:10 marcindulak

I've also had this happen sporadically on GitHub-hosted runners and pretty consistently on BuildJet runners.

r-medina avatar Nov 10 '22 21:11 r-medina

We now use actions/cache@v3 with the timeout suggested in https://github.com/actions/cache/issues/720#issuecomment-1030412967.

I cannot recall having seen the issue (builds failing due to this timeout) in a while, perhaps the last few months.

Never mind. The issue appeared today, with cache@v3. Luckily our macOS build timed out at this step after 10min rather than the full 60min we allow for the entire workflow.

CallThemAndSayHello avatar Nov 11 '22 12:11 CallThemAndSayHello

We see this with actions/cache@v3 all the time; unless you add a short timeout to every cache definition, jobs back up waiting forever.

sammcj avatar Dec 07 '22 03:12 sammcj

This has become a significant issue for us. Currently around 50% of our actions time out at ~98%, which is particularly painful because our download is relatively large (850 MB), so we have to wait a while before we even know whether a build will time out. We will have to migrate to a different solution if we don't find a resolution, as we cannot rely on manually monitoring CI runs.

definelicht avatar Dec 14 '22 09:12 definelicht

In the cases where this problem occurred for us, the restore would, according to the logs, go from 0% to say 60% very quickly and then stall or trickle the rest of the way. I never saw a run like the one in the screenshot https://user-images.githubusercontent.com/862951/206079348-b522b45f-9fde-404c-aa64-0c64cd335dd0.png actually complete. My impression is that if a restore is going to complete, it does so quickly and right away; once it starts reporting much slower progress, no amount of waiting will suffice. So we looked at how long our restores normally take and set timeouts accordingly: either the restore finishes within, say, 2 minutes, or it will not finish at all (it will not take 30 minutes; if it starts taking that long, it is one that hangs).

(basically once it stops making progress, it will not complete)

CallThemAndSayHello avatar Dec 14 '22 10:12 CallThemAndSayHello

We now use actions/cache@v3 with the timeout suggested in #720 (comment).

This workaround does not work for us; timeout-minutes doesn't seem to be a supported argument here (perhaps because the step is inside a composite action rather than a workflow):

Error: /.github/actions/foo/action.yml (Line: 38, Col: 5): Unexpected value 'timeout-minutes'
Error: GitHub.DistributedTask.ObjectTemplating.TemplateValidationException: The template is not valid. /runner/_work/foo/foo/./.github/actions/foo/action.yml (Line: 38, Col: 5): Unexpected value 'timeout-minutes'
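
For what it's worth, timeout-minutes is accepted only on workflow steps, not on the steps inside a composite action's action.yml, which is why the validation error above appears. A minimal sketch of one possible workaround, assuming the composite action lives at ./.github/actions/foo as in the error message, is to put the timeout on the workflow step that invokes the action:

    # In the calling workflow (not in action.yml):
    - name: Foo
      uses: ./.github/actions/foo
      timeout-minutes: 10        # bounds the entire composite action, including its internal cache restore
      continue-on-error: true    # fall back to a cold cache instead of failing the job

Note that this limits the whole composite action, not just the cache step inside it.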

definelicht avatar Dec 14 '22 12:12 definelicht

Expanding the example:

name: Caching Primes

on: push

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3

    - name: Cache Primes
      id: cache-primes
      uses: actions/cache@v3
      timeout-minutes: 10                   # <---------
      with:
        path: prime-numbers
        key: ${{ runner.os }}-primes

    - name: Generate Prime Numbers
      if: steps.cache-primes.outputs.cache-hit != 'true'
      run: /generate-primes.sh -d prime-numbers

    - name: Use Prime Numbers
      run: /primes.sh -d prime-numbers

CallThemAndSayHello avatar Dec 14 '22 13:12 CallThemAndSayHello

@definelicht

We have added an environment variable, SEGMENT_DOWNLOAD_TIMEOUT_MINS, to time out the download once a segment download exceeds a certain amount of time. You can refer to the README for more info.
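
For example, a minimal sketch reusing the Maven cache step from the workaround earlier in this thread (the 10-minute value is only illustrative; see the README for the default):

    - name: Cache mvn repository
      uses: actions/cache@v3
      env:
        SEGMENT_DOWNLOAD_TIMEOUT_MINS: 10   # abort a stalled segment download after 10 minutes
      with:
        path: ~/.m2/repository
        key: cache-m2-${{ hashFiles('pom.xml') }}
        restore-keys: |
          cache-m2-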

kotewar avatar Jan 05 '23 06:01 kotewar

@adamshiervani The workflow you pointed to ran 18 days ago. Can you share anything that happened in the last week? A single instance of slowness might be an outlier, in which case an upgrade or downtime might have occurred. If this happens every few days, I'd like to know more about the frequency.

@vsvipul this happens to us pretty often, at least a few times a week. We also use BuildJet. One case where we recently ran into this was the setup-node step in this build, if it helps you debug: https://github.com/stainless-api/stainless/actions/runs/4103099342/jobs/7076935062 . In that workflow run, it happened in at least two builds. Here is another one: https://github.com/stainless-api/stainless/actions/runs/4103099342/jobs/7076927762

rattrayalex avatar Feb 06 '23 16:02 rattrayalex

👋🏽 @rattrayalex, are these private repos? Unfortunately we can't see the logs. Can you please elaborate on the issue you are facing?

kotewar avatar Feb 07 '23 09:02 kotewar

Ah, OK. Basically the network hangs for many minutes (often long enough to hit our 20-minute timeout), same as in the other screenshots posted here.

Yesterday this was happening on almost every PR for us.

I'd recommend you reach out to @adamshiervani for more details, as they've done some investigation of this at BuildJet.

rattrayalex avatar Feb 07 '23 12:02 rattrayalex

Also noticing this behavior while using buildjet: https://github.com/FuelLabs/fuel-core/actions/runs/4179573798/jobs/7239680006

Using the rust-cache action sporadically takes several times longer than a clean build. On average this makes self-hosted runners perform worse overall than basic GitHub runners (even if the actual execution stage is much faster).

Voxelot avatar Feb 15 '23 01:02 Voxelot

Same here: https://github.com/mlflow/mlflow/actions/runs/4179427251/jobs/7239362333

2023-02-15T00:40:21.2445589Z ##[group]Run actions/cache@v3
2023-02-15T00:40:21.2445836Z with:
2023-02-15T00:40:21.2446061Z   path: ~/.cache/pip
2023-02-15T00:40:21.2446490Z   key: ubuntu22-20230206.1-pip-13695570156a8b8ebf8bca8ca347ebfc4fcff41ffe45d4d97e95f8e36eaf2f63
2023-02-15T00:40:21.2447046Z   restore-keys: ubuntu22-20230206.1-pip-

2023-02-15T00:40:21.2447338Z   enableCrossOsArchive: false
2023-02-15T00:40:21.2447609Z   fail-on-cache-miss: false
...
2023-02-15T00:40:21.2451931Z ##[endgroup]
2023-02-15T00:40:27.6090886Z Received 167772160 of 4156517911 (4.0%), 159.8 MBs/sec
2023-02-15T00:40:28.6050840Z Received 444596224 of 4156517911 (10.7%), 211.7 MBs/sec
2023-02-15T00:40:29.6171696Z Received 713031680 of 4156517911 (17.2%), 225.5 MBs/sec
2023-02-15T00:40:30.6215322Z Received 1015021568 of 4156517911 (24.4%), 240.8 MBs/sec
2023-02-15T00:40:31.6228063Z Received 1279262720 of 4156517911 (30.8%), 243.0 MBs/sec
2023-02-15T00:40:32.6320950Z Received 1593835520 of 4156517911 (38.3%), 252.1 MBs/sec
2023-02-15T00:40:33.6329582Z Received 1849688064 of 4156517911 (44.5%), 250.9 MBs/sec
2023-02-15T00:40:34.6337576Z Received 2059403264 of 4156517911 (49.5%), 244.5 MBs/sec
2023-02-15T00:40:42.4483316Z Received 2147483647 of 4156517911 (51.7%), 129.2 MBs/sec
2023-02-15T00:40:43.4502056Z Received 2336227327 of 4156517911 (56.2%), 132.2 MBs/sec
2023-02-15T00:40:44.4515849Z Received 2533359615 of 4156517911 (60.9%), 135.4 MBs/sec
2023-02-15T00:40:45.4518907Z Received 2726297599 of 4156517911 (65.6%), 137.9 MBs/sec
2023-02-15T00:40:46.4524639Z Received 2940207103 of 4156517911 (70.7%), 141.3 MBs/sec
2023-02-15T00:40:47.4536620Z Received 3112173567 of 4156517911 (74.9%), 142.3 MBs/sec
2023-02-15T00:40:48.4555027Z Received 3288334335 of 4156517911 (79.1%), 143.5 MBs/sec
2023-02-15T00:40:49.4555218Z Received 3481272319 of 4156517911 (83.8%), 145.3 MBs/sec
2023-02-15T00:40:50.4570309Z Received 3653238783 of 4156517911 (87.9%), 146.0 MBs/sec
2023-02-15T00:40:51.4567892Z Received 3837788159 of 4156517911 (92.3%), 147.3 MBs/sec
2023-02-15T00:40:52.4566377Z Received 4001366015 of 4156517911 (96.3%), 147.6 MBs/sec
2023-02-15T00:40:53.4585354Z Received 4152323607 of 4156517911 (99.9%), 147.4 MBs/sec
2023-02-15T00:40:54.4606220Z Received 4152323607 of 4156517911 (99.9%), 142.1 MBs/sec
2023-02-15T00:40:55.4619377Z Received 4152323607 of 4156517911 (99.9%), 137.2 MBs/sec
2023-02-15T00:40:56.4631692Z Received 4152323607 of 4156517911 (99.9%), 132.6 MBs/sec
2023-02-15T00:40:57.4645638Z Received 4152323607 of 4156517911 (99.9%), 128.3 MBs/sec
2023-02-15T00:40:58.4666590Z Received 4152323607 of 4156517911 (99.9%), 124.3 MBs/sec
2023-02-15T00:40:59.4672193Z Received 4152323607 of 4156517911 (99.9%), 120.5 MBs/sec
2023-02-15T00:41:00.4690192Z Received 4152323607 of 4156517911 (99.9%), 116.9 MBs/sec
2023-02-15T00:41:01.4702672Z Received 4152323607 of 4156517911 (99.9%), 113.6 MBs/sec
2023-02-15T00:41:02.4720360Z Received 4152323607 of 4156517911 (99.9%), 110.4 MBs/sec
2023-02-15T00:41:03.4732382Z Received 4152323607 of 4156517911 (99.9%), 107.4 MBs/sec
2023-02-15T00:41:04.4747315Z Received 4152323607 of 4156517911 (99.9%), 104.6 MBs/sec

harupy avatar Feb 15 '23 02:02 harupy

I'm now seeing this on GitHub's larger hosted runners as well:

Run actions/setup-node@v3
Found in cache @ /opt/hostedtoolcache/node/16.19.0/x64
Environment details
/usr/local/bin/yarn --version
1.22.19
/usr/local/bin/yarn cache dir
/home/runner/.cache/yarn/v6
Received 243269632 of 424308030 (57.3%), 231.3 MBs/sec
Received 420113726 of 424308030 (99.0%), 199.8 MBs/sec
…
Received 420113726 of 424308030 (99.0%), 0.3 MBs/sec
Received 420113726 of 424308030 (99.0%), 0.3 MBs/sec
Received 420113726 of 424308030 (99.0%), 0.3 MBs/sec
Error: The operation was canceled.

We've had this happen a handful of times recently. This job was https://github.com/stainless-api/stainless/actions/runs/4232471756/jobs/7352268632 if that's useful to the GHA team.

rattrayalex avatar Feb 21 '23 18:02 rattrayalex

We're seeing it on normal GitHub-hosted runners now.

sammcj avatar Feb 21 '23 19:02 sammcj

@sammcj, we are aware of this problem on GitHub-hosted runners and are investigating it thoroughly.

kotewar avatar Feb 22 '23 06:02 kotewar

@adamshiervani The workflow you pointed to ran 18 days ago. Can you share anything that happened in the last week? A single instance of slowness might be an outlier, in which case an upgrade or downtime might have occurred. If this happens every few days, I'd like to know more about the frequency.

We currently use BuildJet and see this fairly regularly. Recent example here: https://github.com/figurerobotics/project-x/actions/runs/4389045495/jobs/7687049171

michael-projectx avatar Mar 13 '23 16:03 michael-projectx