
Action can wait for at least 27 minutes when no progress is being made on the download

kevinlul opened this issue 2 years ago • 34 comments

Example: https://github.com/ccmbioinfo/osmp/runs/6692774526?check_suite_focus=true

This hung on Received 1189052508 of 1193246812 (99.6%) for almost half an hour before we noticed and cancelled the workflow run.

kevinlul avatar Jun 01 '22 16:06 kevinlul

I get this error multiple times per day... my first thought was that the cache size was the cause, since it was around 1.5 GB... but the problem still happened even with a cache of about 150 MB.


fl250144 avatar Jun 02 '22 23:06 fl250144

@kevinlul @fl250144 Please provide full logs with step debugging enabled so that we can debug this further. Thank you.

vsvipul avatar Jun 05 '22 17:06 vsvipul

Hello, unfortunately the bug was not reproducible, as it cleared up when we cancelled the workflow and reran it. I think the network issues that may cause this bug (e.g. GitHub API hanging) would be difficult for us to reproduce.

kevinlul avatar Jun 06 '22 14:06 kevinlul

@kevinlul Please keep the step debugging on and provide the logs if it reoccurs anytime soon.

vsvipul avatar Jun 06 '22 15:06 vsvipul

37.zip @vsvipul Here it is. This is only one example; I have more if you need them.

fl250144 avatar Jun 08 '22 12:06 fl250144

We have experienced this multiple times as well.

Kurt-von-Laven avatar Jun 12 '22 00:06 Kurt-von-Laven

We have this issue as well when we use self-hosted runners hosted in Europe. At the same time, using the GitHub-provided runners, I don't get this :/

Anybody else getting this with their own runners?

genisd avatar Jun 13 '22 13:06 genisd

I have now encountered this behavior multiple times as well (on GitHub-provided runners). The first time, the cache download was stuck for 6 hours. As it happens only rarely and non-deterministically, I can't provide any debug logs, nor a link to the workflow, as it's a private repo.

For me this occurred while using the cache feature of actions/setup-node, which internally uses actions/cache.

Logs
Run actions/setup-node@v3
with:
  node-version-file: .nvmrc
  cache: yarn
  always-auth: false
  check-latest: false
  token: ***
Resolved .nvmrc as 16
Found in cache @ /opt/hostedtoolcache/node/16.15.0/x64
/usr/local/bin/yarn --version
1.22.18
/usr/local/bin/yarn cache dir
/home/runner/.cache/yarn/v6
Received 0 of 1585939639 (0.0%), 0.0 MBs/sec
Received 83886080 of 1585939639 (5.3%), 40.0 MBs/sec
Received 222298112 of 1585939639 (14.0%), 70.7 MBs/sec
Received 348127232 of 1585939639 (22.0%), 83.0 MBs/sec
Received 494927872 of 1585939639 (31.2%), 94.3 MBs/sec
Received 612368384 of 1585939639 (38.6%), 97.2 MBs/sec
Received 713031680 of 1585939639 (45.0%), 97.0 MBs/sec
Received 817889280 of 1585939639 (51.6%), 97.4 MBs/sec
Received 910163968 of 1585939639 (57.4%), 96.3 MBs/sec
Received 1002438656 of 1585939639 (63.2%), 95.5 MBs/sec
Received 1107296256 of 1585939639 (69.8%), 95.9 MBs/sec
Received 1199570944 of 1585939639 (75.6%), 95.2 MBs/sec
Received 1296039936 of 1585939639 (81.7%), 95.0 MBs/sec
Received 1396703232 of 1585939639 (88.1%), 95.0 MBs/sec
Received 1497366528 of 1585939639 (94.4%), 95.1 MBs/sec
Received 1577551031 of 1585939639 (99.5%), 93.9 MBs/sec
Received 1577551031 of 1585939639 (99.5%), 88.4 MBs/sec
Received 1577551031 of 1585939639 (99.5%), 83.5 MBs/sec
Received 1577551031 of 1585939639 (99.5%), 79.1 MBs/sec
Received 1577551031 of 1585939639 (99.5%), 75.1 MBs/sec
Received 1577551031 of 1585939639 (99.5%), 71.6 MBs/sec
Received 1577551031 of 1585939639 (99.5%), 68.3 MBs/sec
Received 1577551031 of 1585939639 (99.5%), 65.3 MBs/sec
Received 1577551031 of 1585939639 (99.5%), 62.6 MBs/sec
Received 1577551031 of 1585939639 (99.5%), 60.1 MBs/sec
[.... truncated, thousand lines and several hours later...] 
Received 1577551031 of 1585939639 (99.5%), 0.1 MBs/sec
Received 1577551031 of 1585939639 (99.5%), 0.1 MBs/sec
Received 1577551031 of 1585939639 (99.5%), 0.1 MBs/sec
Received 1577551031 of 1585939639 (99.5%), 0.1 MBs/sec
Received 1577551031 of 1585939639 (99.5%), 0.1 MBs/sec
Error: The operation was canceled.

As a workaround I added timeout-minutes to the steps so the workflow will at least get canceled automatically.
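For reference, a minimal sketch of that workaround, mirroring the setup-node step from the log above (the 10-minute value is only an illustrative bound; pick something comfortably above a normal restore time):

    # Cap the step so a stuck cache download cancels after 10 minutes
    # instead of blocking the job for hours.
    - name: Set up Node.js
      uses: actions/setup-node@v3
      timeout-minutes: 10
      with:
        node-version-file: .nvmrc
        cache: yarn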

jontze avatar Jun 27 '22 11:06 jontze

We suspect an issue with the Azure SDK and have opened an issue for it here: https://github.com/Azure/azure-sdk-for-js/issues/22321

kotewar avatar Jun 27 '22 12:06 kotewar

This happened to my repo as well. 😞


nguyentoanit avatar Jun 28 '22 02:06 nguyentoanit

The same is happening in our repo as well: self-hosted macOS arm64, Cache Size: ~786 MB (823760266 B), action repository 'actions/cache@v3' (SHA: c3f1317a9e7b1ef106c153ac8c0f00fed3ddbc0d).

ichina avatar Jun 29 '22 03:06 ichina

Same problem here - gets stuck at 99.9% randomly.

Received 3103784959 of 3165194974 (98.1%), 28.0 MBs/sec
Received 3145727999 of 3165194974 (99.4%), 28.1 MBs/sec
Received 3161000670 of 3165194974 (99.9%), 27.9 MBs/sec
(...more lines with speed going down to 0.1...)
Received 3161000670 of 3165194974 (99.9%), 0.1 MBs/sec
Received 3161000670 of 3165194974 (99.9%), 0.1 MBs/sec
Error: The operation was canceled.

ghost avatar Jul 16 '22 15:07 ghost

Hi all,

We have released a new actions/cache version, v3.0.7, that no longer lets a stuck runner run for 6 hours; instead it aborts the download and exits the cache restore step if the download doesn't complete within an hour. Users of actions/cache@v3 don't need to make any changes and should see this behaviour in their action runs starting today.

This has been done as an interim fix, as we are still waiting for the Azure team to release a fix for the underlying issue: https://github.com/Azure/azure-sdk-for-js/issues/22321

kotewar avatar Aug 11 '22 09:08 kotewar

This is a big improvement; thank you! Is there any way to configure the timeout? 1 hour still hurts when our workflow typically runs in ~2.5 minutes.

Kurt-von-Laven avatar Aug 11 '22 18:08 Kurt-von-Laven

Hi @Kurt-von-Laven 👋🏽 For now we've set it to one hour because we didn't want any customer to run into issues if their self-hosted runner is on low network bandwidth. As an improvement on top of this, we're looking at ways to make this better, and a configurable timeout is one of the options.

We'll get back regarding this soon. Thanks for your patience. 😊

kotewar avatar Aug 12 '22 05:08 kotewar

Hi everyone!

We have introduced an environment variable SEGMENT_DOWNLOAD_TIMEOUT_MINS to configure the timeout for the download of the cache segment. This way you won't have to wait for 1 hour anymore!

This feature is live and available within actions/cache@v3. No version change needed.
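
For example, a minimal sketch of setting it on a single cache step (the path, key, and 10-minute value are placeholders):

    - uses: actions/cache@v3
      env:
        # Give up on a stuck segment download after 10 minutes
        SEGMENT_DOWNLOAD_TIMEOUT_MINS: 10
      with:
        path: ~/.cache/yarn
        key: ${{ runner.os }}-yarn-${{ hashFiles('**/yarn.lock') }}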

Hope it helps! 😊

kotewar avatar Aug 22 '22 15:08 kotewar

@kotewar Is there a reason this was done as an environment variable rather than an input (with:)? It's a little awkward to feed environment variables into uses:, unless I'm missing something obvious.

asottile-sentry avatar Aug 22 '22 16:08 asottile-sentry

> @kotewar Is there a reason this was done as an environment variable rather than an input (with:)? It's a little awkward to feed environment variables into uses:, unless I'm missing something obvious.

Actually you can set an environment variable for one specific step by using env: (in this case for the actions/cache@v3 step). An example can be found here: https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsenv

jontze avatar Aug 22 '22 17:08 jontze

Thanks @jontze :)

@asottile-sentry, we are still investigating the root cause of this issue and waiting for the Azure team to release a fix. Because of this, we don't yet have clarity on why the download gets stuck intermittently, so we don't know whether we'll need this timeout feature permanently or only until the underlying issue is fixed.

So while the investigation is in progress, we have introduced this option to make it easy for everyone to abort the download with a cache miss. This way runners won't hang for 6 hours without any progress. We'll add this field to the inputs in an upcoming version of actions/cache if it needs to stay permanently.

As an action item, I'll update the documentation to show the usage of this environment variable so that it helps other users.

kotewar avatar Aug 22 '22 17:08 kotewar

AFAICT (or maybe the schema I'm using isn't up to date), env: isn't supported on uses: steps in composite actions(?), so that makes this a tiny bit tricky.

asottile-sentry avatar Aug 22 '22 19:08 asottile-sentry

@asottile-sentry, when authoring a composite action, inputs must be mapped into inner steps using GitHub Actions expressions.

So in your case, try using something like this:

    - uses: actions/cache@v3
      name: "Cache"
      with:
        path: ${DIR}
        key: complex-cache
      env:
        SEGMENT_DOWNLOAD_TIMEOUT_MINS: 15

kotewar avatar Aug 23 '22 09:08 kotewar

Hmm, I'm seeing a situation where SEGMENT_DOWNLOAD_TIMEOUT_MIN seems not to have worked.

here's the workflow run: https://github.com/getsentry/sentry/actions/runs/3113974617/jobs/5049272594

it's a little obscured but the environment variable is set at the workflow level here: https://github.com/getsentry/sentry/blob/1ba48063464c06aea8f0a02b7fe797bfb6cab715/.github/workflows/frontend.yml#L15-L17

then a composite action is called here: https://github.com/getsentry/sentry/blob/1ba48063464c06aea8f0a02b7fe797bfb6cab715/.github/workflows/frontend.yml#L146

and then that composite action calls actions/cache here: https://github.com/getsentry/sentry/blob/1ba48063464c06aea8f0a02b7fe797bfb6cab715/.github/actions/setup-volta/action.yml#L13

the output of the job with the relevant section:

2022-09-23T15:50:17.5651366Z ##[group]Run actions/cache@56046cbc4743437ac40542086317b1561d7705f8
2022-09-23T15:50:17.5651651Z with:
2022-09-23T15:50:17.5651888Z   path: /home/runner/.cache/yarn/v6
2022-09-23T15:50:17.5652379Z   key: linux-volta-1.0.8-node-16.13.1-yarn-1.22.5-64094b67dee5167fb551a5fbe8757dd24aa02132dea9048f118c4c4d2f5ea311
2022-09-23T15:50:17.5652805Z env:
2022-09-23T15:50:17.5653046Z   SEGMENT_DOWNLOAD_TIMEOUT_MIN: 3
2022-09-23T15:50:17.5653322Z   VOLTA_HOME: /home/runner/.volta
2022-09-23T15:50:17.5653576Z ##[endgroup]
2022-09-23T15:50:19.2224966Z Received 0 of 180889536 (0.0%), 0.0 MBs/sec
2022-09-23T15:50:20.2230569Z Received 83886080 of 180889536 (46.4%), 39.9 MBs/sec
2022-09-23T15:50:21.2226308Z Received 176695232 of 180889536 (97.7%), 56.1 MBs/sec
2022-09-23T15:50:22.2239739Z Received 176695232 of 180889536 (97.7%), 42.1 MBs/sec
2022-09-23T15:50:23.2253786Z Received 176695232 of 180889536 (97.7%), 33.7 MBs/sec
2022-09-23T15:50:24.2268647Z Received 176695232 of 180889536 (97.7%), 28.1 MBs/sec
2022-09-23T15:50:25.2280039Z Received 176695232 of 180889536 (97.7%), 24.0 MBs/sec
2022-09-23T15:50:26.2289787Z Received 176695232 of 180889536 (97.7%), 21.0 MBs/sec
2022-09-23T15:50:27.2297324Z Received 176695232 of 180889536 (97.7%), 18.7 MBs/sec
2022-09-23T15:50:28.2309270Z Received 176695232 of 180889536 (97.7%), 16.8 MBs/sec
2022-09-23T15:50:29.2313534Z Received 176695232 of 180889536 (97.7%), 15.3 MBs/sec
2022-09-23T15:50:30.2327106Z Received 176695232 of 180889536 (97.7%), 14.0 MBs/sec
2022-09-23T15:50:31.2340917Z Received 176695232 of 180889536 (97.7%), 12.9 MBs/sec
2022-09-23T15:50:32.2356730Z Received 176695232 of 180889536 (97.7%), 12.0 MBs/sec
2022-09-23T15:50:33.2368999Z Received 176695232 of 180889536 (97.7%), 11.2 MBs/sec
2022-09-23T15:50:34.2373159Z Received 176695232 of 180889536 (97.7%), 10.5 MBs/sec
2022-09-23T15:50:35.2386669Z Received 176695232 of 180889536 (97.7%), 9.9 MBs/sec
...

2022-09-23T16:41:49.0253357Z Received 176695232 of 180889536 (97.7%), 0.1 MBs/sec
2022-09-23T16:41:50.0265697Z Received 176695232 of 180889536 (97.7%), 0.1 MBs/sec
2022-09-23T16:41:51.0279084Z Received 176695232 of 180889536 (97.7%), 0.1 MBs/sec
2022-09-23T16:41:52.0291969Z Received 176695232 of 180889536 (97.7%), 0.1 MBs/sec
2022-09-23T16:41:53.0077523Z ##[error]The operation was canceled.

I expected setting the variable to 3 to make it time out after 3 minutes.

asottile-sentry avatar Sep 26 '22 18:09 asottile-sentry

I also just noticed an instance of SEGMENT_DOWNLOAD_TIMEOUT_MIN not being respected: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/3129583553/jobs/5078874393#step:4:1

djaglowski avatar Sep 26 '22 18:09 djaglowski

I noticed the same thing: I set SEGMENT_DOWNLOAD_TIMEOUT_MIN: 5, but it's still running after hours instead of timing out after 5 minutes like I would expect.

https://github.com/vercel/vercel/blob/0e4124f94c94ba7774a8c3ed73d3c0962f70b847/.github/workflows/test.yml#L69-L74

https://github.com/vercel/vercel/actions/runs/3130757219/jobs/5081381465

styfle avatar Sep 26 '22 22:09 styfle

Thanks for reporting this, @asottile-sentry, @djaglowski, @styfle.

Did this issue suddenly start popping up this week? And was the segment timeout working fine until last week?

kotewar avatar Sep 27 '22 04:09 kotewar

We've been running identical code (we pin GitHub Actions to a SHA), so I doubt it has changed. I've never seen it successfully time out a run, but GitHub Actions also doesn't give us enough insight into runs to know whether it has worked in the past or not.

asottile-sentry avatar Sep 27 '22 13:09 asottile-sentry

I can't say for sure that I've seen it work.

Speculating: it may well be that the env var never worked for us and that we just hadn't noticed until now. The new default behavior is such an improvement over the old (a 1-hour timeout with continuation vs. a 6-hour timeout with termination) that this is no longer a constant and obvious problem for us.

djaglowski avatar Sep 27 '22 13:09 djaglowski

Also, there was a period before the environment variable was released where the Azure service the cache action relies on had issues, drawing attention to this issue and motivating the introduction of the environment variable. To my knowledge, there has been no such widespread incident since then. We have experienced only one hang since that time, which I unfortunately cancelled manually.

Kurt-von-Laven avatar Sep 27 '22 13:09 Kurt-von-Laven

Worth checking whether an Azure SDK change was shipped that may break the env workaround. The preview of the SDK was done earlier this month. Hopefully they didn't GA it, or we didn't pick up the preview release.

bishal-pdMSFT avatar Sep 27 '22 16:09 bishal-pdMSFT

There was a suggestion from Feb 2022 (https://github.com/actions/cache/issues/720#issuecomment-1030412967) about using a step-level timeout-minutes, from before SEGMENT_DOWNLOAD_TIMEOUT_MIN was introduced.

    - name: Cache something
      uses: actions/cache@v2
      timeout-minutes: 5
      ...

marcindulak avatar Oct 08 '22 18:10 marcindulak