runner-images Post Run actions/checkout@v4 failed randomly

Description

Hi,

For at least two months we have noticed a problem that occurs randomly in our nightly runs. Sometimes the last step, Post Run actions/checkout@v4, takes a very long time, up to 15 minutes, after which the workflow is either skipped or failed.

For skipped runs we get the error message "Hosted runner encountered an error while running your job. (Error type: Disconnect)." An example can be found here: https://github.com/IMGARENA/multisport-fastpath-scoring-app/actions/runs/10765405236

For failed runs we get the error message "Hosted runner: GitHub Actions 94 has lost communication with the server. Anything in the workflow that terminates the runner's process, deprives it of CPU/memory or blocks network access can cause this error." An example can be seen here: https://github.com/IMGARENA/multisport-fastpath-scoring-app/actions/runs/10712726268

We have added a step that monitors CPU and RAM consumption. So far, CPU consumption has peaked at 10%, and around 6 GB of RAM is still available after the tests have completed. Our workflow file is here: https://github.com/IMGARENA/multisport-fastpath-scoring-app/blob/develop/.github/workflows/run-e2e-tests.yml and the nightly workflow is here: https://github.com/IMGARENA/multisport-fastpath-scoring-app/blob/develop/.github/workflows/nightly-e2e-tests-without-comparator.yml
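A monitoring step of the kind described above could look roughly like the following. This is only a sketch, not the step from the linked workflow: the step name, the choice of performance counter, and the number of samples are assumptions made for illustration, and it assumes a Windows runner.

```yaml
# Hypothetical monitoring step; name, counter, and sample count are assumptions.
- name: Log CPU and memory usage
  if: always()            # run even when an earlier step has failed
  shell: powershell       # Windows PowerShell on the Windows runner
  run: |
    # Sample total CPU load three times, one second apart
    Get-Counter '\Processor(_Total)\% Processor Time' -SampleInterval 1 -MaxSamples 3 |
      ForEach-Object { $_.CounterSamples[0].CookedValue }
    # FreePhysicalMemory is reported in KB, so dividing by 1MB yields GB
    $os = Get-CimInstance Win32_OperatingSystem
    '{0:N2} GB free' -f ($os.FreePhysicalMemory / 1MB)
```

Placing such a step both before and after the tests makes it easier to compare resource usage at the point where the job later hangs.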

Could you please help us resolve this issue?

Platforms affected

[ ] Azure DevOps
[X] GitHub Actions - Standard Runners
[ ] GitHub Actions - Larger Runners

Runner images affected

[ ] Ubuntu 20.04
[ ] Ubuntu 22.04
[ ] Ubuntu 24.04
[ ] macOS 12
[ ] macOS 13
[ ] macOS 13 Arm64
[ ] macOS 14
[ ] macOS 14 Arm64
[ ] Windows Server 2019
[X] Windows Server 2022

Image version and build link

Version: 20240908.1.0

Is it regression?

https://github.com/IMGARENA/multisport-fastpath-scoring-app/actions/runs/10821818651

Expected behavior

Post Run actions/checkout@v4 step shouldn't take so much time and should finish successfully

Actual behavior

The Post Run actions/checkout@v4 step at the end of the workflow sometimes takes up to 15 minutes, after which the whole workflow is failed or skipped.

Repro steps

Go to https://github.com/IMGARENA/multisport-fastpath-scoring-app/actions/workflows/nightly-e2e-tests.yml
Click on the Run workflow button
Select the develop branch
Click on Run workflow

Sep 13 '24 13:09 korrem

Hi @korrem, thank you for bringing this issue to us. We are looking into it and will update you after investigating.

Sep 13 '24 13:09 hemanthmanga

Hi @korrem - I am unable to open the URL you provided, as it shows a 404 error. However, from your description, the issue you are experiencing with the Post Run actions/checkout@v4 step, which randomly takes a long time or fails due to runner disconnection, could be related to various factors such as runner resource limitations, network instability, or GitHub service issues.

Here are some recommendations to help mitigate the problem:

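A. You can add a retry mechanism to the actions/checkout@v4 step to handle random failures. GitHub Actions supports continue-on-error at the step level to prevent the job from failing completely; there is no built-in retry keyword, so a retry has to be implemented by repeating the step or by using a third-party action.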

- name: Checkout Code
  uses: actions/checkout@v4
  with:
    fetch-depth: 0
  continue-on-error: true
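One way to approximate a retry is to repeat the checkout conditionally. The following is a minimal sketch, not a confirmed fix for this issue: the step names and the `checkout_first` id are hypothetical. It relies on `steps.<id>.outcome`, which holds the result of a step before `continue-on-error` is applied.

```yaml
# Hypothetical two-attempt checkout; step names and ids are assumptions.
- name: Checkout Code (first attempt)
  id: checkout_first
  uses: actions/checkout@v4
  continue-on-error: true     # do not fail the job if this attempt fails

- name: Checkout Code (retry)
  # outcome reflects the real result, ignoring continue-on-error
  if: steps.checkout_first.outcome == 'failure'
  uses: actions/checkout@v4
```

Note that this only retries the main checkout step. Whether it has any effect on the Post Run phase, where the hang is reported, is not established in this thread.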

B. You can also check whether the runner timeout is set too aggressively. Increasing the timeout might prevent early termination.

timeout-minutes: 30  # Example to increase timeout if needed

C. In addition, if the issue persists and is critical, consider using a self-hosted runner, which gives more control over resource allocation and network stability. This might avoid disconnection errors.
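For the timeout suggestion in B, `timeout-minutes` can be set on a whole job or on an individual step. The fragment below is a sketch under assumptions: the job name, the test step, and its command are placeholders and are not taken from the linked workflow.

```yaml
jobs:
  e2e-tests:                     # hypothetical job name
    runs-on: windows-2022
    timeout-minutes: 30          # job-level limit, applies to the whole job
    steps:
      - uses: actions/checkout@v4
      - name: Run tests          # hypothetical step
        run: npm test            # placeholder command, an assumption
        timeout-minutes: 20      # step-level limit for this step only
```

A job-level limit caps how long a hung job can keep running, while a step-level limit only bounds the step it is attached to.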

D. Git Shallow Clone: To reduce the time spent in the checkout step, ensure that you're not fetching unnecessary history.

- uses: actions/checkout@v4
  with:
    fetch-depth: 1  # Fetch only the latest commit

E. Also, since the error mentions loss of communication with the server, add network-related logging or monitoring to see whether spikes in network latency or connection drops might be affecting the workflow.
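A simple form of the network logging in E is a step that tests a TCP connection to GitHub and prints the result. This is a sketch that assumes a Windows runner; the step name and the choice of `github.com` on port 443 as the target are assumptions.

```yaml
# Hypothetical connectivity check; step name and target host are assumptions.
- name: Check connectivity to GitHub
  if: always()            # run even when an earlier step has failed
  shell: powershell       # Windows PowerShell on the Windows runner
  run: |
    # Open a TCP connection to port 443 and report whether it succeeded
    Test-NetConnection -ComputerName github.com -Port 443 |
      Select-Object ComputerName, RemotePort, TcpTestSucceeded
```

Running the same step before and after the tests shows whether connectivity differs between the start and the end of the job.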

Hopefully, these changes should help improve the stability of the actions/checkout@v4 step.

Sep 19 '24 13:09 Prabhatkumar59

Prabhatkumar59, thanks for your message. I'll try options A and D first, and the rest if they don't help. I will let you know whether it helped.

Oct 02 '24 10:10 korrem

Hi @korrem - Sure, let me know. Hopefully the changes I suggested will help improve the stability.

Oct 04 '24 16:10 Prabhatkumar59

Hi @korrem - Since we haven't heard back, we'll assume your issue is resolved and will close this issue for now. Feel free to reach out to us for any other queries. Thanks.

Oct 15 '24 09:10 Prabhatkumar59

Hi Prabhatkumar59, apologies for the long wait for the results. Unfortunately, none of your advice helped.

The continue-on-error and retry mechanism didn’t work.
The timeout-minutes setting only increased the running time of the entire workflow. The post-run checkout action usually takes at most 2 seconds, so a longer timeout didn’t help; it only prolonged the agony.
We are blocked by the company from using a self-hosted runner, so this option wasn’t viable.
Setting fetch-depth to 1 also didn’t help.
We added monitoring steps to check whether the internet connection was somehow lost, but the steps before and after the tests indicate that everything is fine.
In addition, I have removed unnecessary steps that are usually skipped, but this has not helped either.

The strange thing is that all steps are shown as passed except the last one (Post Run actions/checkout@v4), but the logs look as if the hosted runner never started the job at all (logs and workflow screenshot attached).

I would be grateful for any other help.

Oct 25 '24 13:10 korrem

@korrem, were you able to resolve the issue?

Nov 17 '24 13:11 Manmp

Hi @Manmp, unfortunately I wasn't. I am still getting the problem with the last step, Post Run actions/checkout@v4. It now happens every day in our nightly run, and it is really frustrating.

Dec 04 '24 07:12 korrem