The self-hosted runner: xxx lost communication with the server
My issue is the same as the one reported in issue 2624. That issue was closed without a solution.
We use AWS CodeBuild as the self-hosted runner platform.
This issue is happening frequently in my repository too. I don't think the EC2 instance is being starved and dying, because the issue happens in different steps. We use an EC2 large instance (8 vCPUs, 15 GB of memory) to run the workflow.
Sometimes a step runs all the way to completion; other times it is aborted in the middle.
The workflow is below:
name: Validate Code
on:
  push:
    branches:
      - feature/*
      - bugfix/*
      - docs/*
      - dependabot/**
  merge_group:
    branches:
      - main
  workflow_call:
    inputs:
      since:
        type: string
concurrency:
  group: "validate-code-${{ github.ref }}"
  cancel-in-progress: ${{ inputs.since == '' }}
jobs:
  unit-tests:
    name: Unit Tests (Backend)
    runs-on:
      - codebuild-UniteGithubRunner-${{ github.run_id }}-${{ github.run_attempt }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 100
      - name: Fetch main to compare
        if: github.ref != 'refs/heads/main'
        run: git fetch origin main:main --depth=50
      - name: Create swap space
        uses: ./.github/actions/create-swap-space
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Monorepo install
        uses: ./.github/actions/yarn-nm-install
      - name: Run unit tests
        run: yarn workspaces foreach -vvRp --since=${{ inputs.since || 'main' }} --exclude "{agent,appview,authn,canvas,imagine,usage,voicegen,sidekick,taskbuilder}-service" --exclude "{imagine-base,imagine-api,auth-fe,parent-link-fe,resource-fe,ai-hub-fe,artifact-cards,feedback-fe,taskconfig-fe}" run test
  unit-tests-fe:
    name: Unit Tests (Frontend)
    runs-on:
      - codebuild-UniteGithubRunner-${{ github.run_id }}-${{ github.run_attempt }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 100
      - name: Fetch main to compare
        if: github.ref != 'refs/heads/main'
        run: git fetch origin main:main --depth=50
      - name: Create swap space
        uses: ./.github/actions/create-swap-space
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Monorepo install
        uses: ./.github/actions/yarn-nm-install
      - name: Run unit tests
        run: yarn workspaces foreach -vvR --since=${{ inputs.since || 'main' }} --include "{agent,appview,authn,canvas,imagine,usage,voicegen,sidekick,taskbuilder}-service" --include "{imagine-base,imagine-api,auth-fe,parent-link-fe,resource-fe,ai-hub-fe,artifact-cards}" run test
  linting:
    name: Linting
    runs-on:
      - codebuild-UniteGithubRunner-${{ github.run_id }}-${{ github.run_attempt }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 100
      - name: Fetch main to compare
        if: github.ref != 'refs/heads/main'
        run: git fetch origin main:main --depth=50
      - name: Create swap space
        uses: ./.github/actions/create-swap-space
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Monorepo install
        uses: ./.github/actions/yarn-nm-install
      - name: Lint changed workspaces
        run: yarn workspaces foreach -vvRp --since=${{ inputs.since || 'main' }} run lint
        env:
          # increased memory because we were getting an out of memory error when running lint
          NODE_OPTIONS: "--max-old-space-size=8192"
      - name: Find unused files, dependencies and exports
        run: yarn knip
  type-checking:
    name: Type checking
    runs-on:
      - codebuild-UniteGithubRunner-${{ github.run_id }}-${{ github.run_attempt }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 100
      - name: Fetch main to compare
        if: github.ref != 'refs/heads/main'
        run: git fetch origin main:main --depth=50
      - name: Create swap space
        uses: ./.github/actions/create-swap-space
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Monorepo install
        uses: ./.github/actions/yarn-nm-install
      - name: Cache TypeScript build info
        uses: actions/cache@v4
        with:
          path: |
            **/tsconfig.tsbuildinfo
          key: ${{ runner.os }}-tsbuildinfo-${{ hashFiles('tsconfig.base.json', '**/tsconfig.json') }}
          restore-keys: |
            ${{ runner.os }}-tsbuildinfo-
      - name: Type check changed workspaces
        run: yarn workspaces foreach -vvRp --since=${{ inputs.since || 'main' }} run type-check --incremental
        env:
          # increased memory because we were getting an out of memory error when running type checking
          NODE_OPTIONS: "--max-old-space-size=8192"
It may just be a coincidence, but when I saw this issue happen, one of the jobs finished with an error (it ran to completion, but failed, say, because some unit test failed). Then another job that was running on a different runner stopped executing in the middle, and in the "Annotations" section I got that message: "The self-hosted runner: b4ac7d30-8387-4499-a899-f75d06e2941f lost communication with the server."
When I check that runner's logs on AWS, there is no error message. The build just stops in the middle.
The error is not tied to a single job; it can happen in any of the jobs in that workflow. Some jobs complete successfully, one completes with an error (for example a unit test, type-checking, or linting failure), and another job is apparently aborted in the middle.
This was the log on the AWS runner for one of the cases where this issue happened, this time on the linting job:
| timestamp | message |
|---------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1730730597587 | [Container] 2024/11/04 14:29:52.994950 Running on CodeBuild On-demand |
| 1730730597587 | [Container] 2024/11/04 14:29:52.994965 Waiting for agent ping |
| 1730730597587 | [Container] 2024/11/04 14:29:53.095829 Waiting for DOWNLOAD_SOURCE |
| 1730730597587 | [Container] 2024/11/04 14:29:53.542561 Phase is DOWNLOAD_SOURCE |
| 1730730597587 | [Container] 2024/11/04 14:29:53.579585 CODEBUILD_SRC_DIR=/codebuild/output/src838862602/src |
| 1730730597587 | [Container] 2024/11/04 14:29:53.579709 YAML location is /codebuild/readonly/buildspec.yml |
| 1730730597587 | [Container] 2024/11/04 14:29:53.581659 Processing environment variables |
| 1730730597587 | [Container] 2024/11/04 14:29:53.707724 No runtime version selected in buildspec. |
| 1730730597587 | [Container] 2024/11/04 14:29:53.886788 Moving to directory /codebuild/output/src838862602/src |
| 1730730597587 | [Container] 2024/11/04 14:29:53.889496 Unable to initialize cache download: no paths specified to be cached |
| 1730730597587 | [Container] 2024/11/04 14:29:54.128661 Configuring ssm agent with target id: codebuild:177c3ec4-a435-4fd8-966c-1d337021976b |
| 1730730597587 | [Container] 2024/11/04 14:29:54.164327 Successfully updated ssm agent configuration |
| 1730730597587 | [Container] 2024/11/04 14:29:54.164650 Registering with agent |
| 1730730597587 | [Container] 2024/11/04 14:29:54.198405 Phases found in YAML: 1 |
| 1730730597587 | [Container] 2024/11/04 14:29:54.198427 BUILD: 1 commands |
| 1730730597587 | [Container] 2024/11/04 14:29:54.198642 Phase complete: DOWNLOAD_SOURCE State: SUCCEEDED |
| 1730730597587 | [Container] 2024/11/04 14:29:54.198655 Phase context status code: Message: |
| 1730730597587 | [Container] 2024/11/04 14:29:54.265345 Entering phase INSTALL |
| 1730730597587 | [Container] 2024/11/04 14:29:54.266546 Phase complete: INSTALL State: SUCCEEDED |
| 1730730597587 | [Container] 2024/11/04 14:29:54.266561 Phase context status code: Message: |
| 1730730597587 | [Container] 2024/11/04 14:29:54.307895 Entering phase PRE_BUILD |
| 1730730597587 | [Container] 2024/11/04 14:29:54.309323 Phase complete: PRE_BUILD State: SUCCEEDED |
| 1730730597587 | [Container] 2024/11/04 14:29:54.309336 Phase context status code: Message: |
| 1730730597587 | [Container] 2024/11/04 14:29:54.342911 Entering phase BUILD |
| 1730730597587 | [Container] 2024/11/04 14:29:54.342930 Ignoring BUILD phase commands for self-hosted runner build. |
| 1730730597587 | [Container] 2024/11/04 14:29:54.378406 Checking if docker is running. Running command: docker version |
| 1730730597587 | GHA self-hosted runner build triggered by /actions/runs/11666276634/job/32480845110 |
| 1730730597587 | Creating GHA self-hosted runner workspace folder: actions-runner |
| 1730730597587 | Downloading GHA self-hosted runner binary |
| 1730730597587 | % Total % Received % Xferd Average Speed Time Time Time Current |
| 1730730597587 | Dload Upload Total Spent Left Speed |
| 1730730599632 | 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 18 136M 18 24.9M 0 0 45.9M 0 0:00:02 --:--:-- 0:00:02 45.9M 62 136M 62 84.9M 0 0 57.2M 0 0:00:02 0:00:01 0:00:01 57.2M 100 136M 100 136M 0 0 57.0M 0 0:00:02 0:00:02 --:--:-- 57.0M |
| 1730730601648 | Configuring GHA self-hosted runner |
| 1730730615670 | -------------------------------------------------------------------------------- |
| 1730730615670 | | ____ _ _ _ _ _ _ _ _ | |
| 1730730615670 | | / ___(_) |_| | | |_ _| |__ / \ ___| |_(_) ___ _ __ ___ | |
| 1730730615670 | | | | _| | __| |_| | | | | '_ \ / _ \ / __| __| |/ _ \| '_ \/ __| | |
| 1730730615670 | | | |_| | | |_| _ | |_| | |_) | / ___ \ (__| |_| | (_) | | | \__ \ | |
| 1730730615670 | | \____|_|\__|_| |_|\__,_|_.__/ /_/ \_\___|\__|_|\___/|_| |_|___/ | |
| 1730730615670 | | Self-hosted runner registration | |
| 1730730615670 | # Authentication |
| 1730730615670 | √ Connected to GitHub |
| 1730730617717 | # Runner Registration |
| 1730730617717 | √ Runner successfully added |
| 1730730617717 | √ Runner connection is good |
| 1730730617717 | # Runner settings |
| 1730730617717 | √ Settings Saved. |
| 1730730617717 | Running GHA self-hosted runner binary |
| 1730730619730 | √ Connected to GitHub |
| 1730730619730 | Current runner version: '2.320.0' |
| 1730730619730 | 2024-11-04 14:30:18Z: Listening for Jobs |
| 1730730621746 | 2024-11-04 14:30:19Z: Running job: Linting |
We have also been getting the same error for a couple of days: "The self-hosted runner: lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error."
I am facing the same error on AWS CodeBuild.
We are facing the same issue
+1
with both v2.320.0 and v2.321.0
We have also started to encounter this in the past 24 hours
We started having this issue this morning - upgrading to v2.321.0 seems to have resolved it
We already upgraded to v2.321.0 a few days ago but still run into this issue from time to time.
We have also kept encountering it over the past 3-4 days.
This sounds similar to what we're experiencing, and it started happening this week. We're not using CodeBuild, but runners running on ECS (https://github.com/CloudSnorkel/cdk-github-runners/) - jobs are picked up and completed, but nothing is being reported back to GitHub.
The symptoms first showed up as a hanging step that never terminated and would hit a timeout. Right now, I've taken down all resources and redeployed everything from scratch, and nothing is being reported back to GitHub at all. All runs just get stuck at "Waiting for a runner to pick up this job...", but the jobs still run.
We'll try switching back to GitHub hosted runners in the morning and see if we have time to troubleshoot a bit more.
The step that never terminates is super weird. We tried dropping in a timeout 120 <command> followed by a retry of the command, which made it recover. I don't see why terminating the child process and retrying it would re-establish a network connection, so I think this is unrelated.
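For reference, this is roughly what that workaround looked like as a workflow step (just a sketch; <command> is a placeholder for the step that was hanging, and this retries on any non-zero exit, not only on timeout):

```yaml
# Sketch of the timeout-and-retry workaround described above.
# "<command>" is a placeholder for the hanging step's actual command.
- name: Run with a 120s timeout and one retry
  run: |
    # GNU timeout kills the command after 120 seconds; if it exits non-zero
    # (including exit code 124 on timeout), run it once more.
    timeout 120 <command> || <command>
```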
We operate in eu-west-1.
Adding a bit more context - to my surprise I woke up with a run that was retriggered automatically 9 hours later - everything was reporting as intended. The logs for both runs are here:
They look largely identical (don't mind them being canceled - the run is intended to fail). The first run was never detected as being picked up, but the second run worked as intended.
@martinjlowm Has anything changed for you? Have you resolved the issue?
We have also been encountering random errors like this for the past 1-2 weeks. The runners are hosted on EC2 with runner version v2.321.0.
Is there a way to downgrade and prevent the runner from auto-upgrading?
Edit: I tried --disableupdate to pin the version, but all versions before 2.320.0 are deprecated and not usable.
Same here. We started experiencing this issue yesterday.
We've seen an increase of this issue over the last week as well.
Current runner version: '2.321.0'
Infra: Azure Kubernetes Service - 2 node pools (System and Worker)
- Worker node size: 32 vCPU, 128 GB memory (Standard_D32as_v5)
- Autoscaling (5 min, 15 max)

Runner Scalesets
- Standard runner resources:
  resources:
    limits:
      memory: "24Gi"
    requests:
      cpu: "8000m"
      memory: "24Gi"
- Docker-in-Docker (runner container):
  resources:
    limits:
      memory: "24Gi"
    requests:
      cpu: "8000m"
      memory: "24Gi"
When we first started having these issues it appeared to be caused by hitting CPU limits for the 8 vCPU nodes we were using at the time. We've since quadrupled our node sizes to 32 vCPU VMs with no cpu limits in the resources configs. Our jobs / workflows are not hitting the VM cpu limit at this point (most are not even close), but the issue is still occurring although seemingly not as frequently. As others have mentioned, there are no errors in the runner logs when the disconnect happens. The only error is the one listed in the workflow after what appears to be a timeout.
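For anyone trying the same change, this is roughly what it looks like in the values file for the gha-runner-scale-set chart (a sketch only; our requests/limits are shown as an example, not a recommendation, and your pod template may differ):

```yaml
# Sketch of the runner pod spec in a gha-runner-scale-set values.yaml:
# keep a memory limit, but drop the CPU limit so the runner process is not throttled.
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "8000m"
            memory: "24Gi"
          limits:
            memory: "24Gi"  # no cpu limit
```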
We are still seeing this issue. Has anyone found a fix for this problem?
Why is there no action by the code owner? This is a real and serious problem. AWS only supports the infrastructure. Action on the GitHub side is needed to solve the problem.
Unless this issue is resolved, GitHub Actions on Codebuild is completely useless and must be considered a failed product.
I may have found one solution.
If you have set Privileged mode to true in a CodeBuild project and you are running Elasticsearch, MySQL, etc. as service containers in a GitHub Action, they can consume a lot of memory and cause OOM, which stalls the runner process. Limiting their memory usage has alleviated the problem:
services:
  mysql:
    image: "public.ecr.aws/docker/library/mysql:8.0.28"
    ports:
      - 3306:3306
    env:
      TZ: "/usr/share/zoneinfo/Asia/Tokyo"
      MYSQL_ALLOW_EMPTY_PASSWORD: "yes"
    options: --memory-reservation "1024m" --oom-kill-disable=true
It is difficult to know for sure if this is the real cause, as it is not possible to debug the runner in detail.
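If it helps anyone else confirm this, a small diagnostic step like the one below (a sketch, assuming the privileged CodeBuild container is allowed to read the kernel ring buffer) will at least show whether the OOM killer fired during the job:

```yaml
# Hypothetical diagnostic step: surface OOM-killer activity and memory headroom
# at the end of the job, even when earlier steps failed.
- name: Check for OOM kills
  if: always()
  run: |
    # "|| true" keeps the step from failing when nothing matches or dmesg is blocked.
    dmesg | grep -iE "out of memory|oom-killer|killed process" || true
    free -m
```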
We ran into the same problem with a similar configuration, only with dynamic runners on EC2. We noticed that the job tries to run on a runner that has already been destroyed; that runner shows as offline in the settings.
When the runner is running, how are the jobs executed? I've never written a single line of C#, but it looks like the Listener and the Runner/Worker are all different binaries, right?
What I can see while it is running is that the runner "renews the job request" every minute:
[2025-02-12 11:40:12Z INFO JobDispatcher] Successfully renew job request 527161, job is valid till 02/12/2025 11:50:12
[2025-02-12 11:41:12Z INFO JobDispatcher] Successfully renew job request 527161, job is valid till 02/12/2025 11:51:12
When doing something CPU intensive (I think), it stops printing this log line. It then has 10 minutes to complete the job. If it does, everything is fine; if it takes longer, the "lost communication with the server" error happens.
I still see log output from the running job in the GitHub Actions UI, so the actual worker process seems to still be running.
I don't know if the JobDispatcher process dies, or if it is simply starved for CPU/memory by the Worker...
I have tried running the step with nice -n 15 BIN so that the JobDispatcher effectively gets higher priority. I guess I'll have to set up better monitoring on the instance to get more info about what is happening.
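In case anyone wants to try the same, this is the kind of step I mean (a sketch only; BIN is a placeholder for the real command and the 30-second sampling interval is arbitrary):

```yaml
# Hypothetical step: run the heavy command at lower priority while sampling
# memory and load in the background, so starvation of the runner's own
# Listener/JobDispatcher process becomes visible in the step log.
- name: Run heavy step at lower priority with resource sampling
  run: |
    # Print a timestamp, memory usage and load averages every 30 seconds.
    ( while true; do date; free -m; uptime; sleep 30; done ) &
    monitor_pid=$!
    trap 'kill "$monitor_pid"' EXIT
    # nice -n 15 lowers this command's priority relative to the runner process.
    nice -n 15 BIN
```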
I'm having the same problem too. It doesn't matter how many resources are made available to the runner.