
Two self-hosted runners incorrectly run the same job

Open DolceTriade opened this issue 2 years ago • 5 comments

Describe the bug If you launch multiple workflows that run on self-hosted runners, sometimes two runners will run the same job. That is, if we have self-hosted Runner A and Runner B and Workflow A and Workflow B, Runner A and Runner B will both run Workflow A, leaving no runner to run Workflow B.

The GHA UI only shows the output from one of the runners, but looking at the logs on the other runner, we can see that it is also running Workflow A while Workflow B remains queued.

To Reproduce This doesn't happen all the time, but it happens at least once every ~50 attempts. Steps to reproduce the behavior:

  1. Launch multiple workflows that run on self hosted runners
  2. Watch two self hosted runners pick up and run the same workflow.

Expected behavior Each workflow is picked up by exactly one runner.

Runner Version and Platform

Version of your runner? v2.311.0

OS of the machine running the runner? Linux x86_64

DolceTriade avatar Nov 28 '23 05:11 DolceTriade

Hi, could you please clarify whether separate runners are running the same job in the same workflow run, or if different runners are running the same job in two different workflow runs.

Depending on what triggers you have on a workflow, it's very possible that you have multiple instances of Workflow A running at the same time, which would lead to the same job in Workflow A being run by different runners. To prevent this, you can use concurrency to ensure only one job runs at a time across all workflow runs of Workflow A.
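For example, a `concurrency` block along these lines (group name illustrative) would serialize runs of the same workflow on the same ref:

```yaml
# Illustrative only: limit each ref to a single in-flight run of this workflow.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  # false queues the new run behind the old one; true cancels the older run.
  cancel-in-progress: false
```

Note that this deduplicates overlapping workflow runs; it does not address two runners picking up one and the same job within a single run.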

nicholasbergesen avatar Feb 19 '24 07:02 nicholasbergesen

Separate runners are running the same job in the same workflow run (i.e., the same job is run twice, simultaneously).

DolceTriade avatar Feb 20 '24 05:02 DolceTriade

Could you please share some more information like workflow triggers and the runner logs where the same job is picked up by multiple runners.

nicholasbergesen avatar Mar 01 '24 13:03 nicholasbergesen

An example workflow might be something like:

---
name: 'CI'

"on": push

permissions:
  contents: write
  id-token: write


concurrency:
  group: '${{ github.workflow }} @ ${{ github.event.pull_request.head.label || github.head_ref || github.ref }}'
  cancel-in-progress: true

jobs:
  ci:
    runs-on: ["self-hosted", "bazel", "nix"]
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - uses: DeterminateSystems/nix-installer-action@main
      - uses: DeterminateSystems/magic-nix-cache-action@main
      - uses: HatsuneMiku3939/direnv-action@v1
      - run: bazel test //...

Unfortunately I don't have logs, but nothing in the logs pointed to anything suspicious. Both runners had messages like "Running job ci".

In the end, we worked around this problem by:

  • including the runner id in the tags (i.e., adding tags: ["self-hosted", "bazel", "nix", "${{ github.runner_id }}", "workflow-name"])
  • propagating these tags to the JIT runner, to ensure that only the job that was provisioned for it can run on that runner. This is unfortunate, but it works well enough.

Again, we didn't do anything special.

DolceTriade avatar Mar 02 '24 06:03 DolceTriade

Do you also see the steps duplicated? I think we are also experiencing this under load on GHES 3.13. (screenshot attached)

joaopedrocg27 avatar Jan 30 '25 07:01 joaopedrocg27

Just adding my 2ct since I can reproduce this pretty much on each test run on a large scale set.

We have a set of 140 Windows 11 (full desktop) VMs on Azure, which spin up ephemeral runners to pick up jobs from a scale set. The jobs are created via a matrix strategy, which breaks the test run into 200-ish chunks. We have a custom tool that reads the scale set status, starts/stops the VMs, and configures the runners (this is based off ARC). The runners pick up jobs from the scale set directly; we do not interfere with that process, it's entirely up to GH to assign jobs to runners.
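For reference, the matrix fan-out described above is roughly this shape (job name, label, chunk count, and test driver script are all made up for illustration; the real setup uses ~200 chunks):

```yaml
# Hypothetical sketch of the matrix fan-out onto the "ui-tests" scale set.
jobs:
  ui-tests:
    runs-on: ui-tests        # scale set name, per the comment above
    strategy:
      fail-fast: false
      matrix:
        chunk: [1, 2, 3, 4]  # real setup: roughly 200 chunks
    steps:
      - run: .\run-tests.ps1 -Chunk ${{ matrix.chunk }}  # hypothetical test driver
```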

1-in-50 jobs gets assigned to two runners, causing a few issues downstream when we consolidate the test results.

Re prior questions: runner version is 2.325.0, running on Windows 11. As above, these test steps show up twice. (screenshot attached)

These are jobs of the same workflow run. The "Set up job" step shows that they indeed ran on different VMs. Our VM names in this case are "ui-tests"; the runner name is "ui-tests-<8 random characters>", and the scale set name is "ui-tests".

(screenshot attached)

The timing is not terribly close in this case: the second runner is assigned to the job almost a full minute after the first.

I'm happy to help debug this if someone from the GH team gets in touch.

ThomasMatern avatar Jun 23 '25 04:06 ThomasMatern

One issue is that nothing in the runner logs, or any other file, outputs the ID of the workflow job being run. I think this would help us understand whether the same job is on two runners at the same time.
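As a stopgap, a diagnostic step at the top of each job can at least write identifying context into the job log. The expressions below are standard GitHub Actions contexts; note that the numeric job ID itself is not exposed as a workflow context, which is part of the complaint, so this only correlates runs by run ID, attempt, job key, and runner name:

```yaml
# Diagnostic sketch: print identifying context so duplicated job
# pickups can be correlated across runner logs.
- name: Log job identity
  run: >
    echo "run_id=${{ github.run_id }}
    run_attempt=${{ github.run_attempt }}
    job=${{ github.job }}
    runner=${{ runner.name }}"
```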

joaopedrocg27 avatar Jul 02 '25 09:07 joaopedrocg27

After looking further into this, I think this is at least partially a problem with our implementation; it entirely disappeared once we added better error handling on our end.

We are using the scale set message queue endpoint to monitor runner status. The messages include "JobStarted" and "JobCompleted", which contain the runner name. All runners are ephemeral, so I would expect only one pair of these messages per runner, and I would expect the runner to no longer exist after a "JobCompleted" message. This is not the case. Under load, we regularly get a "JobStarted" and a "JobCompleted" message for the same runner in the same message bundle (scale set messages are batched), and the "result" field of the "JobCompleted" message is always "canceled" in this case.

In our previous implementation, we treated this as an indicator that the runner was gone, and started a new runner. The runners all have unique names, but they run in the same work folder, which is fine as long as only one runner is active at a time. It turns out that after these "JobCompleted" messages are sent, the runner is not always shut down, and sometimes even picks up a different job. When this happens, we end up with two runners active in the same work directory, which produces a lot of misleading logging information and in turn confuses the GH UI/workflow view.

Our fix/workaround is simple: after receiving a questionable "JobCompleted" message, we only mark the runner as potentially bad and take no immediate action. At regular intervals (once a minute), we check the /actions/runners endpoint for active runners and wait until the bad runners have disappeared from the list. If that hasn't happened after some time (5 minutes), and the runner didn't pick up a different job, we reboot the VM, which definitely stops that runner.

ThomasMatern avatar Jul 02 '25 20:07 ThomasMatern