
feat: add additional labels

Open pharindoko opened this issue 2 years ago • 24 comments

Hey @kichik,

we have one issue with runners registered on org level (and I assume on repo level as well). When multiple jobs run in parallel, it seems the wrong job can get the runner, leaving other jobs still waiting for a runner.

I tried to add additional labels to the runner labels but have seen that this will be blocked by the webhook script. https://github.com/CloudSnorkel/cdk-github-runners/blob/0fc4be5f4ca524d4fe9f0c5056ff5466c2f89332/src/webhook-handler.lambda.ts#L148

feature-request: I would like to have the possibility to add additional labels to a workflow: [self-hosted, large-runner, additional-label-for-workflow, additional-label-for-repo, additional-label]

This would mean you only check for the required labels [self-hosted, large-runner]; if those match, the job is forwarded.

pharindoko avatar Nov 22 '23 09:11 pharindoko

Can you please include the specific labels used? It will be easier to discuss with a concrete example.

kichik avatar Nov 25 '23 16:11 kichik

As mentioned, this might occur no matter whether the runners are registered on org or repo level. But it will happen more often on org level because more jobs are executed there.

Background:

We register the runners now on org level - meaning they are visible to all repositories the GitHub app is configured for in an organization.

The runner has the labels [self-hosted, large-runner]. We have workflows in two different repositories A and B in the same org that use these labels.

We trigger workflow A in repository A using workflow dispatch. This workflow A gets queued and queued...

In the meantime, another workflow B, which has already been waiting 15 hours for a runner with the labels [self-hosted, large-runner], takes over this runner. There can be several reasons why this happens, e.g. a user has a public repository but missed selecting the Allow public repositories setting on the organization's runner group.

I can see this directly in the CloudWatch logs, because in this case the workflow job failed and returned Job FrontendDeployment completed with result: Failed

Job FrontendDeployment is only available in repository B for workflow B. Step function succeeded, EC2 terminated... Workflow A still waits for a runner ...

pharindoko avatar Nov 28 '23 08:11 pharindoko

That would be one of the organization level registration cons:

https://github.com/CloudSnorkel/cdk-github-runners/blob/994f474deb38c904d94d2ad824b0493bac09d942/setup/src/App.svelte#L333C1-L334C49

You can bring up more runners to replace the stolen ones by creating a new step function execution. Open an existing one from the UI and hit the New Execution button.

I'm still not sure what you originally meant with the labels. Where would the label be added to resolve this?

kichik avatar Nov 28 '23 16:11 kichik

I assume you can have this issue no matter whether it's an org-level or repo-level runner. The probability is just higher on org level. Furthermore, as a user you might not see all repos in the org, and it's quite hard to find jobs that got stuck and are still waiting for a runner if you have a lot of repos in one org.

Here's a solution proposal:

  1. Runner in the CDK solution is created with labels: [label-for-instance-type]
  2. User creates a workflow with [self-hosted, label-for-instance-type, label-for-repository]
  3. The workflow gets triggered and the webhook handler lambda (webhook-handler.lambda.ts) receives the labels [self-hosted, label-for-instance-type, label-for-repository]. matchLabelsToProvider only checks that the runner labels [self-hosted, label-for-instance-type] are included.
  4. The step function determines a runner based on the labels that match: [self-hosted, label-for-instance-type]
  5. All labels [self-hosted, label-for-instance-type, label-for-repository] are injected into the runner and attached to the config.sh command.
  6. The runner is registered in GitHub with the labels [self-hosted, label-for-instance-type, label-for-repository]

This would mean that runner labels and workflow labels can be different and you can attach additional workflow labels.
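To make the intent of step 3 concrete, here is a minimal sketch of the subset check in TypeScript (hypothetical types and helper body, not the construct's actual matchLabelsToProvider implementation):

interface ProviderConfig {
  provider: string;
  labels: string[]; // labels the runner itself registers with, e.g. ['self-hosted', 'label-for-instance-type']
}

// Pick the first provider whose own labels are all contained in the labels requested by the
// workflow job; any extra job labels (e.g. label-for-repository) are simply carried along.
function matchLabelsToProvider(jobLabels: string[], providers: ProviderConfig[]): ProviderConfig | undefined {
  const requested = new Set(jobLabels.map((label) => label.toLowerCase()));
  return providers.find((p) => p.labels.every((label) => requested.has(label.toLowerCase())));
}

// Example: the job asks for more labels than the provider defines, but still matches.
const matched = matchLabelsToProvider(
  ['self-hosted', 'label-for-instance-type', 'label-for-repository'],
  [{ provider: 'large-runner', labels: ['self-hosted', 'label-for-instance-type'] }],
);
// matched?.provider === 'large-runner'; the full job label list would then be passed to config.sh (step 5),
// so the runner registers in GitHub with all three labels (step 6).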

I'll try to create a PR and test this to see if it could work. @kichik please let me know if you have concerns ...

pharindoko avatar Dec 05 '23 10:12 pharindoko

That won't prevent other repositories from using label-for-repository and still stealing the runner. Our label situation is already not the simplest to understand. I think this will make it even harder and still won't solve runner stealing.

kichik avatar Dec 05 '23 18:12 kichik

Well, it's not like people steal runners intentionally. And label-for-repository is just an example of a label.

You could use multiple labels and choose whatever you want. But it could happen that people just copy/paste those labels from another GitHub action in another repository by mistake. It would still heavily reduce the probability that one job steals another job's runner. And you can trace it in the logs as well.

pharindoko avatar Dec 05 '23 21:12 pharindoko

btw if you know a better solution for this dilemma, please let me know :)

pharindoko avatar Dec 05 '23 21:12 pharindoko

But it could happen that people just copy / paste those labels from another github action from another repository by mistake.

Exactly.

Have you considered creating multiple providers all using the same settings and image builder but with different labels? That feels like basically what you're asking for. Each runner will start with exactly the labels you want and for just the repositories you want.

btw if you know a better solution for this dilemma, please let me know :)

Repository level registration is the only thing I found. I do wish GitHub would let us aim specific runners at specific jobs, but they have no such option.

kichik avatar Dec 06 '23 02:12 kichik

But it could happen that people just copy / paste those labels from another github action from another repository by mistake.

Exactly.

Have you considered creating multiple providers all using the same settings and image builder but with different labels? That feels like basically what you're asking for. Each runner will start with exactly the labels you want and for just the repositories you want.

No chance. Too many repos overall. This is no fun even when you have more than 10 repos.

btw if you know a better solution for this dilemma, please let me know :)

Repository level registration is the only thing I found. I do wish GitHub would let us aim specific runners at specific jobs, but they have no such option.

What I struggle with is the administration permission on repo level. A single permission to manage runners on the repository level is still missing.

Another option

Add a schedule feature to start additional runners on a regular basis by

  • querying CloudWatch logs to get the organizations that used the runner solution, e.g. in the last 2 weeks, and starting runners
  • querying GitHub to get all jobs that have been queued longer than e.g. 6 hours and starting runners (see the sketch below)

The idleTimeout setting will kill those EC2 instances and deregister the runner by default if no job is picked up after 5 minutes, right?
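For the second bullet above, a rough sketch of what that GitHub-side query could look like (not part of the construct; the Octokit client, the repo list and the threshold are assumptions for illustration):

import { Octokit } from '@octokit/rest';

// Find jobs that have been queued longer than a threshold so that extra runners can be started for them.
async function findStaleQueuedJobs(octokit: Octokit, owner: string, repo: string, maxQueuedMs: number) {
  const stale: { runId: number; jobName: string; labels: string[] }[] = [];

  const runs = await octokit.rest.actions.listWorkflowRunsForRepo({ owner, repo, status: 'queued', per_page: 100 });
  for (const run of runs.data.workflow_runs) {
    const jobs = await octokit.rest.actions.listJobsForWorkflowRun({ owner, repo, run_id: run.id, per_page: 100 });
    for (const job of jobs.data.jobs) {
      const queuedForMs = Date.now() - new Date(job.created_at).getTime();
      if (job.status === 'queued' && queuedForMs > maxQueuedMs) {
        stale.push({ runId: run.id, jobName: job.name, labels: job.labels });
      }
    }
  }
  return stale; // each entry's labels can then be matched to a provider and a new state machine execution started
}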

@kichik please give me feedback. I could try to create a draft PR for this topic.

pharindoko avatar Dec 06 '23 08:12 pharindoko

No chance. Too many repos overall. This is no fun even when you have more than 10 repos.

Wouldn't you have the same problem with your proposed solution? You'd have to specify all those labels somewhere. Maybe your CDK code can automatically generate it using gh?

The schedule option is something I have considered often, for many reasons. I am still not sure exactly what I want it to look like or if it should even be a schedule and not some kind of monitoring system for stolen/missing runners.

You can already define a schedule manually, if you need to. I have done something similar for one of my installations.

import * as cdk from 'aws-cdk-lib';
import { aws_events as events, aws_events_targets as targets, aws_stepfunctions as stepfunctions } from 'aws-cdk-lib';

// 'runners' is the GithubRunners construct defined elsewhere in the stack
const sm = runners.node.findChild('Runner Orchestrator') as stepfunctions.StateMachine;
const rule = new events.Rule(this, 'Start Runner Rule', {
  schedule: events.Schedule.rate(cdk.Duration.hours(1)),
  targets: [
    new targets.SfnStateMachine(
      sm,
      {
        input: events.RuleTargetInput.fromObject({
          "owner": "my-org",
          "repo": "my-repo",
          "jobId": 0,
          "jobUrl": "manual-runner",
          "installationId": xxx, // TODO get from github secret
          "labels": {
            "self-hosted": true,
            "something": true, // TODO update
          },
          "provider": "stack-name/provider-id" // TODO update
        })
      }
    )
  ],
});

kichik avatar Dec 06 '23 14:12 kichik

No chance. Too many repos overall. This is no fun even when you have more than 10 repos.

Wouldn't you have the same problem with your proposed solution? You'd have to specify all those labels somewhere. Maybe your CDK code can automatically generate it using gh?

It doesn't happen all the time, right? And for most of the users it's quite OK how it works, even with org runners. But for some of them it's not acceptable when it's used for some production deployments. And for those it would be fine if they need to attach additional labels to make it more reliable.

The schedule option is something I have considered often, for many reasons. I am still not sure exactly what I want it to look like or if it should even be a schedule and not some kind of monitoring system for stolen/missing runners.

You can already define a schedule manually, if you need to. I have done something similar for one of my installations.

const sm = runners.node.findChild('Runner Orchestrator') as stepfunctions.StateMachine;
const rule = new events.Rule(this, 'Start Runner Rule', {
  schedule: events.Schedule.rate(cdk.Duration.hours(1)),
  targets: [
    new targets.SfnStateMachine(
      sm, 
      {
        input: events.RuleTargetInput.fromObject({
          "owner": "my-org",
          "repo": "my-repo",
          "jobId": 0,
          "jobUrl": "manual-runner",
          "installationId": xxx, // TODO get from github secret
          "labels": {
            "self-hosted": true,
            "something": true, // TODO update
          },
          "provider": "stack-name/provider-id" // TODO update
        }) 
      }
    )
  ],
});

Hmm, that's a nice idea. But I can't do this statically in CDK code for all repos. I would get lost.

Create a schedule to trigger a Lambda. The Lambda queries CloudWatch Logs Insights to get the owner, repo and labels aggregated for the last 2 weeks, then goes through all rows found and triggers the state machine. I'll try it. I still would like to have this as part of the CDK construct :)
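A minimal sketch of such a Lambda, assuming the orchestrator's log group actually exposes owner, repo and labels fields (the query string, field names and environment variables here are placeholders, not the construct's API):

import { CloudWatchLogsClient, StartQueryCommand, GetQueryResultsCommand } from '@aws-sdk/client-cloudwatch-logs';
import { SFNClient, StartExecutionCommand } from '@aws-sdk/client-sfn';

const logs = new CloudWatchLogsClient({});
const sfn = new SFNClient({});

// Aggregate the owner/repo/labels combinations seen in the last two weeks and start one
// orchestrator execution per combination.
export async function handler() {
  const { queryId } = await logs.send(new StartQueryCommand({
    logGroupName: process.env.LOG_GROUP_NAME!,
    startTime: Math.floor((Date.now() - 14 * 24 * 3600 * 1000) / 1000),
    endTime: Math.floor(Date.now() / 1000),
    // assumes the state machine logs its input with these field names
    queryString: 'stats count(*) by owner, repo, labels',
  }));

  // poll until the Insights query finishes (simplified; a real implementation should bound this loop)
  let results = await logs.send(new GetQueryResultsCommand({ queryId }));
  while (results.status === 'Scheduled' || results.status === 'Running') {
    await new Promise((resolve) => setTimeout(resolve, 2000));
    results = await logs.send(new GetQueryResultsCommand({ queryId }));
  }

  for (const row of results.results ?? []) {
    const get = (field: string) => row.find((column) => column.field === field)?.value;
    await sfn.send(new StartExecutionCommand({
      stateMachineArn: process.env.STATE_MACHINE_ARN!,
      input: JSON.stringify({
        owner: get('owner'),
        repo: get('repo'),
        labels: JSON.parse(get('labels') ?? '{}'),
        jobId: 0,
        jobUrl: 'scheduled-runner',
        // installationId and provider still need to be filled in, as in the CDK example earlier in the thread
      }),
    }));
  }
}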

pharindoko avatar Dec 06 '23 15:12 pharindoko

It doesn't happen all the time, right? And for most of the users it's quite OK how it works, even with org runners. But for some of them it's not acceptable when it's used for some production deployments. And for those it would be fine if they need to attach additional labels to make it more reliable.

You can create a provider just for those special cases then. Isn't that basically what you were asking for? What am I missing?

Hmm, that's a nice idea. But I can't do this statically in CDK code for all repos. I would get lost.

Since you're registering on organization level, the repo part doesn't really matter. You can leave it as a placeholder. No need for Lambda or anything.

kichik avatar Dec 06 '23 19:12 kichik

Ok, I found a solution that I guess is quite OK to circumvent this problem:

Added a construct with a step function and a few Lambdas, with the following workflow:

  1. On the left side of the step function I get all the information from the state machine logs about the jobs transmitted (repo, owner, job, labels).
  2. On the right side I query GitHub in a Lambda, authenticated as the GitHub app, to get all the organisation installations.
  3. Based on the "owner"/"organisation" property I merge the information.
  4. I check the jobs and their current state (has the job been queued for longer than 10 minutes?).
  5. I trigger new step function executions to get new runners with the specific labels registered in the right organisation.

I trigger this step function every half hour.

What would be great to get as output from the GithubRunners construct:

  • the state machine ARN (used to trigger new step function executions)
  • the log group name (used for the CloudWatch Insights query)

Currently this is what I need to execute the construct:

const stateMachine = githubRunners.node.children.find((child) => child instanceof stepfunctions.StateMachine)! as stepfunctions.StateMachine;
const logGroupName = `${this.stackName}-state-machine`;

new GithubScheduledRunnerHandler(this, 'github-scheduled-runner', {
  internalSubnetIds: props.internalSubnetIds,
  vpcId: props.vpcId,
  logGroupName,
  githubSecretId: githubRunners.secrets.github.secretName,
  githubPrivateKeySecretId: githubRunners.secrets.githubPrivateKey.secretName,
  mappingTable: providers.map((provider) => { return { "provider": provider.toString(), "labels": provider.labels } }),
  stateMachineArn: stateMachine.stateMachineArn,
  githubRunnerWaitTimeInMinutes: 10,
});

pharindoko avatar Jan 04 '24 12:01 pharindoko

Would you be interested in integrating this construct into your solution? Should I add a PR or should I create a separate construct? I could share the code (in a simplified version, not a perfect construct) in a private repo if you want to get more details. Please let me know @kichik.

pharindoko avatar Jan 04 '24 12:01 pharindoko

I do want something like this one day, but I don't think we have all the details figured out yet. For example, when using ECS with a limited ASG, a job might be pending for >10 minutes just because the cluster is busy. If we keep submitting more jobs, it may only make the situation worse.

I do like that it doesn't rely solely on the webhook. I have seen cases where the webhook was just not called and so runners were missing.

kichik avatar Jan 06 '24 02:01 kichik

Please do share the code, if you can. I think I may have misunderstood the matching you were doing between jobs and the existing state machine. It might already cover the problem I mentioned.

kichik avatar Jan 06 '24 15:01 kichik

@kichik hope you have seen my invitation - created a private repo for this.

pharindoko avatar Jan 09 '24 09:01 pharindoko

@kichik hope you have seen my invitation - created a private repo for this.

I did. Thanks.

kichik avatar Jan 09 '24 09:01 kichik

After more than a week of running the scheduler step function, I found out there is a clear correlation between the webhook Lambda function error message ExecutionAlreadyExists and new runners triggered by my scheduler step function. It only seems to happen in this case.

Detailed error message in cloudwatch:

__type: com.amazonaws.swf.service.v2.model#ExecutionAlreadyExists
$fault: client
$metadata.attempts: 1
$metadata.httpStatusCode: 400
$metadata.requestId: f1cad492-8906-4967-9c99-f920a010bf90
$metadata.totalRetryDelay: 0
@ingestionTime: 1704816064769
@log: ********:/aws/lambda/gh-runner-ghrunnergithubrunnerWebhookHandlerwebhoo....
@logStream: 2024/01/09/[$LATEST]e18f1646a851483d93a0490663a2cde4
@message: 2024-01-09T16:00:56.047Z f8bf04d7-561b-4d7a-bbd0-b66f23121b91 ERROR Invoke Error {"errorType":"ExecutionAlreadyExists","errorMessage":"Execution Already Exists: 'arn:aws:states:********:********:execution:ghrunnergithubrunnerRunnerOrchestrator********:******'-44b33b20-af08-'","name":"ExecutionAlreadyExists","$fault":"client","$metadata":{"httpStatusCode":400,"requestId":"f1cad492-8906-4967-9c99-f920a010bf90","attempts":1,"totalRetryDelay":0},"__type":"com.amazonaws.swf.service.v2.model#ExecutionAlreadyExists","stack":["ExecutionAlreadyExists: Execution Already Exists: 'arn:aws:states:********:*******:execution:ghrunnergithubrunnerRunnerOrchestrator********:*****'"," at de_ExecutionAlreadyExistsRes (/var/runtime/node_modules/@aws-sdk/client-sfn/dist-cjs/protocols/Aws_json1_0.js:1694:23)"," at de_StartExecutionCommandError (/var/runtime/node_modules/@aws-sdk/client-sfn/dist-cjs/protocols/Aws_json1_0.js:1321:25)"," at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"," at async /var/runtime/node_modules/@aws-sdk/middleware-serde/dist-cjs/deserializerMiddleware.js:7:24"," at async /var/runtime/node_modules/@aws-sdk/middleware-signing/dist-cjs/awsAuthMiddleware.js:14:20"," at async /var/runtime/node_modules/@aws-sdk/middleware-retry/dist-cjs/retryMiddleware.js:27:46"," at async /var/runtime/node_modules/@aws-sdk/middleware-logger/dist-cjs/loggerMiddleware.js:7:26"," at async Runtime.handler (/var/task/index.js:13997:21)"]}
@requestId: f8bf04d7-561b-4d7a-bbd0-b66f23121b91
@timestamp: 1704816056047

pharindoko avatar Jan 12 '24 10:01 pharindoko

The name uses the GitHub webhook delivery id to remain unique.

https://github.com/CloudSnorkel/cdk-github-runners/blob/06ffc53c8067027748a4447d3f3b248fabb94131/src/webhook-handler.lambda.ts#L167

Do you maybe have the same webhook hooked up more than once somehow?

kichik avatar Jan 13 '24 18:01 kichik

The name uses the GitHub webhook delivery id to remain unique.

https://github.com/CloudSnorkel/cdk-github-runners/blob/06ffc53c8067027748a4447d3f3b248fabb94131/src/webhook-handler.lambda.ts#L167

Do you maybe have the same webhook hooked up more than once somehow?

Ok then I assume it might be the apigw/lambda retry mechanism... will check it.

pharindoko avatar Jan 13 '24 21:01 pharindoko

I assume that this really only happens when the apigw (my internal customlambdaaccess implementation) does a retry...

Seems like this occurs for 1% of the runner requests, and it's always the same correlation between "the stepfunction alreadyexists message" <> a 502 Bad Gateway error <> a queued job waiting for a runner in GitHub.

pharindoko avatar Feb 07 '24 09:02 pharindoko

Can you tell what's the initial failure? It makes sense for the second webhook retry to fail with "the stepfunction alreadyexists message" and return 502. But why would the first one fail and require a retry? Is it a timeout maybe?

kichik avatar Feb 07 '24 15:02 kichik

You're right, I should investigate this :) will do ... a timeout could be an option, yes.

pharindoko avatar Feb 07 '24 15:02 pharindoko

@pharindoko your error log shows a truncated delivery GUID in the execution id:

'arn:aws:states:********:********:execution:ghrunnergithubrunnerRunnerOrchestrator********:******'-44b33b20-af08-'"

Am I right to say that your repo's full name (<org>/<repo>) is pretty long?

We have a repo-level runner setup and are experiencing similar issues to what you described:

Seems like this occurs for 1% of the runner requests, and it's always the same correlation between "the stepfunction alreadyexists message" <> a 502 Bad Gateway error <> a queued job waiting for a runner in GitHub.

I caught a specific instance today and was able to track it down to 2 webhook deliveries from the same repo where the GUIDs only differ after the first X characters:

{
  "id": 153745296501,
  "guid": "089a2a60-4fc8-11f0-8181-e546625a3c66",
  "delivered_at": "2025-06-23T00:21:41Z",
  "redelivery": false,
  "duration": 0.73,
  "status": "502 Bad Gateway",
}
{
  "id": 153745296607,
  "guid": "089a2a60-4fc8-11f0-8f1c-58fc4d119409",
  "delivered_at": "2025-06-23T00:21:42Z",
  "redelivery": false,
  "duration": 0.71,
  "status": "OK",
}

And the error in the webhook lambda function was

Execution Already Exists: 'arn:aws:states:***:***:execution:<step-fn>:<myorg>-<my repo>-089a2a60-4fc8-11f0-8'

Note that the GUID got truncated like in your log, and it so happens in my case that the first 20 characters in the 2 GUIDs are identical, and one of them failed :(

The truncation seems to be needed as the execution name is limited to 64 characters:

https://github.com/CloudSnorkel/cdk-github-runners/blob/d27b5a12f7ee193dddfe03aaaf7b835a92b122e0/src/webhook-handler.lambda.ts#L179-L181

So the longer the repo name, the higher the chance of such a clash.
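For reference (assuming the standard 36-character UUID format GitHub uses for delivery GUIDs): with the 64-character execution-name limit and a one-character separator, any org-repo prefix longer than 64 - 36 - 1 = 27 characters starts cutting into the GUID, which is how only the first 20 GUID characters survived in the example above.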

jackie-linz avatar Jun 23 '25 03:06 jackie-linz

@kichik I think the executionName construction should be updated to preserve the entire GUID, to avoid this problem of orphaned jobs due to a truncated-GUID clash. E.g. something like

  // set execution name which is also used as runner name which are limited to 64 characters
  const deliveryId = getHeader(event, 'x-github-delivery') ?? `${Math.random()}`;
  const repoNameTruncated = payload.repository.full_name.replace('/', '-').slice(0, 64 - deliveryId.length - 1);
  const executionName = `${repoNameTruncated}-${deliveryId}`;

and optionally implement the following to avoid truncating the repo name too much:

  • use payload.repository.name instead of payload.repository.full_name - assuming the usage is likely limited to a single org
  • strip the - characters from the delivery GUID, which should save 4 characters
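A quick sanity check of the proposed construction (standalone TypeScript; the repo name is made up, the GUID is one of the delivery GUIDs from above):

const deliveryId = '089a2a60-4fc8-11f0-8f1c-58fc4d119409'; // 36 characters
const fullName = 'some-very-long-organisation/some-very-long-repository-name';
const repoNameTruncated = fullName.replace('/', '-').slice(0, 64 - deliveryId.length - 1);
const executionName = `${repoNameTruncated}-${deliveryId}`;
// the repo part is capped at 27 characters, the GUID is never cut, and the total stays within 64
console.log(executionName.length, executionName.length <= 64, executionName.endsWith(deliveryId)); // 64 true true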

jackie-linz avatar Jun 23 '25 03:06 jackie-linz

Nice debugging!

kichik avatar Jun 26 '25 19:06 kichik

@kichik I think the executionName construction should be updated to preserve the entire GUID, to avoid this problem of orphaned jobs due to a truncated-GUID clash. E.g. something like

  // set execution name which is also used as runner name which are limited to 64 characters
  const deliveryId = getHeader(event, 'x-github-delivery') ?? `${Math.random()}`;
  const repoNameTruncated = payload.repository.full_name.replace('/', '-').slice(0, 64 - deliveryId.length - 1);
  const executionName = `${repoNameTruncated}-${deliveryId}`;

and optionally implement the following to avoid truncating the repo name too much:

  • use payload.repository.name instead of payload.repository.full_name - assuming the usage is likely limited to a single org
  • strip the - characters from the delivery GUID, which should save 4 characters

I use it for multiple orgs so I need the full name...

pharindoko avatar Jun 27 '25 11:06 pharindoko

I use it for multiple orgs so I need the full name...

Can you please clarify the need? Do you often have to jump in and debug these that the full name is required? It's also available in the execution parameter, I believe. Would that be enough for your use case?

kichik avatar Jun 27 '25 12:06 kichik

@pharindoko just to clarify - as long as the entire GUID is preserved, it would work for multi-org setup.

Including the org name (and even the repo name) is only there to allow an admin to quickly search recent step function executions for a specific repo, in case there are any issues.

Do you have repos with the same name across multiple orgs? e.g.

  • org1/some-repo
  • org2/some-repo

This would be the scenario that is a bit harder to diagnose should the org name not be included (it is still possible, as kichik mentioned, since the full information is in the execution parameter).

jackie-linz avatar Jun 29 '25 21:06 jackie-linz