feat: add additional labels
Hey @kichik,
We have one issue with runners on org level (I assume it can happen on repo level as well). When multiple jobs run in parallel, it seems the wrong job can grab the runner, and some jobs are left over still waiting for a runner.
I tried adding additional labels to the runner labels, but saw that this is blocked by the webhook script. https://github.com/CloudSnorkel/cdk-github-runners/blob/0fc4be5f4ca524d4fe9f0c5056ff5466c2f89332/src/webhook-handler.lambda.ts#L148
feature-request: I would like to have the possibility to add additional labels to a workflow: [self-hosted, large-runner, additional-label-for-workflow, additional-label-for-repo, additional-label]
This would mean you only check for the required labels [self-hosted, large-runner]; if those match, the job is forwarded.
Can you please include the specific labels used? It will be easier to discuss with a concrete example.
As mentioned, this can occur no matter whether the runners are registered on org or repo level. But it will happen more often on org level because more jobs are executed there.
Background:
We now register the runners on org level, which means they are visible to all repositories the GitHub app is configured for in an organization.
The runner has the labels [self-hosted, large-runner]. We have workflows in two different repositories A and B in the same org that use these labels.
We trigger workflow A in repository A using workflow dispatch. Workflow A stays queued and queued...
In the meantime, another workflow B that had already been waiting 15 hours for a runner with the labels [self-hosted, large-runner] takes over this runner.
There can be several reasons why a job waits that long, e.g. the user has a public repository but missed enabling the organization runner group setting "Allow public repositories".
I can see this directly in the CloudWatch logs, because in this case the workflow job failed and returned Job FrontendDeployment completed with result: Failed
Job FrontendDeployment only exists in repository B for workflow B. The step function succeeded, the EC2 instance was terminated... and workflow A is still waiting for a runner ...
That would be one of the organization level registration cons:
https://github.com/CloudSnorkel/cdk-github-runners/blob/994f474deb38c904d94d2ad824b0493bac09d942/setup/src/App.svelte#L333C1-L334C49
You can bring up more runners to take over the stolen ones by creating a new step function execution. Open an existing one from the UI and hit the New Execution button.
I'm still not sure what you originally meant by the labels. Where would the label be added to resolve this?
I assume you can have this issue no matter whether it's an org-level or a repo-level runner. The probability is just higher on org level. Furthermore, as a user you might not see all repos in the org, and it's quite hard to find jobs that got stuck and are still waiting for a runner if you have a lot of repos in one org.
Here's a sample solution proposal:
- Runner in cdk solution created with labels: [label-for-instance-type]
- User creates a workflow with [self-hosted, label-for-instance-type, label-for-repository]
- Workflow gets triggered and webhook.lambda.ts is triggered with labels [self-hosted, label-for-instance-type, label-for-repository]. matchLabelsToProvider checks only whether the runner labels are included: [self-hosted, label-for-instance-type]
- The step function determines a runner based on the labels that match a runner: [self-hosted, label-for-instance-type]
- All labels [self-hosted, label-for-instance-type, label-for-repository] are injected into the runner and attached to the config.sh command.
- Runner is registered in github with labels [self-hosted, label-for-instance-type, label-for-repository]
This would mean that runner labels and workflow labels can be different and you can attach additional workflow labels.
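Roughly, the matching rule I have in mind could look like this. This is just a minimal sketch of the idea, not the actual matchLabelsToProvider code; the Provider shape and the matchProvider name are made up for illustration:

interface Provider {
  id: string;
  labels: string[]; // e.g. ['self-hosted', 'label-for-instance-type']
}

function matchProvider(providers: Provider[], jobLabels: string[]): Provider | undefined {
  const requested = new Set(jobLabels.map((l) => l.toLowerCase()));
  // a provider matches when all of its own labels appear in the job's labels;
  // extra job labels (e.g. a repo-specific one) are simply ignored for matching
  return providers.find((p) => p.labels.every((l) => requested.has(l.toLowerCase())));
}

// hypothetical usage: label-for-repository is ignored for matching,
// but all job labels would later be passed to config.sh --labels so GitHub
// only routes jobs that request the full set to this runner
const jobLabels = ['self-hosted', 'label-for-instance-type', 'label-for-repository'];
const provider = matchProvider(
  [{ id: 'large-runner', labels: ['self-hosted', 'label-for-instance-type'] }],
  jobLabels,
);
if (provider) {
  console.log(`./config.sh --labels ${jobLabels.join(',')} ...`); // placeholder for the real registration step
}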
I'll try to create a PR and test this to see if it could work. @kichik please let me know if you have concerns ...
That won't prevent other repositories from using label-for-repository and still stealing the runner. Our label situation is already not the simplest to understand. I think this will make it even harder and still won't solve runner stealing.
Well, it's not like people steal runners on purpose. And label-for-repository is just a sample tag.
You could use multiple tags and choose whatever you want. But it could happen that people just copy / paste those labels from another github action from another repository by mistake. It would still heavily reduce the probability that one job can steal another job's runner. And you can trace it in the logs as well.
btw if you know a better solution for this dilemma, please let me know :)
But it could happen that people just copy / paste those labels from another github action from another repository by mistake.
Exactly.
Have you considered creating multiple providers all using the same settings and image builder but with different labels? That feels like basically what you're asking for. Each runner will start with exactly the labels you want and for just the repositories you want.
btw if you know a better solution for this dilemma, please let me know :)
Repository level registration is the only thing I found. I do wish GitHub would let us aim specific runners at specific jobs, but they have no such option.
But it could happen that people just copy / paste those labels from another github action from another repository by mistake.
Exactly.
Have you considered creating multiple providers all using the same settings and image builder but with different labels? That feels like basically what you're asking for. Each runner will start with exactly the labels you want and for just the repositories you want.
No chance, there are simply too many repos. This is no fun even when you have more than 10 repos.
btw if you know a better solution for this dilemma, please let me know :)
Repository level registration is the only thing I found. I do wish GitHub would let us aim specific runners at specific jobs, but they have no such option.
What I struggle with is the administration permission required on repo level. A dedicated permission to manage runners on the repository level is still missing.
Another option:
Add a schedule feature to start additional runners on a regular basis by
- querying CloudWatch logs to get the organizations that used the runner solution in, e.g., the last 2 weeks and starting runners for them
- querying GitHub to get all jobs that have been queued longer than, e.g., 6 hours and starting runners for them
The idleTimeout setting will terminate those EC2 instances and deregister the runners by default if no job is picked up after 5 minutes, right?
@kichik please give me feedback; I could try to create a draft PR for this topic.
No chance, there are simply too many repos. This is no fun even when you have more than 10 repos.
Wouldn't you have the same problem with your proposed solution? You'd have to specify all those labels somewhere. Maybe your CDK code can automatically generate it using gh?
The schedule option is something I have considered often, for many reasons. I am still not sure exactly what I want it to look like or if it should even be a schedule and not some kind of monitoring system for stolen/missing runners.
You can already define a schedule manually, if you need to. I have done something similar for one of my installations.
import * as cdk from 'aws-cdk-lib';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as stepfunctions from 'aws-cdk-lib/aws-stepfunctions';

// grab the orchestrator state machine from the GithubRunners construct ("runners")
const sm = runners.node.findChild('Runner Orchestrator') as stepfunctions.StateMachine;
const rule = new events.Rule(this, 'Start Runner Rule', {
  schedule: events.Schedule.rate(cdk.Duration.hours(1)),
  targets: [
    new targets.SfnStateMachine(
      sm,
      {
        input: events.RuleTargetInput.fromObject({
          "owner": "my-org",
          "repo": "my-repo",
          "jobId": 0,
          "jobUrl": "manual-runner",
          "installationId": xxx, // TODO get from github secret
          "labels": {
            "self-hosted": true,
            "something": true, // TODO update
          },
          "provider": "stack-name/provider-id", // TODO update
        }),
      },
    ),
  ],
});
No chance, there are simply too many repos. This is no fun even when you have more than 10 repos.
Wouldn't you have the same problem with your proposed solution? You'd have to specify all those labels somewhere. Maybe your CDK code can automatically generate it using gh?
It doesn't happen all the time, right? And for most of the users it's quite OK how it works, even with org runners. But for some of them it's not acceptable when it's used for production deployments. And for those users it would be fine if they need to attach additional labels to make it more reliable.
The schedule option is something I have considered often, for many reasons. I am still not sure exactly what I want it to look like or if it should even be a schedule and not some kind of monitoring system for stolen/missing runners.
You can already define a schedule manually, if you need to. I have done something similar for one of my installations.
const sm = runners.node.findChild('Runner Orchestrator') as stepfunctions.StateMachine;
const rule = new events.Rule(this, 'Start Runner Rule', {
  schedule: events.Schedule.rate(cdk.Duration.hours(1)),
  targets: [
    new targets.SfnStateMachine(
      sm,
      {
        input: events.RuleTargetInput.fromObject({
          "owner": "my-org",
          "repo": "my-repo",
          "jobId": 0,
          "jobUrl": "manual-runner",
          "installationId": xxx, // TODO get from github secret
          "labels": {
            "self-hosted": true,
            "something": true, // TODO update
          },
          "provider": "stack-name/provider-id" // TODO update
        })
      }
    )
  ],
});
Hmm, that's a nice idea. But I can't do this statically in CDK code for all repos. I would get lost.
My plan: create a schedule to trigger a Lambda. The Lambda queries CloudWatch Logs Insights to get the owner, repo and labels aggregated over the last 2 weeks, goes through all rows found and triggers the state machine. I'll try it. I would still like to have this as part of the CDK construct :)
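A rough sketch of what that Lambda could look like. Heavily hedged: the Insights field names owner/repo/labels and the query are assumptions about what ends up in the state machine log group, not the construct's actual schema, and installationId/provider would still need to be resolved separately:

import { CloudWatchLogsClient, StartQueryCommand, GetQueryResultsCommand } from '@aws-sdk/client-cloudwatch-logs';
import { SFNClient, StartExecutionCommand } from '@aws-sdk/client-sfn';

const logs = new CloudWatchLogsClient({});
const sfn = new SFNClient({});

export async function handler(): Promise<void> {
  const end = Date.now();
  const start = end - 14 * 24 * 60 * 60 * 1000; // aggregate over the last 2 weeks

  // field names in this query are assumptions about the log contents
  const { queryId } = await logs.send(new StartQueryCommand({
    logGroupName: process.env.LOG_GROUP_NAME!,
    startTime: Math.floor(start / 1000),
    endTime: Math.floor(end / 1000),
    queryString: 'fields owner, repo, labels | filter ispresent(owner) | stats count(*) by owner, repo, labels',
  }));

  // poll until the Insights query finishes (simplified; no timeout handling)
  let results;
  do {
    await new Promise((resolve) => setTimeout(resolve, 2000));
    results = await logs.send(new GetQueryResultsCommand({ queryId: queryId! }));
  } while (results.status === 'Running' || results.status === 'Scheduled');

  for (const row of results.results ?? []) {
    const value = (field: string) => row.find((f) => f.field === field)?.value;
    await sfn.send(new StartExecutionCommand({
      stateMachineArn: process.env.STATE_MACHINE_ARN!,
      input: JSON.stringify({
        owner: value('owner'),
        repo: value('repo'),
        jobId: 0,
        jobUrl: 'scheduled-runner',
        labels: { 'self-hosted': true }, // TODO map from the logged labels
        // installationId and provider still need to be resolved, e.g. from the GitHub app installations
      }),
    }));
  }
}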
It doesn't happen all the time, right? And for most of the users it's quite OK how it works, even with org runners. But for some of them it's not acceptable when it's used for production deployments. And for those users it would be fine if they need to attach additional labels to make it more reliable.
You can create a provider just for those special cases then. Isn't that basically what you were asking for? What am I missing?
Hmm, that's a nice idea. But I can't do this statically in CDK code for all repos. I would get lost.
Since you're registering on organization level, the repo part doesn't really matter. You can leave it as a placeholder. No need for Lambda or anything.
Ok, I found a solution that I think is good enough to circumvent this problem:
I added a construct with a step function and a few Lambdas with the following workflow:
- On one side, I get all the information about the submitted jobs (repo, owner, job, labels) from the state machine logs
- On the other side, I query GitHub in a Lambda authenticated as the GitHub app to get all the organisation installations
- Based on the "owner"/"organisation" property I merge the information
- I check the jobs and their current state (has the job been queued for longer than 10 minutes?)
- I trigger new step function executions to get new runners with the specific labels registered in the right organisation.
I trigger this step function every half hour.
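For the "queued longer than 10 minutes" check, a simplified sketch. It assumes an Octokit client already authenticated as the GitHub app installation and uses queued workflow runs as a proxy for queued jobs; names and thresholds are illustrative:

import { Octokit } from '@octokit/rest';

// find workflow runs that have been sitting in "queued" for longer than the wait time
async function findStuckRuns(octokit: Octokit, owner: string, repo: string, waitMinutes = 10) {
  const { data } = await octokit.rest.actions.listWorkflowRunsForRepo({
    owner,
    repo,
    status: 'queued',
    per_page: 100,
  });

  const cutoff = Date.now() - waitMinutes * 60 * 1000;
  return data.workflow_runs.filter((run) => new Date(run.created_at).getTime() < cutoff);
}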
What would be great to get as output from the GithubRunners construct:
- the state machine ARN (used to trigger new step function executions)
- the log group name (used for the CloudWatch Logs Insights query)
Currently this is what I need to wire up the construct:
const stateMachine = githubRunners.node.children.find((child) => child instanceof stepfunctions.StateMachine)! as stepfunctions.StateMachine;
const logGroupName = `${this.stackName}-state-machine`;

new GithubScheduledRunnerHandler(this, 'github-scheduled-runner', {
  internalSubnetIds: props.internalSubnetIds,
  vpcId: props.vpcId,
  logGroupName,
  githubSecretId: githubRunners.secrets.github.secretName,
  githubPrivateKeySecretId: githubRunners.secrets.githubPrivateKey.secretName,
  mappingTable: providers.map((provider) => { return { "provider": provider.toString(), "labels": provider.labels }; }),
  stateMachineArn: stateMachine.stateMachineArn,
  githubRunnerWaitTimeInMinutes: 10,
});
Would you be interested in integrating this construct into your solution? Should I open a PR, or should I create a separate construct? I could share the code in a private repo in a simplified version (not a perfect construct) if you want more details. Please let me know @kichik.
I do want something like this one day, but I don't think we have all the details figured out yet. For example, when using ECS with a limited ASG, a job might be pending for >10 minutes just because the cluster is busy. If we keep submitting more jobs, it may only make the situation worse.
I do like that it doesn't go off just by the webhook. I have seen cases where the webhook was just not called and so runners were missing.
Please do share the code, if you can. I think I may have misunderstood the matching you were doing between jobs and the existing state machine. It might already cover the problem I mentioned.
@kichik hope you have seen my invitation - created a private repo for this.
@kichik hope you have seen my invitation - created a private repo for this.
I did. Thanks.
After more than a week of running the scheduler step function, I found a clear correlation between the webhook Lambda function error message ExecutionAlreadyExists and new runners being triggered by my scheduler step function. It only seems to happen in this case.
Detailed error message in CloudWatch:
__type: com.amazonaws.swf.service.v2.model#ExecutionAlreadyExists
$fault: client
$metadata.attempts: 1
$metadata.httpStatusCode: 400
$metadata.requestId: f1cad492-8906-4967-9c99-f920a010bf90
$metadata.totalRetryDelay: 0
@ingestionTime: 1704816064769
@log: ********:/aws/lambda/gh-runner-ghrunnergithubrunnerWebhookHandlerwebhoo....
@logStream: 2024/01/09/[$LATEST]e18f1646a851483d93a0490663a2cde4
@message: 2024-01-09T16:00:56.047Z f8bf04d7-561b-4d7a-bbd0-b66f23121b91 ERROR Invoke Error {"errorType":"ExecutionAlreadyExists","errorMessage":"Execution Already Exists: 'arn:aws:states:********:********:execution:ghrunnergithubrunnerRunnerOrchestrator********:******'-44b33b20-af08-'","name":"ExecutionAlreadyExists","$fault":"client","$metadata":{"httpStatusCode":400,"requestId":"f1cad492-8906-4967-9c99-f920a010bf90","attempts":1,"totalRetryDelay":0},"__type":"com.amazonaws.swf.service.v2.model#ExecutionAlreadyExists","stack":["ExecutionAlreadyExists: Execution Already Exists: 'arn:aws:states:********:*******:execution:ghrunnergithubrunnerRunnerOrchestrator********:*****'"," at de_ExecutionAlreadyExistsRes (/var/runtime/node_modules/@aws-sdk/client-sfn/dist-cjs/protocols/Aws_json1_0.js:1694:23)"," at de_StartExecutionCommandError (/var/runtime/node_modules/@aws-sdk/client-sfn/dist-cjs/protocols/Aws_json1_0.js:1321:25)"," at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"," at async /var/runtime/node_modules/@aws-sdk/middleware-serde/dist-cjs/deserializerMiddleware.js:7:24"," at async /var/runtime/node_modules/@aws-sdk/middleware-signing/dist-cjs/awsAuthMiddleware.js:14:20"," at async /var/runtime/node_modules/@aws-sdk/middleware-retry/dist-cjs/retryMiddleware.js:27:46"," at async /var/runtime/node_modules/@aws-sdk/middleware-logger/dist-cjs/loggerMiddleware.js:7:26"," at async Runtime.handler (/var/task/index.js:13997:21)"]}
@requestId: f8bf04d7-561b-4d7a-bbd0-b66f23121b91
@timestamp: 1704816056047
The name uses GitHub webhook delivery id to remain unique.
https://github.com/CloudSnorkel/cdk-github-runners/blob/06ffc53c8067027748a4447d3f3b248fabb94131/src/webhook-handler.lambda.ts#L167
Do you maybe have the same webhook hooked up more than once somehow?
The name uses GitHub webhook delivery id to remain unique.
https://github.com/CloudSnorkel/cdk-github-runners/blob/06ffc53c8067027748a4447d3f3b248fabb94131/src/webhook-handler.lambda.ts#L167
Do you maybe have the same webhook hooked up more than once somehow?
Ok, then I assume it might be the API Gateway/Lambda retry mechanism... will check it.
I assume that this really only happens when the API Gateway (my internal custom LambdaAccess implementation) does a retry...
Seems like this occurs for about 1% of the runner requests, and it's always the same correlation between "the step function ExecutionAlreadyExists message" <> a 502 Bad Gateway error <> a queued job waiting for a runner in GitHub.
Can you tell what the initial failure is? It makes sense for the second webhook retry to fail with "the step function ExecutionAlreadyExists message" and return 502. But why would the first one fail and require a retry? Is it a timeout maybe?
You're right, I should investigate this :) Will do... a timeout could be an option, yes.
@pharindoko your error log shows a truncated delivery GUID in the execution id:
'arn:aws:states:********:********:execution:ghrunnergithubrunnerRunnerOrchestrator********:******'-44b33b20-af08-'"
Am I right to say that your repo's full name (<org>/<repo>) is pretty long?
We have a repo-level runner setup and are experiencing similar issues, like you described:
Seems like this occurs for about 1% of the runner requests, and it's always the same correlation between "the step function ExecutionAlreadyExists message" <> a 502 Bad Gateway error <> a queued job waiting for a runner in GitHub.
I caught a specific instance today and was able to track it down to 2 webhook deliveries from the same repo where the GUIDs only differ after the first X characters:
{
"id": 153745296501,
"guid": "089a2a60-4fc8-11f0-8181-e546625a3c66",
"delivered_at": "2025-06-23T00:21:41Z",
"redelivery": false,
"duration": 0.73,
"status": "502 Bad Gateway",
}
{
"id": 153745296607,
"guid": "089a2a60-4fc8-11f0-8f1c-58fc4d119409",
"delivered_at": "2025-06-23T00:21:42Z",
"redelivery": false,
"duration": 0.71,
"status": "OK",
}
And the error in the webhook lambda function was
Execution Already Exists: 'arn:aws:states:***:***:execution:<step-fn>:<myorg>-<my repo>-089a2a60-4fc8-11f0-8'
Note that the GUID got truncated like in your log, and it so happens in my case that the first 20 characters in the 2 GUIDs are identical, and one of them failed :(
The truncation seems to be needed as the execution name is limited to 64 characters:
https://github.com/CloudSnorkel/cdk-github-runners/blob/d27b5a12f7ee193dddfe03aaaf7b835a92b122e0/src/webhook-handler.lambda.ts#L179-L181
So the longer the repo name, the higher the chance of such a clash.
@kichik I think the executionName construction should be updated to preserve the entire GUID, to avoid this problem of orphaned jobs due to truncated GUID clashes, e.g. something like
// set execution name which is also used as runner name which are limited to 64 characters
const deliveryId = getHeader(event, 'x-github-delivery') ?? `${Math.random()}`;
const repoNameTruncated = payload.repository.full_name.replace('/', '-').slice(0, 64 - deliveryId.length - 1);
const executionName = `${repoNameTruncated}-${deliveryId}`;
and optionally implement the following to avoid truncating the repo name too much:
- use payload.repository.name instead of payload.repository.full_name, assuming usage is likely limited to a single org
- strip the - characters from the delivery GUID, which should save 4 characters
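Just to illustrate the numbers under that scheme (the values below are made up):

// illustrative values only: with the full 36-character delivery GUID preserved,
// the repo portion gets 64 - 36 - 1 = 27 characters and two different delivery
// GUIDs can no longer collide after truncation
const deliveryId = '089a2a60-4fc8-11f0-8181-e546625a3c66'; // 36 chars
const fullName = 'some-org/some-rather-long-repository-name';
const repoNameTruncated = fullName.replace('/', '-').slice(0, 64 - deliveryId.length - 1);
const executionName = `${repoNameTruncated}-${deliveryId}`;
console.log(executionName.length); // 64 at most, with the GUID kept intact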
Nice debugging!
@kichik I think the executionName construction should be updated to preserve the entire GUID, to avoid this problem of orphaned jobs due to truncated GUID clashes, e.g. something like
// set execution name which is also used as runner name which are limited to 64 characters
const deliveryId = getHeader(event, 'x-github-delivery') ?? `${Math.random()}`;
const repoNameTruncated = payload.repository.full_name.replace('/', '-').slice(0, 64 - deliveryId.length - 1);
const executionName = `${repoNameTruncated}-${deliveryId}`;
and optionally implement the following to avoid truncating the repo name too much:
- use payload.repository.name instead of payload.repository.full_name, assuming usage is likely limited to a single org
- strip the - characters from the delivery GUID, which should save 4 characters
I use it for multiple orgs so I need the full name...
I use it for multiple orgs so I need the full name...
Can you please clarify the need? Do you often have to jump in and debug these such that the full name is required? It's also available in the execution parameters, I believe. Would that be enough for your use case?
@pharindoko just to clarify - as long as the entire GUID is preserved, it would work for multi-org setup.
Including the org name (and even the repo name) is only there to allow admin to quickly search recent step function executions for a specific repo, in case there are any issues.
Do you have repos with the same name across multiple orgs? e.g. org1/some-repo and org2/some-repo
That would be the scenario that is a bit harder to diagnose should the org name not get included (still possible, as kichik mentioned the full information is in the parameters).