Proposal: GitHub self-hosted runners for specific use cases
The OpenTelemetry Go SIG has benchmark tests that we currently run manually when needed, but would like to run automatically. We used to run those benchmark tests with GitHub Actions, but a job running on GitHub's infrastructure often has noisy neighbours, making the results unreliable. So we disabled them.
After discussing with @TylerHelmuth, he and I got access to the CNCF's Equinix account, which would allow us to boot dedicated virtual machines. Our wish would be to run our benchmark tests on self-hosted runners.
The easiest way to set up those runners would be to do it manually for the repository. This seems brittle though, and prone to mistakes. We could also set up Terraform (or any other automation system) to auto-provision the runners. AFAIK, there is no such automation within the otel organization at the moment though.
It also seems like setting up self-hosted runners just for the Go repository is a bit overkill, as we would be under-using them. Other SIGs may have similar needs though, so having them at the org level may be better?
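For illustration, once a runner is registered (at either repo or org level), routing a workflow to it is mostly a matter of labels. A minimal sketch; the `benchmark` label, job name, and schedule are hypothetical, not an agreed convention:

```yaml
# Hypothetical sketch: run benchmarks only on a dedicated self-hosted machine.
name: benchmarks
on:
  schedule:
    - cron: "0 3 * * *"   # nightly; hypothetical schedule
jobs:
  bench:
    # "self-hosted" plus custom labels (here the made-up "benchmark")
    # routes the job to the dedicated runner instead of GitHub-hosted VMs.
    runs-on: [self-hosted, linux, benchmark]
    steps:
      - uses: actions/checkout@v4
      - run: go test -run=^$ -bench=. ./...
```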
Hence this issue, to gather feedback and opinions about this need.
cc @jmacd
The technical committee discussed the risks and benefits of adopting self-hosted runners. Here are some of the points discussed:
- If possible, benchmarking using CPU performance counters can overcome the problem of noise due to shared-resource contention. (Note that this is not currently built into the Go benchmarking suite used by OTel-Go.)
- There is a definite use-case for self-hosted runners that involves correctness testing, particularly for resource detectors that need to be run in actual Cloud-vendor environments (e.g., testing AWS EC2 resource detection on an AWS EC2 machine).
- The committee has concerns about security (e.g., misuse of the resource)
Our overall recommendation is to see whether we can avoid the additional configuration and management associated with self-hosted runners. If there is sufficient interest in correctness integration tests, that may be more compelling. Perhaps the OTel-Go team can find a workaround, or perhaps the engineering time would be better spent manually reviewing benchmark results prior to releases.
By the way: cirun.io does the same, without adding maintenance burden.
There are multiple other services that provide value for this: buildjet.com is another example.
Wanted to resurrect this proposal: it'd definitely help with velocity and stability if we had consistent build servers for our projects.
I've opened a CNCF ticket to ask just in case they already have some ARM64 runners available at the CNCF GitHub Enterprise level that we can use.
The CNCF pointed to setting up self-hosted ARM64 runners using https://github.com/cncf/cluster (which, I realized afterwards, @dmathieu mentioned above when opening this issue).
Pulling down the TC recommendation from above (https://github.com/open-telemetry/community/issues/1162#issuecomment-1254299054):
> Our overall recommendation is to see whether we can avoid the additional configuration and management associated with self-hosted runners. If there is sufficient interest in correctness integration tests, that may be more compelling. Perhaps the OTel-Go team can find a workaround, or perhaps the engineering time would be better spent manually reviewing benchmark results prior to releases.
@bobstrecansky is there something in the PHP repo that needs special care around ARM64 testing? Have you seen issues with things not working or breaking on ARM64, or is the desire for automated ARM64 testing more out of an "abundance of caution"?
@trask more of the latter: ARM support for our testing matrix would be a welcome addition. I'm sure other SIGs would probably like to have that as well.
It would also be great to have this for .NET AutoInstrumentation. See: https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/issues/1865
@trask - also - I don't think the GitHub runners are open source? That may be a loose requirement: https://github.com/cncf/cluster#usage-guidelines
Watching and hoping... https://github.com/actions/runner-images/issues/5631
We've made some progress and now have a runner that can be used for benchmarks: https://github.com/open-telemetry/community/issues/1662
(Note: Permission must be granted for each workflow individually to avoid abuse.)
FYI we now have access to Arm GitHub runners, see https://github.com/open-telemetry/community/issues/1821, and you can open a repo maintenance request to get access
@dmathieu will that resolve this issue?
Thank you @trask. Yes, this should be what we need. We'll be looking into it. In the meantime, I do believe this issue can be closed.