Proposal: GitHub self-hosted runners for specific use cases

Open dmathieu opened this issue 2 years ago • 1 comments

The OpenTelemetry Go SIG has benchmark tests that we currently run manually when needed, but would like to run automatically. We used to run those benchmark tests using GH Actions. But an action running on GitHub's infrastructure often has noisy neighbours, making the tests unreliable. So we disabled them.
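To make "unreliable" concrete, here is a minimal Go sketch of one way the noise in repeated benchmark timings could be quantified. It is not the OTel-Go SIG's tooling: the helper names `coefficientOfVariation` and `isNoisy`, the sample timings, and the 5% threshold are all hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// coefficientOfVariation returns the sample standard deviation divided by
// the mean. It returns 0 when there are fewer than two samples or when the
// mean is zero, because the ratio is not meaningful in those cases.
func coefficientOfVariation(samples []float64) float64 {
	if len(samples) < 2 {
		return 0
	}
	n := float64(len(samples))
	var sum float64
	for _, s := range samples {
		sum += s
	}
	mean := sum / n
	if mean == 0 {
		return 0
	}
	var squaredDiffs float64
	for _, s := range samples {
		d := s - mean
		squaredDiffs += d * d
	}
	return math.Sqrt(squaredDiffs/(n-1)) / mean
}

// isNoisy reports whether the relative spread of the samples exceeds the
// given threshold (hypothetical policy: 0.05 means 5% of the mean).
func isNoisy(samples []float64, threshold float64) bool {
	return coefficientOfVariation(samples) > threshold
}

func main() {
	// Made-up ns/op values for the same benchmark run five times.
	dedicated := []float64{1000, 1002, 998, 1001, 999}
	shared := []float64{1000, 1400, 950, 1800, 1020}

	fmt.Printf("dedicated: cv=%.4f noisy=%t\n",
		coefficientOfVariation(dedicated), isNoisy(dedicated, 0.05))
	fmt.Printf("shared:    cv=%.4f noisy=%t\n",
		coefficientOfVariation(shared), isNoisy(shared, 0.05))
}
```

When the spread between runs of the same code is larger than the regressions one hopes to detect, comparing results across commits stops being meaningful, which is the situation described above.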

After discussing with @TylerHelmuth, he and I got access to the CNCF's Equinix account, which would allow us to boot dedicated virtual machines. Our wish would be to run our benchmark tests in self-hosted runners.

The easiest way to set up those runners would be to do it manually for the repository. This seems brittle though, and prone to mistakes. We could also set up Terraform (or any other automation system) to auto-provision the runners. AFAIK, there is no such automation within the otel organization at the moment though.

It also seems like setting up self-hosted runners just for the Go repository is a bit overkill, as we would be under-using them. Other SIGs may have similar needs though. So having them at org-level may be better?

So here comes this issue to gather feedback and opinions about this need.

dmathieu avatar Sep 09 '22 08:09 dmathieu

cc @jmacd

dmathieu avatar Sep 09 '22 08:09 dmathieu

The technical committee discussed the risks and benefits of adopting self-hosted runners. Here are some of the points discussed:

  • If possible, benchmarking using CPU performance counters can overcome the problem of noise due to shared-resource contention. (Note that this is not currently built into the Go benchmark suite, so it is not readily available to help OTel-Go.)
  • There is a definite use-case for self-hosted runners that involves correctness testing, particularly for resource detectors that need to be run in actual Cloud-vendor environments (e.g., testing AWS EC2 resource detection on an AWS EC2 machine).
  • The committee worries about security (e.g., misuse of the resource).

Our overall recommendation is to see if we can avoid the additional configuration and management associated with self-hosted runners. If there's sufficient interest in correctness integration tests, that may be more compelling. Perhaps the OTel-Go team can find a workaround? Perhaps the engineering time would be better spent manually reviewing benchmark results prior to releases?

jmacd avatar Sep 21 '22 22:09 jmacd

By the way: cirun.io does the same, without adding maintenance burden.

aktech avatar Oct 10 '22 23:10 aktech

There are multiple other services that provide value for this: buildjet.com is another example.

I wanted to resurrect this proposal. It would definitely help with velocity and stability if we had consistent build servers for our projects.

bobstrecansky avatar Feb 28 '23 13:02 bobstrecansky

I've opened a CNCF ticket to ask just in case they already have some ARM64 runners available at the CNCF GitHub Enterprise level that we can use.

trask avatar Mar 01 '23 19:03 trask

The CNCF pointed to setting up self-hosted ARM64 runners using https://github.com/cncf/cluster (which, I realized afterwards, @dmathieu mentioned above when opening this issue).

Pulling down the TC recommendation from above (https://github.com/open-telemetry/community/issues/1162#issuecomment-1254299054):

Our overall recommendation is to see if we can avoid the additional configuration and management associated with self-hosted runners. If there's sufficient interest in correctness integration tests, that may be more compelling. Perhaps the OTel-Go team can find a workaround? Perhaps the engineering time would be better spent manually reviewing benchmark results prior to releases?

@bobstrecansky is there something in the PHP repo that needs special care around ARM64 testing? Have you seen issues with things not working or breaking on ARM64? Or is the desire for automated ARM64 testing more out of an "abundance of caution"?

trask avatar Mar 01 '23 23:03 trask

@trask more of the latter - ARM support for our testing matrix would be a welcome addition. I'm sure other SIGs would probably like to have that as well.

bobstrecansky avatar Mar 01 '23 23:03 bobstrecansky

It would also be great to have this for .NET AutoInstrumentation. See: https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/issues/1865

Kielek avatar Mar 02 '23 05:03 Kielek

@trask also, I don't think the GitHub runners are open source? That may be a loose requirement: https://github.com/cncf/cluster#usage-guidelines

bobstrecansky avatar Mar 09 '23 17:03 bobstrecansky

Watching and hoping... https://github.com/actions/runner-images/issues/5631

trask avatar Jul 06 '23 00:07 trask

We've made some progress and now have a runner that can be used for benchmarks: https://github.com/open-telemetry/community/issues/1662

(Note: Permission must be granted for each workflow individually to avoid abuse.)

tylerbenson avatar Sep 27 '23 18:09 tylerbenson

FYI we now have access to Arm GitHub runners, see https://github.com/open-telemetry/community/issues/1821, and you can open a repo maintenance request to get access

@dmathieu will that resolve this issue?

trask avatar Feb 06 '24 04:02 trask

Thank you @trask. Yes, this should be what we need. We'll be looking into it. In the meantime, I do believe this issue can be closed.

dmathieu avatar Feb 07 '24 08:02 dmathieu