[Bug]: Building wheels failing on self-hosted actions, working on GitHub's runners

Open damccorm opened this issue 11 months ago • 8 comments

What happened?

Starting Dec 10, Python wheel builds began failing with a number of segfaults, most likely due to something in the underlying hardware or image. I tried switching to GitHub-hosted runners and that seems to work. We can use that as a workaround, but we should understand the problem and switch back to self-hosted runners to avoid being blocked on GitHub quota.

Example failure - https://github.com/apache/beam/actions/runs/12625457564

This also impacts some other workflows, which I will switch over to GitHub-hosted runners as well; we should similarly switch those back.
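For reference, the temporary mitigation in each of these PRs is essentially a one-line runs-on change. A minimal sketch, assuming a job like the wheel build (the self-hosted labels shown are illustrative, not the exact ones used by the Beam workflows):

```yaml
jobs:
  build_wheels:
    # Before: Beam's self-hosted runner pool (labels illustrative)
    # runs-on: [self-hosted, ubuntu-20.04, main]
    # After: GitHub-hosted runner as a temporary workaround for the segfaults
    runs-on: ubuntu-latest
```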

Workflows where we've seen this issue along with the PR used to temporarily mitigate:

  • [ ] .github/workflows/build_wheels.yml - https://github.com/apache/beam/pull/33505
  • [ ] .github/workflows/republish_released_docker_containers.yml - https://github.com/apache/beam/pull/33507
  • [ ] .github/workflows/beam_Publish_Beam_SDK_Snapshots.yml - https://github.com/apache/beam/pull/33563
  • [ ] .github/workflows/beam_PostCommit_Python_Arm.yml - https://github.com/apache/beam/pull/33564

We should figure out what is causing the problem and then revert all of these PRs.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • [X] Component: Python SDK
  • [ ] Component: Java SDK
  • [ ] Component: Go SDK
  • [ ] Component: Typescript SDK
  • [ ] Component: IO connector
  • [ ] Component: Beam YAML
  • [ ] Component: Beam examples
  • [ ] Component: Beam playground
  • [ ] Component: Beam katas
  • [ ] Component: Website
  • [X] Component: Infrastructure
  • [ ] Component: Spark Runner
  • [ ] Component: Flink Runner
  • [ ] Component: Samza Runner
  • [ ] Component: Twister2 Runner
  • [ ] Component: Hazelcast Jet Runner
  • [ ] Component: Google Cloud Dataflow Runner

damccorm avatar Jan 06 '25 19:01 damccorm

@claudevdm did some good investigation here. Claude, would you mind adding any investigation you've done / things you've tried?

damccorm avatar Jan 06 '25 19:01 damccorm

Sure, here is what I found

  • The workflow started failing consistently on Dec 10 (screenshot attached)
  • The only difference I could find since the failures started is a runner/host kernel version change from 6.1.100+ to 6.1.112+
  • The date of the kernel change lines up with a release on Dec 10 (screenshot attached)
  • The ARM workflows cross-compile using QEMU emulation
  • cibuildwheel pulls quay.io/pypa/manylinux2014_aarch64, which uses GCC 10 to compile the Cython code
  • It seems there is some incompatibility between the new kernel release and cross-compiling with GCC 10 (see the sketch below)
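A quick way to confirm which kernel/QEMU/GCC combination a given runner ends up with is a debug step along these lines (a sketch only; the image tag and action version are defaults, not necessarily what build_wheels.yml pins):

```yaml
- name: Set up QEMU
  uses: docker/setup-qemu-action@v3
- name: Dump host and emulated toolchain info
  run: |
    # Host kernel on the runner (e.g. 6.1.100+ before Dec 10 vs 6.1.112+ after)
    uname -r
    # Same checks inside the aarch64 manylinux image that cibuildwheel pulls,
    # executed through QEMU user-mode emulation
    docker run --rm --platform linux/arm64 quay.io/pypa/manylinux2014_aarch64 uname -m
    docker run --rm --platform linux/arm64 quay.io/pypa/manylinux2014_aarch64 gcc --version
```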

claudevdm avatar Jan 07 '25 14:01 claudevdm

FYI @Amar3tto @mrshakirov @akashorabek - this should fix a few flaky workloads, but it is just a patch. It would be great to dig in on this one and try to figure out what in our self-hosted runner infrastructure is causing the flakes so that we can move these workloads back to our self-hosted runners.

damccorm avatar Jan 10 '25 15:01 damccorm

I tried updating the Docker Engine and Buildx to the latest available versions. I also explicitly specified the latest manylinux version with CIBW_MANYLINUX_AARCH64_IMAGE: "manylinux2014_aarch64:2025.02.02-1", since by default a cached (not the latest) image was being pulled. However, the issue still persisted.

I also noticed that a new version of QEMU was released on December 10 (the same day the workflow started failing). I thought that might be the cause, but it was only added to setup-qemu-action about a week ago, so it's unlikely to be related.

Interestingly, this issue also appeared in GitHub-hosted runners after January 23, but only for Ubuntu 20.04 and 24.04. It might be worth trying to update Ubuntu to 22.04 on our self-hosted runners.
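For context, the image pin above is just an environment variable on the cibuildwheel invocation. A minimal sketch, assuming the wheel build uses the cibuildwheel action (the step layout and version tag are illustrative; the real build_wheels.yml may invoke cibuildwheel differently):

```yaml
- name: Build aarch64 wheels
  uses: pypa/cibuildwheel@v2.22.0   # version tag illustrative
  env:
    CIBW_ARCHS_LINUX: aarch64
    # Pin the manylinux image explicitly instead of relying on a cached default
    CIBW_MANYLINUX_AARCH64_IMAGE: "manylinux2014_aarch64:2025.02.02-1"
```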

akashorabek avatar Feb 04 '25 05:02 akashorabek

The issue still persisted after updating Ubuntu to 22.04 on the self-hosted runners.

akashorabek avatar Feb 10 '25 18:02 akashorabek

This issue has been marked as stale due to 150 days of inactivity. It will be closed in 30 days if no further activity occurs. If you think that’s incorrect or this issue still needs to be addressed, please simply write any comment. If closed, you can reopen the issue at any time. Thank you for your contributions.

github-actions[bot] avatar Jul 11 '25 12:07 github-actions[bot]

It would be great to figure this one out eventually, commenting to keep it from being stale

damccorm avatar Jul 11 '25 18:07 damccorm

This issue has been marked as stale due to 150 days of inactivity. It will be closed in 30 days if no further activity occurs. If you think that’s incorrect or this issue still needs to be addressed, please simply write any comment. If closed, you can reopen the issue at any time. Thank you for your contributions.

github-actions[bot] avatar Dec 09 '25 12:12 github-actions[bot]