node-problem-detector ci-npd-build is flaky

ci-npd-build has been flaky since 1/8/2025, with a pass rate of only 40%.

#26 [linux/arm64 builder 5/5] RUN GOARCH=arm64 make bin/node-problem-detector bin/health-checker bin/log-counter
#26 289.5 runtime/cgo: aarch64-linux-gnu-gcc: signal: segmentation fault (core dumped)
#26 897.3 make: *** [Makefile:200: bin/node-problem-detector] Error 1
#26 ERROR: process "/dev/.buildkit_qemu_emulator /bin/sh -c GOARCH=${TARGETARCH} make bin/node-problem-detector bin/health-checker bin/log-counter" did not complete successfully: exit code: 2
------
 > [linux/arm64 builder 5/5] RUN GOARCH=arm64 make bin/node-problem-detector bin/health-checker bin/log-counter:
31.00 CGO_ENABLED=1 GOOS=linux GOARCH=arm64 CC=aarch64-linux-gnu-gcc go build \
31.00 	-o bin/node-problem-detector \
31.00 	-ldflags '-X k8s.io/node-problem-detector/pkg/version.version=v0.8.20-41-g12a8f55' \
31.00 	-tags "journald " \
31.00 	./cmd/nodeproblemdetector
289.5 runtime/cgo: aarch64-linux-gnu-gcc: signal: segmentation fault (core dumped)
897.3 make: *** [Makefile:200: bin/node-problem-detector] Error 1
------
WARNING: No output specified with docker-container driver. Build result will only remain in the build cache. To push result image into registry use --push or to load image into docker use --load
Dockerfile:36
--------------------
  34 |     COPY . /gopath/src/k8s.io/node-problem-detector/
  35 |     WORKDIR /gopath/src/k8s.io/node-problem-detector
  36 | >>> RUN GOARCH=${TARGETARCH} make bin/node-problem-detector bin/health-checker bin/log-counter
  37 |     
  38 |     FROM --platform=${TARGETPLATFORM} registry.k8s.io/build-image/debian-base:bookworm-v1.0.4@sha256:0a17678966f63e82e9c5e246d9e654836a33e13650a698adefede61bb5ca099e as base
--------------------
ERROR: failed to solve: process "/dev/.buildkit_qemu_emulator /bin/sh -c GOARCH=${TARGETARCH} make bin/node-problem-detector bin/health-checker bin/log-counter" did not complete successfully: exit code: 2
make: *** [Makefile:245: build-container] Error 1
+ EXIT_VALUE=2
+ set +o xtrace
Cleaning up after docker in docker.

Jan 20 '25 05:01 wangzhen127

The latest test infra change was on 12/30/2024: https://github.com/kubernetes/test-infra/commits/master/config/jobs/kubernetes/node-problem-detector/node-problem-detector-ci.yaml

The latest NPD change was on 1/7/2025: https://github.com/kubernetes/node-problem-detector/commits/master/

The flakiness does not appear to be related to either change.

Jan 20 '25 05:01 wangzhen127

@BenTheElder Do you know if anything could cause this?

Jan 20 '25 05:01 wangzhen127

cc @DigitalVeer

Jan 24 '25 20:01 wangzhen127

It looks pretty clear that the compiler is segfaulting under emulation: the failure occurs when building for arm64, which runs under qemu in buildkit.

It could be the version of qemu in the image.

FWIW, I highly recommend not compiling under emulation, for both performance and reliability reasons. Instead, you can cross-compile on the host architecture for the target architecture, then copy that output into an image for the target architecture using a multi-stage build.

See for example: https://github.com/kubernetes-sigs/kind/blob/78cdad26107b27f0b0bc5ad5a878ef41ecab2705/images/local-path-provisioner/Dockerfile#L16-L22 (note the use of $TARGETARCH and $BUILDPLATFORM for the build step versus the final step).

See also: https://www.docker.com/blog/faster-multi-platform-builds-dockerfile-cross-compilation-guide/
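
A minimal sketch of that cross-compilation pattern, applied to the failing build step, might look like the following. This is not the project's actual Dockerfile: the golang image tag, the apt package, and the final COPY destination are assumptions for illustration. The build log suggests the Makefile already sets CC=aarch64-linux-gnu-gcc when GOARCH=arm64, so the sketch relies on that.

```dockerfile
# Sketch only: image tags, package names, and the COPY destination are assumptions.

# Build stage: pinned to the build host's native platform, so the Go and C
# compilers never run under qemu, even when the target is arm64.
FROM --platform=${BUILDPLATFORM} golang:1.23-bookworm AS builder

# BuildKit sets TARGETARCH automatically (for example amd64 or arm64).
ARG TARGETARCH

# Cross C compiler for cgo when targeting arm64 from a non-arm64 host.
# Any target-architecture C headers that the cgo build needs are omitted here.
RUN apt-get update && \
    apt-get install -y --no-install-recommends gcc-aarch64-linux-gnu && \
    rm -rf /var/lib/apt/lists/*

COPY . /gopath/src/k8s.io/node-problem-detector/
WORKDIR /gopath/src/k8s.io/node-problem-detector

# Same make targets as the failing step, now executed natively.
RUN GOARCH=${TARGETARCH} make bin/node-problem-detector bin/health-checker bin/log-counter

# Final stage: an image for the target platform that only receives the
# already-built binaries, so no compilation happens under emulation.
FROM --platform=${TARGETPLATFORM} registry.k8s.io/build-image/debian-base:bookworm-v1.0.4 AS base
COPY --from=builder /gopath/src/k8s.io/node-problem-detector/bin/ /
```

The key difference from the failing build is the `--platform=${BUILDPLATFORM}` on the builder stage; only the final stage uses `${TARGETPLATFORM}`.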

Jan 24 '25 21:01 BenTheElder

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Apr 24 '25 22:04 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

May 24 '25 22:05 k8s-triage-robot

/remove-lifecycle rotten

Jun 02 '25 21:06 wangzhen127

/lifecycle frozen

Aug 13 '25 06:08 hakman

some experiments: https://github.com/kubernetes/node-problem-detector/pull/1103

Aug 14 '25 00:08 SergeyKanzhelev

The flakiness has been greatly reduced by the recent changes. I think we can close this for now.

Sep 28 '25 07:09 hakman
