node-problem-detector
ci-npd-build is flaky
ci-npd-build has been flaky since 1/8/2025, with only a 40% pass rate.
#26 [linux/arm64 builder 5/5] RUN GOARCH=arm64 make bin/node-problem-detector bin/health-checker bin/log-counter
#26 289.5 runtime/cgo: aarch64-linux-gnu-gcc: signal: segmentation fault (core dumped)
#26 897.3 make: *** [Makefile:200: bin/node-problem-detector] Error 1
#26 ERROR: process "/dev/.buildkit_qemu_emulator /bin/sh -c GOARCH=${TARGETARCH} make bin/node-problem-detector bin/health-checker bin/log-counter" did not complete successfully: exit code: 2
------
> [linux/arm64 builder 5/5] RUN GOARCH=arm64 make bin/node-problem-detector bin/health-checker bin/log-counter:
31.00 CGO_ENABLED=1 GOOS=linux GOARCH=arm64 CC=aarch64-linux-gnu-gcc go build \
31.00 -o bin/node-problem-detector \
31.00 -ldflags '-X k8s.io/node-problem-detector/pkg/version.version=v0.8.20-41-g12a8f55' \
31.00 -tags "journald " \
31.00 ./cmd/nodeproblemdetector
289.5 runtime/cgo: aarch64-linux-gnu-gcc: signal: segmentation fault (core dumped)
897.3 make: *** [Makefile:200: bin/node-problem-detector] Error 1
------
WARNING: No output specified with docker-container driver. Build result will only remain in the build cache. To push result image into registry use --push or to load image into docker use --load
Dockerfile:36
--------------------
34 | COPY . /gopath/src/k8s.io/node-problem-detector/
35 | WORKDIR /gopath/src/k8s.io/node-problem-detector
36 | >>> RUN GOARCH=${TARGETARCH} make bin/node-problem-detector bin/health-checker bin/log-counter
37 |
38 | FROM --platform=${TARGETPLATFORM} registry.k8s.io/build-image/debian-base:bookworm-v1.0.4@sha256:0a17678966f63e82e9c5e246d9e654836a33e13650a698adefede61bb5ca099e as base
--------------------
ERROR: failed to solve: process "/dev/.buildkit_qemu_emulator /bin/sh -c GOARCH=${TARGETARCH} make bin/node-problem-detector bin/health-checker bin/log-counter" did not complete successfully: exit code: 2
make: *** [Makefile:245: build-container] Error 1
+ EXIT_VALUE=2
+ set +o xtrace
Cleaning up after docker in docker.
The latest test-infra change was on 12/30/2024: https://github.com/kubernetes/test-infra/commits/master/config/jobs/kubernetes/node-problem-detector/node-problem-detector-ci.yaml
The latest NPD change was on 1/7/2025: https://github.com/kubernetes/node-problem-detector/commits/master/
It does not look related to any particular change.
@BenTheElder Do you know if anything could cause this?
cc @DigitalVeer
looks pretty clear that the compiler is segfaulting under emulation (it fails when building arm64, which is running under qemu in buildkit)
it could be the version of qemu in the image
FWIW I highly recommend not compiling under emulation, for both performance and reliability reasons. Instead, you can cross-compile on the host architecture for the target architecture, then copy that output into an image for the target architecture using a multi-stage build.
See for example: https://github.com/kubernetes-sigs/kind/blob/78cdad26107b27f0b0bc5ad5a878ef41ecab2705/images/local-path-provisioner/Dockerfile#L16-L22 (NOTE the use of $TARGETARCH and $BUILDPLATFORM for the build stage versus the final stage) and https://www.docker.com/blog/faster-multi-platform-builds-dockerfile-cross-compilation-guide/
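To make that concrete, here is a minimal sketch of the multi-stage cross-compile pattern. It assumes a plain `go build` with cgo disabled and an arbitrary Go base image; the kind Dockerfile linked above is the real-world reference, and NPD's actual journald/cgo build would additionally need a cross C toolchain installed in the build stage.

```dockerfile
# Build stage runs natively on the builder's platform (${BUILDPLATFORM}),
# so the Go toolchain is not emulated; it cross-compiles to ${TARGETARCH}.
FROM --platform=${BUILDPLATFORM} golang:1.23 AS builder
ARG TARGETARCH
WORKDIR /src
COPY . .
# CGO is disabled here to keep the sketch simple; the real NPD build uses
# cgo for journald support, which would also need a cross C compiler
# (e.g. aarch64-linux-gnu-gcc) available in this stage.
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} \
    go build -o /out/node-problem-detector ./cmd/nodeproblemdetector

# Final stage targets ${TARGETPLATFORM}; it only copies the prebuilt binary,
# so nothing is compiled under qemu.
FROM --platform=${TARGETPLATFORM} registry.k8s.io/build-image/debian-base:bookworm-v1.0.4
COPY --from=builder /out/node-problem-detector /node-problem-detector
ENTRYPOINT ["/node-problem-detector"]
```

The key point is that the heavy `go build` runs under `--platform=${BUILDPLATFORM}` (natively on the CI host), and only the trivial COPY/ENTRYPOINT layers are built for the emulated target platform.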
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle frozen
some experiments: https://github.com/kubernetes/node-problem-detector/pull/1103
The flakiness has been greatly reduced by the recent changes. I think we can close this for now.