
ci-npd-build is flaky

Open wangzhen127 opened this issue 10 months ago • 4 comments

ci-npd-build has been flaky since 1/8/2025, with a pass rate of only 40%.

#26 [linux/arm64 builder 5/5] RUN GOARCH=arm64 make bin/node-problem-detector bin/health-checker bin/log-counter
#26 289.5 runtime/cgo: aarch64-linux-gnu-gcc: signal: segmentation fault (core dumped)
#26 897.3 make: *** [Makefile:200: bin/node-problem-detector] Error 1
#26 ERROR: process "/dev/.buildkit_qemu_emulator /bin/sh -c GOARCH=${TARGETARCH} make bin/node-problem-detector bin/health-checker bin/log-counter" did not complete successfully: exit code: 2
------
 > [linux/arm64 builder 5/5] RUN GOARCH=arm64 make bin/node-problem-detector bin/health-checker bin/log-counter:
31.00 CGO_ENABLED=1 GOOS=linux GOARCH=arm64 CC=aarch64-linux-gnu-gcc go build \
31.00 	-o bin/node-problem-detector \
31.00 	-ldflags '-X k8s.io/node-problem-detector/pkg/version.version=v0.8.20-41-g12a8f55' \
31.00 	-tags "journald " \
31.00 	./cmd/nodeproblemdetector
289.5 runtime/cgo: aarch64-linux-gnu-gcc: signal: segmentation fault (core dumped)
897.3 make: *** [Makefile:200: bin/node-problem-detector] Error 1
------
WARNING: No output specified with docker-container driver. Build result will only remain in the build cache. To push result image into registry use --push or to load image into docker use --load
Dockerfile:36
--------------------
  34 |     COPY . /gopath/src/k8s.io/node-problem-detector/
  35 |     WORKDIR /gopath/src/k8s.io/node-problem-detector
  36 | >>> RUN GOARCH=${TARGETARCH} make bin/node-problem-detector bin/health-checker bin/log-counter
  37 |     
  38 |     FROM --platform=${TARGETPLATFORM} registry.k8s.io/build-image/debian-base:bookworm-v1.0.4@sha256:0a17678966f63e82e9c5e246d9e654836a33e13650a698adefede61bb5ca099e as base
--------------------
ERROR: failed to solve: process "/dev/.buildkit_qemu_emulator /bin/sh -c GOARCH=${TARGETARCH} make bin/node-problem-detector bin/health-checker bin/log-counter" did not complete successfully: exit code: 2
make: *** [Makefile:245: build-container] Error 1
+ EXIT_VALUE=2
+ set +o xtrace
Cleaning up after docker in docker.

wangzhen127 avatar Jan 20 '25 05:01 wangzhen127

The latest test infra change is on 12/30/2024: https://github.com/kubernetes/test-infra/commits/master/config/jobs/kubernetes/node-problem-detector/node-problem-detector-ci.yaml

The latest NPD change is on 1/7/2025: https://github.com/kubernetes/node-problem-detector/commits/master/

The flakiness does not look related to either change.

wangzhen127 avatar Jan 20 '25 05:01 wangzhen127

@BenTheElder Do you know if anything could cause this?

wangzhen127 avatar Jan 20 '25 05:01 wangzhen127

cc @DigitalVeer

wangzhen127 avatar Jan 24 '25 20:01 wangzhen127

looks pretty clear that the compiler is segfaulting under emulation (it fails when building arm64, which is running under qemu in buildkit)

it could be the version of qemu in the image

FWIW I highly recommend not compiling under emulation, for both performance and reliability reasons. Instead, you can cross-compile on the host architecture for the target architecture, then copy that output into an image for the target architecture using a multi-stage build.

see for example: https://github.com/kubernetes-sigs/kind/blob/78cdad26107b27f0b0bc5ad5a878ef41ecab2705/images/local-path-provisioner/Dockerfile#L16-L22 (NOTE the use of $TARGETARCH and $BUILDPLATFORM for the build step versus the final step)

see also: https://www.docker.com/blog/faster-multi-platform-builds-dockerfile-cross-compilation-guide/
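
A minimal sketch of that multi-stage pattern, applied to the failing step from the log above. The builder image (`golang:bookworm`), the cross-toolchain packages, the unpinned base tag, and the destination path are all assumptions for illustration and are not taken from the project's Dockerfile. It also assumes an amd64 build host targeting arm64, and that the Makefile selects the cross C compiler from `GOARCH`, as the `CC=aarch64-linux-gnu-gcc` line in the log suggests.

```dockerfile
# Sketch only: the builder image, package names, and output paths below are
# assumptions for illustration, not the project's actual Dockerfile.

# Build stage: pinned to the build host's platform, so the Go and C compilers
# run natively instead of under qemu.
FROM --platform=$BUILDPLATFORM golang:bookworm AS builder
ARG TARGETARCH

# Assumption: amd64 host targeting arm64. cgo needs a cross C compiler, and
# the journald build tag needs the systemd headers for the target.
RUN dpkg --add-architecture arm64 \
    && apt-get update \
    && apt-get install -y gcc-aarch64-linux-gnu libsystemd-dev:arm64

COPY . /gopath/src/k8s.io/node-problem-detector/
WORKDIR /gopath/src/k8s.io/node-problem-detector
# Same make invocation as the failing step; GOARCH selects the cross target.
RUN GOARCH=${TARGETARCH} make bin/node-problem-detector bin/health-checker bin/log-counter

# Final stage: target-architecture base image. It only receives files, so no
# command in this sketch has to execute under emulation.
FROM --platform=${TARGETPLATFORM} registry.k8s.io/build-image/debian-base:bookworm-v1.0.4 AS base
COPY --from=builder /gopath/src/k8s.io/node-problem-detector/bin/ /usr/local/bin/
```

The key difference from the failing build is the `--platform=$BUILDPLATFORM` on the builder stage: in the log, the `make` step itself runs through `/dev/.buildkit_qemu_emulator`, which is where the compiler segfaults.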

BenTheElder avatar Jan 24 '25 21:01 BenTheElder

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 24 '25 22:04 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar May 24 '25 22:05 k8s-triage-robot

/remove-lifecycle rotten

wangzhen127 avatar Jun 02 '25 21:06 wangzhen127

/lifecycle frozen

hakman avatar Aug 13 '25 06:08 hakman

some experiments: https://github.com/kubernetes/node-problem-detector/pull/1103

SergeyKanzhelev avatar Aug 14 '25 00:08 SergeyKanzhelev

The flakiness has been greatly reduced by the recent changes. I think we can close this for now.

hakman avatar Sep 28 '25 07:09 hakman