
Restore jenkinsci/jenkins build stability

Open daniel-beck opened this issue 6 months ago • 7 comments

Service(s)

ci.jenkins.io

Summary

There hasn't been a stable build of jenkinsci/jenkins:master for several days. Looking over recent builds, around half of the finished builds are unstable on the master branch, with random tests failing.

  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7392/testReport/ / 4f6921b
    • linux-jdk21 / Linux - JDK 21 - Build / Test / hudson.model.ExecutorTest.disconnectCause
    • https://github.com/jenkinsci/jenkins/pull/10700
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7390/testReport/ / 4b52a15
    • linux-jdk17 / Linux - JDK 17 - Build / Test / hudson.model.AbstractProjectTest.wipeWorkspaceProtected2
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7382/testReport/ / 4d16ee6
    • linux-jdk17 / Linux - JDK 17 - Build / Test / hudson.model.ExecutorTest.disconnectCause
    • https://github.com/jenkinsci/jenkins/pull/10700
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7378/testReport/ / 7aa9647
    • linux-jdk17 / Linux - JDK 17 - Build / Test / hudson.model.ComputerTest.offlineCauseRemainsAfterTemporaryCauseRemoved
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7369/testReport/ / 0d5bfda
    • windows-jdk17 / Windows - JDK 17 - Build / Test / hudson.model.QueueTest.inQueueTaskLookupByAPI
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7363/testReport/ / 5abdacd
    • linux-jdk17 / Linux - JDK 17 - Build / Test / hudson.model.ExecutorTest.disconnectCause
    • https://github.com/jenkinsci/jenkins/pull/10700
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7361/testReport/ / 03bc1db
    • linux-jdk21 / Linux - JDK 21 - Build / Test / hudson.node_monitors.ResponseTimeMonitorTest.skipOfflineAgent
    • https://github.com/jenkinsci/jenkins/pull/10654
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7360/testReport/ / ed4e4d9
    • linux-jdk21 / Linux - JDK 21 - Build / Test / hudson.node_monitors.ResponseTimeMonitorTest.skipOfflineAgent
    • https://github.com/jenkinsci/jenkins/pull/10654
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7359/testReport/ / 45e9f57
    • linux-jdk21 / Linux - JDK 21 - Build / Test / hudson.node_monitors.ResponseTimeMonitorTest.skipOfflineAgent
    • https://github.com/jenkinsci/jenkins/pull/10654
    • linux-jdk21 / Linux - JDK 21 - Build / Test / jenkins.widgets.BuildTimeTrendTest.withAbstractJob_OnBoth
    • linux-jdk17 / Linux - JDK 17 - Build / Test / hudson.widgets.HistoryWidgetTest.displayFilterInput
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7357/testReport/ / a5a61cd
    • windows-jdk17 / Windows - JDK 17 - Build / Test / hudson.cli.ComputerStateTest.testUiForConnected
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7354/testReport/ / e87723b
    • linux-jdk17 / Linux - JDK 17 - Build / Test / hudson.model.DescriptorTest.nestedDescribableOverridingId
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7350/testReport/ / 81df8e6
    • linux-jdk21 / Linux - JDK 21 - Build / Test / hudson.node_monitors.ResponseTimeMonitorTest.skipOfflineAgent
    • https://github.com/jenkinsci/jenkins/pull/10654
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7349/testReport/ / 9ad563a
    • linux-jdk21 / Linux - JDK 21 - Build / Test / hudson.node_monitors.ResponseTimeMonitorTest.skipOfflineAgent
    • https://github.com/jenkinsci/jenkins/pull/10654
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7342/testReport/
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7334/testReport/
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7329/testReport/
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7321/testReport/
  • https://ci.jenkins.io/job/Core/job/jenkins/job/master/7320/testReport/

Only one test (#skipOfflineAgent) fails regularly; the others do not.
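As a rough illustration of how the list above can be tallied, here is a minimal Python sketch. The `FAILURES` mapping is transcribed by hand from the builds listed above that still have test names attached (it is not fetched from ci.jenkins.io), and the function names and the `min_builds` threshold are hypothetical choices made for this example.

```python
from collections import Counter

SKIP_OFFLINE = "hudson.node_monitors.ResponseTimeMonitorTest.skipOfflineAgent"
DISCONNECT = "hudson.model.ExecutorTest.disconnectCause"

# Failing tests per master build number, copied by hand from the list above.
FAILURES = {
    7392: [DISCONNECT],
    7390: ["hudson.model.AbstractProjectTest.wipeWorkspaceProtected2"],
    7382: [DISCONNECT],
    7378: ["hudson.model.ComputerTest.offlineCauseRemainsAfterTemporaryCauseRemoved"],
    7369: ["hudson.model.QueueTest.inQueueTaskLookupByAPI"],
    7363: [DISCONNECT],
    7361: [SKIP_OFFLINE],
    7360: [SKIP_OFFLINE],
    7359: [
        SKIP_OFFLINE,
        "jenkins.widgets.BuildTimeTrendTest.withAbstractJob_OnBoth",
        "hudson.widgets.HistoryWidgetTest.displayFilterInput",
    ],
    7357: ["hudson.cli.ComputerStateTest.testUiForConnected"],
    7354: ["hudson.model.DescriptorTest.nestedDescribableOverridingId"],
    7350: [SKIP_OFFLINE],
    7349: [SKIP_OFFLINE],
}


def count_builds_per_test(failures):
    """Count, for each test, the number of builds in which it failed."""
    counts = Counter()
    for tests in failures.values():
        # A set guards against counting a test twice within one build.
        counts.update(set(tests))
    return counts


def recurring(failures, min_builds=2):
    """Return (test, build_count) pairs for tests failing in at least min_builds builds."""
    ranked = count_builds_per_test(failures).most_common()
    return [(test, n) for test, n in ranked if n >= min_builds]


if __name__ == "__main__":
    for test, n in recurring(FAILURES):
        print(f"{n} builds: {test}")
```

On the list as transcribed here, `skipOfflineAgent` comes first with five builds; what counts as "regularly" depends on the threshold chosen.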

It took seven attempts to get a single stable build (needed for an incrementals deployment) in https://ci.jenkins.io/job/Core/job/jenkins/job/PR-10630/, and no two of the six unstable builds failed on the same tests.

Reproduction steps

No response

daniel-beck avatar May 15 '25 08:05 daniel-beck

@daniel-beck we (infra team) will need help to investigate this. We have no idea how the tests are working and what they are doing so it might be hard for us.

First step: we'll check the agent metrics (Datadog AND Azure) to see if we detect any contention. I/O is usually the most common culprit with Jenkins, but that is a gut feeling with no proof behind it, so we have to check carefully.

dduportal avatar May 15 '25 09:05 dduportal

Ideal outcome: the situation is stabilised AND there is monitoring that alerts (someone?) to prevent such a situation from occurring again

Wadeck avatar May 15 '25 09:05 Wadeck

Noting https://github.com/jenkinsci/jenkins/pull/10654 (Possible resilience improvement for (#skipOfflineAgent))

timja avatar May 15 '25 14:05 timja

> @daniel-beck we (infra team) will need help to investigate this. We have no idea how the tests are working and what they are doing so it might be hard for us.

I don't think this is an infra team issue. The tests are unreliable. That is not an infrastructure problem.

> Ideal outcome: the situation is stabilised AND there is monitoring that alerts (someone?) to prevent such a situation from occurring again

I already have lightweight monitoring of Jenkins core and key plugins but it only shows the current status without any alerting. Do we have Jenkins developer volunteers that are willing to be on a rotation to be notified of failures? I don't think we should expect the Jenkins infra team to fix reliability issues with Jenkins tests.

Here is a list of builds of the master branch and several pull requests merged with the master branch where tests are expected to all pass. Status reports show:

  • Build Status
  • https://github.com/jenkinsci/jenkins/pull/10390 Build Status
  • https://github.com/jenkinsci/jenkins/pull/10559 Build Status
  • https://github.com/jenkinsci/jenkins/pull/10561 Build Status
  • https://github.com/jenkinsci/jenkins/pull/10579 Build Status
  • https://github.com/jenkinsci/jenkins/pull/10580 Build Status
  • https://github.com/jenkinsci/jenkins/pull/10581 Build Status
  • https://github.com/jenkinsci/jenkins/pull/10582 Build Status
  • https://github.com/jenkinsci/jenkins/pull/10607 Build Status

MarkEWaite avatar May 16 '25 02:05 MarkEWaite

The two jobs that failed in the previous comment both failed because of a flaky test, for which a fix is proposed in:

  • https://github.com/jenkinsci/jenkins/pull/10656

MarkEWaite avatar May 17 '25 13:05 MarkEWaite

BTW linking to Jenkins build results is not great since those get automatically deleted pretty quickly. Prefer to (additionally) link to GH Checks permalinks at least in those cases where they do include the stack trace & stdout/stderr.

jglick avatar May 19 '25 18:05 jglick

> BTW linking to Jenkins build results is not great since those get automatically deleted pretty quickly. Prefer to (additionally) link to GH Checks permalinks at least in those cases where they do include the stack trace & stdout/stderr.

I additionally listed the failing tests for the builds that still exist, added a few more recent ones, and linked to the built commits. That provides a reference to statuses as well.

daniel-beck avatar May 27 '25 07:05 daniel-beck