
Possible change to ci.jenkins.io's Linux agents causes core test failure

Open · NotMyFault opened this issue 1 year ago • 27 comments

Service(s)

ci.jenkins.io

Summary

Hey,

It appears that a change to the Linux agents with Java 17 and 21 caused a test failure in Jenkins core. Around January 5th, MavenTest#sensitiveParameters() started to take much longer than usual.

Starting with build 5730, this single test takes 229s (!!), far exceeding our test timeout of 180s and causing builds to fail.

Build 5728, run on January 3rd, notes ~39s for the entire MavenTest.

Locally, the entire MavenTest takes less than 30s to complete for me using mvn clean verify -Dtest=hudson.tasks.MavenTest.

Interestingly, this affects Linux agents with Java 17 and 21 only. The test passes in its usual time on Windows agents with Java 17 and Linux agents with Java 11: https://ci.jenkins.io/job/Core/job/jenkins/job/master/5730/pipeline-graph/

Reproduction steps

No response

NotMyFault avatar Jan 06 '24 17:01 NotMyFault

I have just increased the timeout from 300s to 500s (!!!) because build 5736 reports that the test takes 350s to complete on a Linux agent with Java 21.

NotMyFault avatar Jan 07 '24 09:01 NotMyFault

Git plugin builds are failing due to the 60-minute test timeout limit on the Windows agents. Inside the failed build logs is this message:

00:18:16  [WARNING] Could not validate integrity of download from https://repo.azure.jenkins.io/public/org/apache/mina/mina-core/2.2.3/mina-core-2.2.3.jar
00:18:16  org.eclipse.aether.transfer.ChecksumFailureException: Checksum validation failed, no checksums available
00:18:16      at org.eclipse.aether.internal.impl.AbstractChecksumPolicy.onNoMoreChecksums (AbstractChecksumPolicy.java:63)
00:18:16      at org.eclipse.aether.connector.basic.ChecksumValidator.validate (ChecksumValidator.java:112)
00:18:16      at org.eclipse.aether.connector.basic.BasicRepositoryConnector$GetTaskRunner.runTask (BasicRepositoryConnector.java:484)
00:18:16      at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run (BasicRepositoryConnector.java:414)
00:18:16      at org.eclipse.aether.util.concurrency.RunnableErrorForwarder.lambda$wrap$0 (RunnableErrorForwarder.java:66)
00:18:16      at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1136)
00:18:16      at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:635)
00:18:16      at java.lang.Thread.run (Thread.java:833)

Immediately after that message is logged, there is a multi-minute delay in log entries.

When I replay the same job with the added Pipeline parameter useArtifactCachingProxy: false, the build passes.

MarkEWaite avatar Jan 07 '24 14:01 MarkEWaite

Likely related: https://github.com/jenkins-infra/helpdesk/issues/3889#issuecomment-1880591582

lemeurherve avatar Jan 08 '24 09:01 lemeurherve

Likely related: #3889 (comment)

I've replayed a PR with the artifact caching proxy opt-out, and the test still failed for the same reason.

NotMyFault avatar Jan 08 '24 09:01 NotMyFault

Likely related: #3889 (comment)

I've replayed a PR with the artifact caching proxy opt-out, and the test still failed for the same reason.

I believe @lemeurherve's comment was aimed at @MarkEWaite (i.e. the message Could not validate integrity of download from https://repo.azure.jenkins.io/...).

About the problem mentioned in this issue on Jenkins core (test timeouts), there isn't anything obvious in the image changelog: https://github.com/jenkins-infra/packer-images/releases/tag/1.45.0 . It might be related to updated packages in the Ubuntu distribution for these images, or maybe something else.

We are diagnosing, but the only thing we are sure of is that it appeared on the 5th of January; we don't have any other evidence for now.

dduportal avatar Jan 08 '24 09:01 dduportal

We've deployed a new image version (1.46.0) in https://github.com/jenkins-infra/jenkins-infra/pull/3246 with a bunch of upgraded Ubuntu packages (including kernel).

@NotMyFault Can you retry a build to see the behavior? (what changed since yesterday: ACP fully cleaned up + new template image).

dduportal avatar Jan 08 '24 10:01 dduportal

Additionally, this could be linked with https://github.com/jenkins-infra/helpdesk/issues/3874 (but I see that JDK11 and JDK17 are also having timeouts).

What is the test doing exactly? Does it require network operations en masse?

dduportal avatar Jan 08 '24 10:01 dduportal

I'll give it a shot later today. But I can build it normally with the ACP enabled?

NotMyFault avatar Jan 08 '24 11:01 NotMyFault

I'll give it a shot later today. But I can build it normally with the ACP enabled?

I'm not sure I fully understand your question, could you clarify?

dduportal avatar Jan 08 '24 11:01 dduportal

Removing the timeout increase results in the same outcome as before: https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/1/

The test case takes 230s still.

NotMyFault avatar Jan 08 '24 15:01 NotMyFault

A first check of the resource usage centered on the JDK21 case:

  • Successful build on the master branch, the 3rd of January: https://ci.jenkins.io/job/Core/job/jenkins/job/master/5728
    • CPU usage: [screenshot]
    • Memory usage: [screenshot]

vs.

  • First build of the PR above (https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/1/)
    • CPU usage: [screenshot]
    • Memory usage: [screenshot]

dduportal avatar Jan 08 '24 15:01 dduportal

=> not much difference in resource usage. It's really weird 🤔

dduportal avatar Jan 08 '24 15:01 dduportal

@NotMyFault I'm going to rollback to 1.44.0 (the image used before the 5th of January) to see if we can return to previous behavior

dduportal avatar Jan 08 '24 15:01 dduportal

@NotMyFault I'm going to rollback to 1.44.0 (the image used before the 5th of January) to see if we can return to previous behavior

Launched https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2/ with the 1.44.0 version. Let's see

dduportal avatar Jan 08 '24 15:01 dduportal

@NotMyFault I'm going to rollback to 1.44.0 (the image used before the 5th of January) to see if we can return to previous behavior

Launched ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2 with the 1.44.0 version. Let's see

https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2/testReport/ :(

NotMyFault avatar Jan 08 '24 19:01 NotMyFault

@NotMyFault I'm going to rollback to 1.44.0 (the image used before the 5th of January) to see if we can return to previous behavior

Launched ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2 with the 1.44.0 version. Let's see

https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2/testReport/ :(

Ok so it is not the image. Gotta check other things then

dduportal avatar Jan 08 '24 19:01 dduportal

I backported the WithTimeout change from @NotMyFault to the 2.426.3 backporting pull request so that the 2.426.3 release candidate on Wednesday 10 Jan 2024 would not risk being delayed. The change is purely in a test, so very low risk. The change is https://github.com/jenkinsci/jenkins/pull/8843/commits/914f0d46ebd92ab718b0d657a7dd241fc4e029d4

MarkEWaite avatar Jan 09 '24 04:01 MarkEWaite
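For context, a timeout bump of this kind is typically expressed with the jenkins-test-harness @WithTimeout recipe annotation. A minimal sketch, assuming that annotation and a placeholder test body (the real diff is in the commit linked above):

import org.junit.Rule;
import org.junit.Test;
import org.jvnet.hudson.test.JenkinsRule;
import org.jvnet.hudson.test.recipes.WithTimeout;

public class MavenTestTimeoutSketch {

    @Rule
    public JenkinsRule j = new JenkinsRule();

    // Raise the timeout for this one slow test (the harness default is 180s)
    // so the infrastructure slowdown does not abort the whole build.
    @Test
    @WithTimeout(500)
    public void sensitiveParameters() throws Exception {
        // ... original test body unchanged; see hudson.tasks.MavenTest ...
    }
}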

@dduportal Following up on the chat in Matrix, I've replayed the PR with the reverted image and the test is no longer failing on any JDK.

NotMyFault avatar Jan 09 '24 14:01 NotMyFault

@NotMyFault can you confirm in the next 48h it is still ok by closing the issue with a comment?

dduportal avatar Jan 09 '24 18:01 dduportal

@NotMyFault can you confirm in the next 48h it is still ok by closing the issue with a comment?

Wouldn't bumping the template again in the future result in a failing test, as before?

NotMyFault avatar Jan 09 '24 18:01 NotMyFault

@NotMyFault can you confirm in the next 48h it is still ok by closing the issue with a comment?

Wouldn't bumping the template again in the future result in a failing test, as before?

I don't see the causality here. It happened on the same day (i.e. apparent correlation), but using the 1.44.0 (previous) template also triggered the timeout issue, which shows it is unrelated. Then switching back to 1.45.0 showed the issue fixed, and 1.46.0 as well.

It means we don't really know what caused it, but it's clearly not the template.

However, we can keep searching for what changed. The network is a good clue, as we changed a lot of things in the past few days.

dduportal avatar Jan 10 '24 06:01 dduportal

It means we don't really know what caused it, but it's clearly not the template.

I guess we can close the issue for now. The timeout is set to 500s and thanks to @MarkEWaite this change is present on the current LTS line too, preventing future issues.

NotMyFault avatar Jan 10 '24 09:01 NotMyFault

There should be an issue tracking the removal of the workaround.

basil avatar Jan 10 '24 15:01 basil

I have filed JENKINS-72577 to track the removal of the workaround added in https://github.com/jenkinsci/jenkins/pull/8840. That JENKINS-72577 in turn depends on the resolution of the operational issue that led to the workaround, tracked in this (now reopened) ticket.

basil avatar Jan 17 '24 15:01 basil

The issue is marked as related to ACP, but I still haven't found any clue that it is (or that it is not). Did you see anything pointing in that direction?

dduportal avatar Jan 17 '24 16:01 dduportal

The very slow test runtime is only happening on CI infrastructure, as the test takes only 5 seconds for me locally, even when running in a clean Docker container without a warm Maven cache. Whether the operational issue is due to ACP or some other area of infrastructure I cannot be sure, but I suspect ACP because the test is invoking a second Maven process, and we have seen various ACP complications in the past running Maven within Maven such as https://github.com/jenkinsci/maven-hpi-plugin/issues/541. I have updated the wording of JENKINS-72577 to clarify the above.

basil avatar Jan 17 '24 16:01 basil
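To illustrate the "Maven within Maven" point above: a hypothetical JenkinsRule test along these lines (not the actual MavenTest body) forks a real mvn process on the agent, and that child process resolves its dependencies through whatever repository or proxy configuration the agent provides, such as the artifact caching proxy:

import hudson.model.FreeStyleBuild;
import hudson.model.FreeStyleProject;
import hudson.tasks.Maven;
import org.junit.Rule;
import org.junit.Test;
import org.jvnet.hudson.test.JenkinsRule;

public class MavenWithinMavenSketch {

    @Rule
    public JenkinsRule j = new JenkinsRule();

    @Test
    public void forksASecondMavenProcess() throws Exception {
        FreeStyleProject p = j.createFreeStyleProject();
        // The Maven build step launches an external mvn process (assumed to be
        // available on the agent PATH here); its artifact downloads go through
        // the agent's repository settings, e.g. the artifact caching proxy.
        p.getBuildersList().add(new Maven("validate", null));
        FreeStyleBuild b = j.buildAndAssertSuccess(p);
        j.assertLogContains("BUILD SUCCESS", b);
    }
}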

The very slow test runtime is only happening on CI infrastructure, as the test takes only 5 seconds for me locally, even when running in a clean Docker container without a warm Maven cache. Whether the operational issue is due to ACP or some other area of infrastructure I cannot be sure, but I suspect ACP because the test is invoking a second Maven process, and we have seen various ACP complications in the past running Maven within Maven such as jenkinsci/maven-hpi-plugin#541. I have updated the wording of JENKINS-72577 to clarify the above.

No problem: I did not have enough time (nor did the rest of the team) to check in detail.

Reopening the issue is the correct move (thanks for doing it!) as the problem is present on ci.jenkins.io infrastructure: we acknowledge it.

My question was really in case you had already seen things that we missed. Thanks for the explanation.

dduportal avatar Jan 17 '24 17:01 dduportal

Opened https://github.com/jenkinsci/jenkins/pull/9374 as the tests are now around 5s on Linux with JDK21, following 4 months of infrastructure changes (see the PR message).

dduportal avatar Jun 11 '24 09:06 dduportal

Closing the infra issue as we confirmed the problem is now gone.

Besides, https://github.com/jenkinsci/jenkins/pull/9374 has been approved and will be merged, closing https://issues.jenkins.io/browse/JENKINS-72577

dduportal avatar Jun 11 '24 11:06 dduportal