Possible change to ci.jenkins.io's linux agents causes core test failure
Service(s)
ci.jenkins.io
Summary
Hey,
It appears that a change to the Linux agents with Java 17 and 21 caused a test failure in Jenkins core. Around January the 5th, MavenTest#sensitiveParameters() started to take much longer than usual.
Starting with build 5730, this single test takes 229s (!!), far exceeding our test timeout of 180s and causing builds to fail.
Build 5728, built on January the 3rd, reports ~39s for the entire MavenTest.
Locally, the entire MavenTest takes less than 30s to complete for me using mvn clean verify -Dtest=hudson.tasks.MavenTest.
Interestingly, this affects Linux agents with Java 17 and 21 only. The test passes in its usual time on Windows agents with Java 17 and on Linux agents with Java 11: https://ci.jenkins.io/job/Core/job/jenkins/job/master/5730/pipeline-graph/
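For readers unfamiliar with the test, here is a rough sketch (not the actual hudson.tasks.MavenTest code; the class name, Maven home path, and parameter values below are placeholders) of what a test like sensitiveParameters() has to do: spawn a real second Maven process through the Maven build step and check that a password parameter never leaks into the build log. It is that nested Maven invocation that makes the test sensitive to repository or proxy latency on the agents.

```java
// Illustrative sketch only, assuming a JenkinsRule-based JUnit 4 test;
// the real test lives in hudson.tasks.MavenTest in Jenkins core.
import hudson.model.FreeStyleProject;
import hudson.model.ParametersDefinitionProperty;
import hudson.model.PasswordParameterDefinition;
import hudson.tasks.Maven;
import java.util.Collections;
import org.junit.Rule;
import org.junit.Test;
import org.jvnet.hudson.test.JenkinsRule;

public class SensitiveParametersSketchTest {

    @Rule
    public JenkinsRule j = new JenkinsRule();

    @Test
    public void passwordParameterIsNotEchoed() throws Exception {
        // Register a Maven installation for the builder to use; the home path is a
        // placeholder assumption, the real test configures Maven via the test harness.
        Maven.MavenInstallation mvn =
                new Maven.MavenInstallation("local-maven", "/usr/share/maven", Collections.emptyList());
        j.jenkins.getDescriptorByType(Maven.DescriptorImpl.class).setInstallations(mvn);

        // A freestyle project with a password parameter and a Maven build step:
        // building it spawns a second, real `mvn` process on the agent.
        FreeStyleProject p = j.createFreeStyleProject();
        p.addProperty(new ParametersDefinitionProperty(
                new PasswordParameterDefinition("PASSWORD", "s3cr3t", "sensitive parameter")));
        p.getBuildersList().add(new Maven("--version", "local-maven"));

        // The secret must never leak into the console log of the nested Maven run.
        j.assertLogNotContains("s3cr3t", j.buildAndAssertSuccess(p));
    }
}
```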
Reproduction steps
No response
I have just increased the timeout from 300s to 500s (!!!) because build 5736 reports that the test takes 350s to complete on a Linux agent with Java 21.
Git plugin builds are failing due to the 60-minute test timeout limit on the Windows agents. Inside the failed build logs is this message:
00:18:16 [WARNING] Could not validate integrity of download from https://repo.azure.jenkins.io/public/org/apache/mina/mina-core/2.2.3/mina-core-2.2.3.jar
00:18:16 org.eclipse.aether.transfer.ChecksumFailureException: Checksum validation failed, no checksums available
00:18:16 at org.eclipse.aether.internal.impl.AbstractChecksumPolicy.onNoMoreChecksums (AbstractChecksumPolicy.java:63)
00:18:16 at org.eclipse.aether.connector.basic.ChecksumValidator.validate (ChecksumValidator.java:112)
00:18:16 at org.eclipse.aether.connector.basic.BasicRepositoryConnector$GetTaskRunner.runTask (BasicRepositoryConnector.java:484)
00:18:16 at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run (BasicRepositoryConnector.java:414)
00:18:16 at org.eclipse.aether.util.concurrency.RunnableErrorForwarder.lambda$wrap$0 (RunnableErrorForwarder.java:66)
00:18:16 at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1136)
00:18:16 at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:635)
00:18:16 at java.lang.Thread.run (Thread.java:833)
Immediately after that message is logged, there is a delay of multiple minutes before the next log entries appear.
When I replay the same job with the added Pipeline parameter useArtifactCachingProxy: false, the build passes.
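For context on that error, here is a minimal sketch, not the Maven resolver's actual code, of the checksum handshake that fails with "no checksums available": the resolver downloads the artifact plus a sibling .sha1 file and compares digests, so a proxy that serves the jar but not its checksum files produces exactly this failure mode. The URL is simply copied from the log above; everything else is illustrative.

```java
// Minimal, hand-rolled illustration of artifact checksum validation (Java 17+).
// This is NOT the code of org.eclipse.aether; it only mirrors the idea behind
// "Checksum validation failed, no checksums available" from the build log.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.security.MessageDigest;
import java.util.HexFormat;

public class ChecksumSketch {
    public static void main(String[] args) throws Exception {
        // Example artifact taken from the log above.
        String artifact = "https://repo.azure.jenkins.io/public/org/apache/mina/mina-core/2.2.3/mina-core-2.2.3.jar";
        HttpClient client = HttpClient.newHttpClient();

        byte[] jar = client.send(HttpRequest.newBuilder(URI.create(artifact)).build(),
                HttpResponse.BodyHandlers.ofByteArray()).body();

        // Maven repositories publish a .sha1 (and .md5) file next to each artifact.
        HttpResponse<String> sha1 = client.send(HttpRequest.newBuilder(URI.create(artifact + ".sha1")).build(),
                HttpResponse.BodyHandlers.ofString());
        if (sha1.statusCode() != 200) {
            // This is the "no checksums available" situation from the build log.
            throw new IllegalStateException("Checksum validation failed, no checksums available");
        }

        String expected = sha1.body().trim().split("\\s+")[0];
        String actual = HexFormat.of().formatHex(MessageDigest.getInstance("SHA-1").digest(jar));
        if (!actual.equalsIgnoreCase(expected)) {
            throw new IllegalStateException("Checksum mismatch: expected " + expected + " got " + actual);
        }
        System.out.println("Checksum OK: " + actual);
    }
}
```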
Likely related: https://github.com/jenkins-infra/helpdesk/issues/3889#issuecomment-1880591582
I've replayed a PR with the artifact caching proxy opt-out and the test still failed for the same reason.
I believe @lemeurherve's comment was aimed at @MarkEWaite (i.e. the message "Could not validate integrity of download from https://repo.azure.jenkins.io/...").
Regarding the problem mentioned in this issue on Jenkins Core (test timeouts), there isn't anything obvious in the image changelog: https://github.com/jenkins-infra/packer-images/releases/tag/1.45.0 . It might be related to updated packages in the Ubuntu distribution for these images, or to something else.
We are diagnosing, but the only thing we know for sure is that it appeared on the 5th of January; we don't have any other evidence for now.
We've deployed a new image version (1.46.0) in https://github.com/jenkins-infra/jenkins-infra/pull/3246 with a bunch of upgraded Ubuntu packages (including kernel).
@NotMyFault Can you retry a build to see the behavior? (What changed since yesterday: the ACP was fully cleaned up and a new template image was deployed.)
Additionally, it could be linked with https://github.com/jenkins-infra/helpdesk/issues/3874 (but I see that JDK11 and JDK17 are also having timeouts).
What is the test doing exactly? Does it require network operations en masse?
I'll give it a shot later today. But can I build it normally, with the ACP enabled?
I'm not sure I fully understand your question, could you clarify?
Removing the timeout increase results in the same outcome as before: https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/1/
The test case still takes 230s.
A first check of the resource usage, centered on the JDK21 case:
- Successful build on the master branch on the 3rd of January (https://ci.jenkins.io/job/Core/job/jenkins/job/master/5728):
  - CPU usage: (screenshot)
  - Memory usage: (screenshot)
vs.
- First build of the PR above (https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/1/):
  - CPU usage: (screenshot)
  - Memory usage: (screenshot)
=> Not much difference in resource usage. It's really weird 🤔
@NotMyFault I'm going to roll back to 1.44.0 (the image used before the 5th of January) to see if we can return to the previous behavior.
Launched https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2/ with the 1.44.0 version. Let's see.
https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2/testReport/ :(
OK, so it is not the image. Gotta check other things then.
I backported the WithTimeout change from @NotMyFault to the 2.426.3 backporting pull request so that the 2.426.3 release candidate on Wednesday 10 Jan 2024 would not risk being delayed. The change is purely in a test, so very low risk.
The change is https://github.com/jenkinsci/jenkins/pull/8843/commits/914f0d46ebd92ab718b0d657a7dd241fc4e029d4
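For reference, this kind of test-only workaround relies on the jenkins-test-harness @WithTimeout recipe, which overrides JenkinsRule's per-test timeout (in seconds) for a single test method. The sketch below only illustrates the mechanism; the exact value and the surrounding test body are in the linked commit.

```java
// Sketch of a test-only timeout increase using the jenkins-test-harness recipe;
// class name and test body are placeholders, the real change is in the linked commit.
import org.junit.Rule;
import org.junit.Test;
import org.jvnet.hudson.test.JenkinsRule;
import org.jvnet.hudson.test.recipes.WithTimeout;

public class TimeoutWorkaroundSketch {

    @Rule
    public JenkinsRule j = new JenkinsRule();

    @WithTimeout(500) // raised from the default so slow CI agents do not abort the run
    @Test
    public void sensitiveParameters() throws Exception {
        // ... test body unchanged; only the timeout annotation was touched ...
    }
}
```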
@dduportal Following up the chat in matrix, I've replayed the PR with the reverted image and the test is no longer failing on any JDK.
@NotMyFault can you confirm in the next 48h that it is still OK by closing the issue with a comment?
Wouldn't bumping the template again in the future result in a failing test, as before?
I don't see the causality here. It was the same day (i.e. an apparent correlation), but using the 1.44.0 (previous) template also triggered the timeout issue, which shows it is unrelated. Then switching back to 1.45.0 showed the issue fixed, and 1.46.0 as well.
It means we don't really know what caused it, but it's clearly not the template.
However, we can keep searching for what changed. The network is a good clue, as we changed a lot of things these past days.
I guess we can close the issue for now. The timeout is set to 500s and thanks to @MarkEWaite this change is present on the current LTS line too, preventing future issues.
There should be an issue tracking the removal of the workaround.
I have filed JENKINS-72577 to track the removal of the workaround added in https://github.com/jenkinsci/jenkins/pull/8840. That JENKINS-72577 in turn depends on the resolution of the operational issue that led to the workaround, tracked in this (now reopened) ticket.
The issue is marked as related to the ACP, but I still haven't found any clue that it is (or is not). Did you see anything pointing in its direction?
The very slow test runtime is only happening on CI infrastructure, as the test takes only 5 seconds for me locally, even when running in a clean Docker container without a warm Maven cache. Whether the operational issue is due to ACP or some other area of infrastructure I cannot be sure, but I suspect ACP because the test is invoking a second Maven process, and we have seen various ACP complications in the past running Maven within Maven such as https://github.com/jenkinsci/maven-hpi-plugin/issues/541. I have updated the wording of JENKINS-72577 to clarify the above.
No problem: I did not have enough time (nor did the rest of the team) to check in detail.
Reopening the issue is the correct move (thanks for doing it!), as the problem is present on the ci.jenkins.io infrastructure: we acknowledge it.
My question was really in case you had already seen things that we missed. Thanks for the explanation.
Opened https://github.com/jenkinsci/jenkins/pull/9374 as the test now takes around 5s on Linux with JDK21, following 4 months of infrastructure changes (see the PR description).
Closing the infra issue as we confirmed the problem is now gone.
Besides, https://github.com/jenkinsci/jenkins/pull/9374 has been approved and will be merged, closing https://issues.jenkins.io/browse/JENKINS-72577.