
Possible change to ci.jenkins.io's Linux agents causes core test failure

Open · NotMyFault opened this issue 1 year ago • 27 comments

Service(s)

ci.jenkins.io

Summary

Hey,

It appears that a change to the Linux agents with Java 17 and 21 caused a test failure in Jenkins core. Around January 5th, MavenTest#sensitiveParameters() started to take much longer than usual.

Starting with build 5730, this single test takes 229s (!!), far exceeding our test timeout of 180s and causing builds to fail.

Build 5728, run on January 3rd, notes ~39s for the entire MavenTest.

Locally, the entire MavenTest takes less than 30s to complete for me using mvn clean verify -Dtest=hudson.tasks.MavenTest.

Interestingly, this affects Linux agents with Java 17 and 21 only. The test passes in its usual time on Windows agents with Java 17 and Linux agents with Java 11: https://ci.jenkins.io/job/Core/job/jenkins/job/master/5730/pipeline-graph/

Reproduction steps

No response

NotMyFault avatar Jan 06 '24 17:01 NotMyFault

I have just increased the timeout from 300s to 500s (!!!) because build 5736 reports that the test takes 350s to complete on a Linux agent with Java 21.

NotMyFault avatar Jan 07 '24 09:01 NotMyFault

Git plugin builds are failing due to the 60-minute test timeout limit on the Windows agents. Inside the failed build logs is this message:

00:18:16  [WARNING] Could not validate integrity of download from https://repo.azure.jenkins.io/public/org/apache/mina/mina-core/2.2.3/mina-core-2.2.3.jar
00:18:16  org.eclipse.aether.transfer.ChecksumFailureException: Checksum validation failed, no checksums available
00:18:16      at org.eclipse.aether.internal.impl.AbstractChecksumPolicy.onNoMoreChecksums (AbstractChecksumPolicy.java:63)
00:18:16      at org.eclipse.aether.connector.basic.ChecksumValidator.validate (ChecksumValidator.java:112)
00:18:16      at org.eclipse.aether.connector.basic.BasicRepositoryConnector$GetTaskRunner.runTask (BasicRepositoryConnector.java:484)
00:18:16      at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run (BasicRepositoryConnector.java:414)
00:18:16      at org.eclipse.aether.util.concurrency.RunnableErrorForwarder.lambda$wrap$0 (RunnableErrorForwarder.java:66)
00:18:16      at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1136)
00:18:16      at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:635)
00:18:16      at java.lang.Thread.run (Thread.java:833)

Immediately after that message is logged, there is a multi-minute delay in log entries.

When I replay the same job with the added Pipeline parameter useArtifactCachingProxy: false, the build passes.

MarkEWaite avatar Jan 07 '24 14:01 MarkEWaite

Likely related: https://github.com/jenkins-infra/helpdesk/issues/3889#issuecomment-1880591582

lemeurherve avatar Jan 08 '24 09:01 lemeurherve

Likely related: #3889 (comment)

I've replayed a PR with the artifact caching proxy opt-out, and the test still failed for the same reason.

NotMyFault avatar Jan 08 '24 09:01 NotMyFault

Likely related: #3889 (comment)

I've replayed a PR with the artifact caching proxy opt-out, and the test still failed for the same reason.

I believe @lemeurherve's comment was aimed at @MarkEWaite (i.e. the message Could not validate integrity of download from https://repo.azure.jenkins.io/...).

About the problem mentioned in this issue on Jenkins core (test timeouts), there isn't anything obvious in the image changelog: https://github.com/jenkins-infra/packer-images/releases/tag/1.45.0 . It might be related to updated packages in the Ubuntu distribution for these images, or maybe something else.

We are diagnosing, but the only thing we are sure of is that it appeared on the 5th of January; we don't have any other evidence for now.

dduportal avatar Jan 08 '24 09:01 dduportal

We've deployed a new image version (1.46.0) in https://github.com/jenkins-infra/jenkins-infra/pull/3246 with a bunch of upgraded Ubuntu packages (including kernel).

@NotMyFault Can you retry a build to see the behavior? (what changed since yesterday: ACP fully cleaned up + new template image).

dduportal avatar Jan 08 '24 10:01 dduportal

Additionally, this could be linked with https://github.com/jenkins-infra/helpdesk/issues/3874 (but I see that JDK11 and JDK17 are also having timeouts).

What is the test doing exactly? Does it require network operations en masse?

dduportal avatar Jan 08 '24 10:01 dduportal

I'll give it a shot later today. But I can build it normally with the ACP enabled?

NotMyFault avatar Jan 08 '24 11:01 NotMyFault

I'll give it a shot later today. But I can build it normally with the ACP enabled?

I'm not sure I fully understand your question, could you clarify?

dduportal avatar Jan 08 '24 11:01 dduportal

Removing the timeout increase results in the same outcome as before: https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/1/

The test case takes 230s still.

NotMyFault avatar Jan 08 '24 15:01 NotMyFault

A first check of the resource usage centered on the JDK21 case:

  • Successful build on the master branch, the 3rd of January: https://ci.jenkins.io/job/Core/job/jenkins/job/master/5728
    • CPU usage: [screenshot]
    • Memory usage: [screenshot]

vs.

  • First build of the PR above (https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/1/)
    • CPU usage: [screenshot]
    • Memory usage: [screenshot]

dduportal avatar Jan 08 '24 15:01 dduportal

=> not much difference in resource usage. It's really weird 🤔

dduportal avatar Jan 08 '24 15:01 dduportal

@NotMyFault I'm going to rollback to 1.44.0 (the image used before the 5th of January) to see if we can return to previous behavior

dduportal avatar Jan 08 '24 15:01 dduportal

@NotMyFault I'm going to rollback to 1.44.0 (the image used before the 5th of January) to see if we can return to previous behavior

Launched https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2/ with the 1.44.0 version. Let's see

dduportal avatar Jan 08 '24 15:01 dduportal

@NotMyFault I'm going to rollback to 1.44.0 (the image used before the 5th of January) to see if we can return to previous behavior

Launched ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2 with the 1.44.0 version. Let's see

https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2/testReport/ :(

NotMyFault avatar Jan 08 '24 19:01 NotMyFault

@NotMyFault I'm going to rollback to 1.44.0 (the image used before the 5th of January) to see if we can return to previous behavior

Launched ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2 with the 1.44.0 version. Let's see

https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2/testReport/ :(

Ok so it is not the image. Gotta check other things then

dduportal avatar Jan 08 '24 19:01 dduportal

I backported the WithTimeout change from @NotMyFault to the 2.426.3 backporting pull request so that the 2.426.3 release candidate on Wednesday 10 Jan 2024 would not risk being delayed. The change is purely in a test, so very low risk. The change is https://github.com/jenkinsci/jenkins/pull/8843/commits/914f0d46ebd92ab718b0d657a7dd241fc4e029d4

MarkEWaite avatar Jan 09 '24 04:01 MarkEWaite
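For context, a timeout bump of this kind is typically expressed with the jenkins-test-harness @WithTimeout recipe annotation. A minimal sketch, assuming that annotation and a placeholder test body (the real diff is in the commit linked above):

import org.junit.Rule;
import org.junit.Test;
import org.jvnet.hudson.test.JenkinsRule;
import org.jvnet.hudson.test.recipes.WithTimeout;

public class MavenTestTimeoutSketch {

    @Rule
    public JenkinsRule j = new JenkinsRule();

    // Raise the timeout for this one slow test (the harness default is 180s)
    // so the infrastructure slowdown does not abort the whole build.
    @Test
    @WithTimeout(500)
    public void sensitiveParameters() throws Exception {
        // ... original test body unchanged; see hudson.tasks.MavenTest ...
    }
}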

@dduportal Following up on the chat in Matrix, I've replayed the PR with the reverted image and the test is no longer failing on any JDK.

NotMyFault avatar Jan 09 '24 14:01 NotMyFault

@NotMyFault can you confirm in the next 48h it is still ok by closing the issue with a comment?

dduportal avatar Jan 09 '24 18:01 dduportal

@NotMyFault can you confirm in the next 48h it is still ok by closing the issue with a comment?

Wouldn't bumping the template again in the future result in a failing test, as before?

NotMyFault avatar Jan 09 '24 18:01 NotMyFault

@NotMyFault can you confirm in the next 48h it is still ok by closing the issue with a comment?

Wouldn't bumping the template again in the future result in a failing test, as before?

I don't see the causality here. It happened on the same day (i.e. apparent correlation), but using the 1.44.0 (previous) template also triggered the timeout issue, which shows it is unrelated. Then switching back to 1.45.0 showed the issue fixed, and 1.46.0 as well.

It means we don't really know what caused it, but it's clearly not the template.

However, we can keep searching for what changed. The network is a good clue, as we changed a lot of things in the past few days.

dduportal avatar Jan 10 '24 06:01 dduportal

It means we don't really know what caused it, but it's clearly not the template.

I guess we can close the issue for now. The timeout is set to 500s and thanks to @MarkEWaite this change is present on the current LTS line too, preventing future issues.

NotMyFault avatar Jan 10 '24 09:01 NotMyFault

There should be an issue tracking the removal of the workaround.

basil avatar Jan 10 '24 15:01 basil

I have filed JENKINS-72577 to track the removal of the workaround added in https://github.com/jenkinsci/jenkins/pull/8840. That JENKINS-72577 in turn depends on the resolution of the operational issue that led to the workaround, tracked in this (now reopened) ticket.

basil avatar Jan 17 '24 15:01 basil

The issue is marked as related to ACP, but I still haven't found any clue that it is (or that it is not). Did you see anything pointing in that direction?

dduportal avatar Jan 17 '24 16:01 dduportal

The very slow test runtime is only happening on CI infrastructure, as the test takes only 5 seconds for me locally, even when running in a clean Docker container without a warm Maven cache. Whether the operational issue is due to ACP or some other area of infrastructure I cannot be sure, but I suspect ACP because the test is invoking a second Maven process, and we have seen various ACP complications in the past running Maven within Maven such as https://github.com/jenkinsci/maven-hpi-plugin/issues/541. I have updated the wording of JENKINS-72577 to clarify the above.

basil avatar Jan 17 '24 16:01 basil
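To illustrate the "Maven within Maven" point above: a hypothetical JenkinsRule test along these lines (not the actual MavenTest body) forks a real mvn process on the agent, and that child process resolves its dependencies through whatever repository or proxy configuration the agent provides, such as the artifact caching proxy:

import hudson.model.FreeStyleBuild;
import hudson.model.FreeStyleProject;
import hudson.tasks.Maven;
import org.junit.Rule;
import org.junit.Test;
import org.jvnet.hudson.test.JenkinsRule;

public class MavenWithinMavenSketch {

    @Rule
    public JenkinsRule j = new JenkinsRule();

    @Test
    public void forksASecondMavenProcess() throws Exception {
        FreeStyleProject p = j.createFreeStyleProject();
        // The Maven build step launches an external mvn process (assumed to be
        // available on the agent PATH here); its artifact downloads go through
        // the agent's repository settings, e.g. the artifact caching proxy.
        p.getBuildersList().add(new Maven("validate", null));
        FreeStyleBuild b = j.buildAndAssertSuccess(p);
        j.assertLogContains("BUILD SUCCESS", b);
    }
}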

The very slow test runtime is only happening on CI infrastructure, as the test takes only 5 seconds for me locally, even when running in a clean Docker container without a warm Maven cache. Whether the operational issue is due to ACP or some other area of infrastructure I cannot be sure, but I suspect ACP because the test is invoking a second Maven process, and we have seen various ACP complications in the past running Maven within Maven such as jenkinsci/maven-hpi-plugin#541. I have updated the wording of JENKINS-72577 to clarify the above.

No problem: I did not have enough time (nor did the rest of the team) to check in detail.

Reopening the issue is the correct move (thanks for doing it!) as the problem is present on ci.jenkins.io infrastructure: we acknowledge it.

My question was really in case you had already seen things that we missed. Thanks for the explanation.

dduportal avatar Jan 17 '24 17:01 dduportal

Opened https://github.com/jenkinsci/jenkins/pull/9374 as the tests are now around 5s on Linux with JDK21, following 4 months of infrastructure changes (see the PR message).

dduportal avatar Jun 11 '24 09:06 dduportal

Closing the infra issue as we confirmed the problem is now gone.

Besides, https://github.com/jenkinsci/jenkins/pull/9374 has been approved and will be merged, closing https://issues.jenkins.io/browse/JENKINS-72577

dduportal avatar Jun 11 '24 11:06 dduportal