RISC-V build plan

Open luhenry opened this issue 1 year ago • 9 comments

In our current building and testing of Temurin on RISC-V, the main limiting factor to get to GA is the limited access to RISC-V boards. We are hoping to have access to more in the future, but even the ones we have access to today are slow or not available in enough quantity.

To alleviate the pressure on the pool of RISC-V boards we have, I am exploring building Temurin on QEMU on an aarch64/x86 host. The first successful run can be found at https://ci.adoptium.net/job/build-scripts/job/jobs/job/evaluation/job/jobs/job/jdk17u/job/jdk17u-evaluation-linux-riscv64-temurin/34/.

This work relies on the following PRs:

  • [x] https://github.com/adoptium/ci-jenkins-pipelines/pull/836
  • [x] https://github.com/adoptium/ci-jenkins-pipelines/pull/864
  • [x] https://github.com/adoptium/temurin-build/pull/3590
  • [x] https://github.com/adoptium/ci-jenkins-pipelines/pull/867
  • [x] https://github.com/adoptium/aqa-tests/pull/4931
  • [ ] https://github.com/adoptium/temurin-build/issues/3911
  • [ ] https://github.com/adoptium/infrastructure/issues/3634

luhenry avatar Dec 20 '23 15:12 luhenry

the main limiting factor to get to GA is the limited access to RISC-V boards

Hmmm, just to give my view on this, I would somewhat dispute that assertion as it's written. We do have over a dozen boards of various types in the CI, and while they may not be highly performant, they are generally capable of producing useful test output, so I don't think that's directly blocking progress. To my mind, the real limiting factors which I've seen while working on this are:

  • Whether anyone is actively looking at any test results and working to get them to green, since the gating factor for a Temurin GA is having passing AQA and TCK runs. (I'd suggest that the most important priorities would be sanity.openjdk 17 21 22 and extended.openjdk 17 21 22.)
  • Making sure the build and test runs are scheduled somewhere on a regular basis (Generally that will be weekly for the ones we're interested in - this PR should re-enable it for JDK21)
  • Making sure the build image has the appropriate boot JDKs available in it, or that they are downloadable during the build (build issue 3378 is currently preventing the boot JDK download from succeeding I believe when it's not on the machines)
  • Understanding which openjdk releases are the priority (I'd have said 21 as the latest LTS, although I know you've been working a lot with 17 recently Ludovic)
  • Try to ensure that we do not end up blocked with the jobs trying to fire up a dynamic agent, which our CI cannot currently do (as per what you've seen earlier in this thread). I've got a few of the unmatched boards started up again so there is an OK capacity there, but I need to clear up some of the obsolete 2104 agent definitions under ci.role.test&&hw.arch.riscv. We could certainly see if we can set up a way to use your new machine to spin them up if we're at capacity, as per your aqa-tests PR, but we need to make sure it doesn't break the existing users of that outside Adoptium's CI.
  • Some of the unmatched boards in the CI are still having slowness in their network transfer, causing timeouts. We can look at increasing the timeout, but I've also alerted PLCTlab in the last week, since it's inconsistent but hopefully something they can resolve on their side. The transfers do seem to proceed though, so I'm tempted to increase the test timeout for the copyArtifacts stage for this platform.
  • I think we've also still got an issue with the core counts not being correctly detected, so the riscv64 test jobs typically use concurrency:1, which makes them slower than they should be (this means the sanity.openjdk jobs typically hit the default 10 hours timeout). I'd been experimenting with this branch of aqa-tests, which has given us some results for sanity.openjdk, although that's currently failing with being unable to find the correct jtreg version on our dependencies job.
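To illustrate the core-count point in the last bullet, here is a minimal Python sketch of how a concurrency value could be derived from the number of usable cores, falling back to 1 only when detection fails. This is not the aqa-tests logic: the function names `detected_cores` and `pick_concurrency`, and the policy of reserving one core, are assumptions made purely for illustration.

```python
import os


def detected_cores():
    """Return the number of CPUs usable by this process, or None if unknown."""
    try:
        # Available on Linux: honours the CPU affinity mask of the process.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Other platforms: total CPU count, which may be None.
        return os.cpu_count()


def pick_concurrency(cores, reserve=1, floor=1):
    """Choose a test concurrency from a detected core count.

    Hypothetical policy: leave `reserve` cores free for the harness and
    never go below `floor`. A failed detection (None or < 1) yields `floor`.
    """
    if not cores or cores < 1:
        return floor
    return max(floor, cores - reserve)


# Print the value in the same form as the setting mentioned in the thread.
print(f"concurrency:{pick_concurrency(detected_cores())}")
```

With this policy a four-core board would run with a concurrency of 3, whereas a failed detection would reproduce the concurrency:1 behaviour described above.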

Noting that on the first point, TRSS can help with the test analysis, although that has a prerequisite of the builds being scheduled regularly via jobs such as https://ci.adoptium.net/job/build-scripts/job/evaluation-openjdk21-pipeline/ (this should be fixed as per the PR in the second bullet; note that's not currently publicly visible, but we should fix that). If it's useful, we could potentially also have a tab on the ci.adoptium.net page for RISC-V which shows just the build and test jobs for that platform, to make it easy to find the important ones to look at.

Obviously, proving it can pass in an RVV1.0 environment is highly desirable too (and a reasonable goal which could be solved with static docker containers or the dynamic ones from the second last bullet point) but if it doesn't fully pass anywhere we've got a bigger problem to solve :slightly_smiling_face:

sxa avatar Dec 20 '23 17:12 sxa

AIs from offline discussion:

  • [x] @luhenry Provide bootjdk for jdk17u from https://ci.adoptium.net/job/build-scripts/job/jobs/job/evaluation/job/jobs/job/jdk17u/job/jdk17u-evaluation-linux-riscv64-temurin/34/ to docker image
    • https://adoptium.slack.com/archives/C016JNC6SDU/p1703168726356649
    • https://github.com/adoptium/infrastructure/pull/3308
  • [x] @luhenry Provide bootjdk for jdk21u from https://api.adoptium.net/v3/binary/version/jdk-21.0.1+12.1-ea-beta/linux/riscv64/jdk/hotspot/normal/adoptium to docker image
    • WIP, will open a PR once I've verified the docker image builds properly locally
    • https://github.com/adoptium/infrastructure/pull/3308
    • https://github.com/adoptium/ci-jenkins-pipelines/pull/869
  • [x] @luhenry Look at test results from (sanity.openjdk 17 21 22 and extended.openjdk 17 21 22) + system/functional/perf + TCK ones
  • [x] @luhenry Build in headless mode (change needed in temurin-build)
    • https://github.com/adoptium/ci-jenkins-pipelines/pull/867
    • https://github.com/adoptium/aqa-tests/pull/4935
  • [x] @sxa Disable dynamic pool for testing, and queue test jobs for execution on boards
    • https://github.com/adoptium/aqa-tests/pull/4931
  • [x] @sxa Integrate https://github.com/adoptium/aqa-tests/compare/master...sxa:aqa-tests:use_more_cores
    • https://github.com/adoptium/aqa-tests/pull/4933
  • [x] @luhenry @sxa Schedule regular builds for 17, 21, tip (and 22)
    • https://github.com/adoptium/ci-jenkins-pipelines/pull/874
  • [ ] [stretch] @sxa Experiment with emulated containers with RVV for testing
  • [x] [stretch] @luhenry explore procuring a 7xLicheePi4a cluster

Notes:

  • Priorities for Rivos are 17, 21, tip. 11 when merged upstream
  • We want regular (weekly at first, ideally daily) build+tests on public CI, and weekly build+tests on private/TCK CI

luhenry avatar Dec 21 '23 12:12 luhenry

Provide bootjdk for jdk21u from TBD to docker image

For this, the best one to use is the one described in https://fosstodon.org/@sxa/111449356957539294, which was built in a way that will run in a container (it still needs --security-opt seccomp=unconfined) and on Ubuntu 20.04: https://api.adoptium.net/v3/binary/version/jdk-21.0.1+12.1-ea-beta/linux/riscv64/jdk/hotspot/normal/adoptium. That should be good as a bootstrap for JDK22 as well (the first build of that should happen on the next new tag in there, which is likely to be later today).
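For anyone scripting the boot JDK download into an image, the URL above can be assembled from its path segments. The following is only a sketch: the helper name `binary_url` and the parameter names are labels inferred from the segments of the URL quoted in this thread, not something confirmed here against the API documentation.

```python
from urllib.parse import quote

API_ROOT = "https://api.adoptium.net/v3/binary/version"


def binary_url(release_name, os_name="linux", arch="riscv64",
               image_type="jdk", jvm_impl="hotspot",
               heap_size="normal", vendor="adoptium"):
    """Build a binary download URL from its path segments.

    Parameter names are assumptions based on the segment order of the
    URL quoted in this thread.
    """
    parts = [release_name, os_name, arch, image_type,
             jvm_impl, heap_size, vendor]
    # Keep '+' literal so the result matches the form used in the thread.
    return API_ROOT + "/" + "/".join(quote(p, safe="+") for p in parts)


url = binary_url("jdk-21.0.1+12.1-ea-beta")
print(url)
```

Keeping the release name as the only required argument makes it easy to point the same image build at a different boot JDK tag later.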

The other thing, which we didn't explicitly talk about, was that we'll need https://github.com/adoptium/temurin-build/issues/3378 fixed to be able to build the main jdk (jdk23 now) repository unless we also put a JDK22 into the image.

sxa avatar Dec 21 '23 12:12 sxa

I've created a RISC-V view at https://ci.adoptium.net/view/RISC-V/ as a convenient way of viewing the jobs we're interested in for the purposes of this so we can see how many of them are having problems

sxa avatar Dec 21 '23 14:12 sxa

Verified that 22 and 23 are now being triggered along with the other platforms but 23 is failing (as expected) due to the dirmngr error (third bullet point in the big list above)

sxa avatar Dec 22 '23 13:12 sxa

New docker build image with the updated JDKs is being pushed as I write this :-)

sxa avatar Dec 22 '23 13:12 sxa

I'm doing a full run on jdk21u at https://ci.adoptium.net/job/build-scripts/job/jobs/job/evaluation/job/jobs/job/jdk21u/job/jdk21u-evaluation-linux-riscv64-temurin/112/console and collecting the test failures into https://github.com/adoptium/aqa-tests/issues/4976.

I'll do a full run on jdk17u next and collect the test failures into https://github.com/adoptium/aqa-tests/issues/4976 as well.

luhenry avatar Jan 16 '24 18:01 luhenry

jdk17u pipeline:

sxa avatar May 10 '24 14:05 sxa

Noting that jdk11u is currently failing as the regular pipelines are building from tags which are not valid for that repository as the tags do not include the changes from the riscv-port branch. Covered in https://github.com/adoptium/temurin-build/issues/3911

sxa avatar Aug 14 '24 14:08 sxa