infrastructure
Ansible request for <AIX> x11 setup
- [ ] x11 setup
Details:
java/beans/XMLEncoder/* failed on AIX jdk16 with java.awt.AWTError: Can't connect to X11 window server using ':0' as the value of the DISPLAY variable
Details https://github.com/adoptium/aqa-tests/issues/2810
@sxa
As far as I can see, the log that showed this, at https://ci.adoptopenjdk.net/job/Test_openjdk16_hs_extended.openjdk_ppc64_aix_testList_1/9/consoleFull, says:
23:59:25 + nohup /usr/bin/X11/X -force -vfb -x abx -x dbe -x GLX :0
23:59:25
23:59:25 Fatal server error:
23:59:25 Cannot establish any listening sockets - Make sure an X server isn't already running
I have logged onto the machine that showed the error - aix71-2 - and there is nothing stopping that process starting up properly. Has it been seen anywhere else i.e. is it reproducible, or could this have been a case where the machine had a leftover process, possibly from a previously terminated job, that was stopping it from starting up properly? I seem to be able to start an X -vfb server on that machine without problems.
This is a consistent issue and I believe it happens on all AIX machines. test-ibm-aix71-ppc64-1 https://ci.adoptopenjdk.net/job/Test_openjdk16_hs_extended.openjdk_ppc64_aix_testList_0/9/#showFailuresLink
test-osuosl-aix72-ppc64-2 https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/269/
This is a consistent issue and I believe it happens on all AIX machines. test-ibm-aix71-ppc64-1 https://ci.adoptopenjdk.net/job/Test_openjdk16_hs_extended.openjdk_ppc64_aix_testList_0/9/#showFailuresLink
That is the one I mentioned above from four weeks ago - I was interested to know if it had been seen at any other time
test-osuosl-aix72-ppc64-2 https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/269/
That was a run from your branch where you explicitly put in an override to set the DISPLAY to an incorrect value (you can see from the line above your change that the virtual X server is started on :0 and you're setting your tests to run against a non-existent :1).
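A mismatch like this (server on :0, tests pointed at :1) could in principle be caught before the tests start. The following is only a sketch: `display_number` and `same_display` are hypothetical helper names, not anything taken from aqa-tests, and it assumes the script knows which display the virtual X server was started on.

```shell
#!/bin/sh
# Hypothetical helper: extract the display number from a DISPLAY value
# such as ":0", ":1.0" or "host:2.0".
display_number() {
    value=${1##*:}                 # drop everything up to the last colon (host part)
    printf '%s\n' "${value%%.*}"   # drop the optional ".screen" suffix
}

# Hypothetical guard: succeed only if both values name the same display.
same_display() {
    [ "$(display_number "$1")" = "$(display_number "$2")" ]
}

# Example: the server was started on :0 but the tests were pointed at :1.
if ! same_display ":0" ":1"; then
    echo "DISPLAY mismatch: tests would not reach the virtual X server"
fi
```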
aix72-1 had a leftover process from August 6th which was stopping it from starting a new one. That has also now been cleared but we need the test suite modified to be able to handle this situation - it is NOT an infrastructure request for an installation on the machine :-)
Rerun with test-ibm-aix71-ppc64-1: https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/283/ test-osuosl-aix72-ppc64-2: https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/282/
I can see the failure with test-osuosl-aix72-ppc64-2 since July 4th https://ci.adoptopenjdk.net/job/Test_openjdk16_hs_extended.openjdk_ppc64_aix_testList_0/6/testReport/junit/java_beans_XMLEncoder_Test4652928/java/Test4652928/.
build-osuosl-aix71-ppc64-2 passed on https://ci.adoptopenjdk.net/job/Test_openjdk16_hs_extended.openjdk_ppc64_aix_testList_0/8/testReport/java_beans_XMLEncoder_Test4631471/ and failed on https://ci.adoptopenjdk.net/job/Test_openjdk16_hs_extended.openjdk_ppc64_aix_testList_1/9/testReport/junit/java_beans_XMLEncoder_Test4631471/java/Test4631471/
Rerun on build-osuosl-aix71-ppc64-2 https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/284/
The issue happened on different machines, so it is definitely reproducible. What is the leftover process on aix72-1, could you confirm if it is a leftover process created by openjdk tests? As for Jenkins, DISPLAY is reset when the Jenkins job is done: https://github.com/adoptium/aqa-tests/pull/1835/files
Reran test java/beans/XMLEncoder/ on test-ibm-aix71-ppc64-1 and test-osuosl-aix72-ppc64-2; both passed. A second rerun passed too, which means that if there is a leftover process, it is not created by test java/beans/XMLEncoder/. We probably need to know how the leftover process is created. https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/285/ https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/286/
What is the leftover process on aix72-1, could you confirm if it is a leftover process created by openjdk tests?
It'll be the X -vfb process that the test suite starts up before running anything.
@smlambert Has this been discussed in the AQAvit meetings? We'll need to find a way to ensure the X server is terminated at the end of the job, which it may not be at present. Do we have a post-test clean-up phase that we could add this to?
@Haroon-Khel It's possible this was introduced as a result of https://github.com/adoptium/aqa-tests/pull/1835 although that was from over a year ago now, so I wonder if it's possible that the nohup is preventing this from being terminated once the jenkins job ends.
While adding a different port number would probably work around this issue it will result in process leaks so I'd be reluctant to implement the changes proposed in https://github.com/adoptium/aqa-tests/pull/2831 for this
In the 'post' stage of a test pipeline, for platforms that use the xvfb plugin (all linux platforms), the plugin closes/cleans up the process. For AIX, that plugin does not work, so Xvfb is manually launched and I presume https://github.com/adoptium/aqa-tests/pull/2831 is meant to both address the security scan issue of the process running, but also clean up the process in the post stage for that platform.
So to be clear, https://github.com/adoptium/aqa-tests/blob/dce1f080f4e7fb1b69b429982aa62e71f54d2a9d/buildenv/jenkins/JenkinsfileBase#L602 is definitely NOT used on Linux because it's started via the Jenkins plugin?
This is the line that invokes/starts the Jenkins xvfb plugin: https://github.com/adoptium/aqa-tests/blob/dce1f080f4e7fb1b69b429982aa62e71f54d2a9d/buildenv/jenkins/JenkinsfileBase#L604
Gotcha - I hadn't read that syntax as being an invocation of stuff from the plugin. I don't believe that 2821 does anything to address the cleanup, only attempt to cycle the port number so it doesn't hit any leftover one (which is solving the wrong problem IMHO!)
If that post section you reference is executed after each test, would that be a valid place to attempt to kill off the X -vfb process on AIX?
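For illustration, a clean-up step of that kind could record the server's PID when it is launched and signal it afterwards. This is a minimal sketch, not the aqa-tests implementation: `start_vfb`, `stop_vfb`, `VFB_CMD` and `VFB_PIDFILE` are hypothetical names, the default command is simply the one shown in the console log above, and whether the JenkinsfileBase post section is the right hook is still the open question here.

```shell
#!/bin/sh
# Assumptions: VFB_CMD and VFB_PIDFILE are illustrative names; the default
# command is copied from the console log earlier in this thread.
VFB_CMD=${VFB_CMD:-"/usr/bin/X11/X -force -vfb -x abx -x dbe -x GLX :0"}
VFB_PIDFILE=${VFB_PIDFILE:-"${TMPDIR:-/tmp}/vfb.$$.pid"}

# Start the server in the background and remember its PID.
start_vfb() {
    # $VFB_CMD is deliberately unquoted so that it splits into arguments.
    nohup $VFB_CMD >/dev/null 2>&1 &
    echo $! > "$VFB_PIDFILE"
}

# Stop the server recorded in the PID file, if it is still running.
stop_vfb() {
    [ -f "$VFB_PIDFILE" ] || return 0
    pid=$(cat "$VFB_PIDFILE")
    if kill -0 "$pid" 2>/dev/null; then
        kill "$pid" 2>/dev/null
        # Reap the process if it is a child of this shell; harmless otherwise.
        wait "$pid" 2>/dev/null
    fi
    rm -f "$VFB_PIDFILE"
}

# Typical use in a job script:
#   start_vfb
#   trap stop_vfb EXIT
#   ... run the tests ...
```

Because the PID is written to a file, a separate post step could also call `stop_vfb` with the same `VFB_PIDFILE`, provided it runs on the same machine as the launch step.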
Possible solution in https://github.com/adoptium/aqa-tests/pull/2892, but I think we need to determine if the current code is always leaving the process around or not
Hmmm even without that change an aborted job still cleaned up the Xvfb process. I'm tempted to leave this, keep a regular eye on it, and try and see which jobs are causing any such processes to be left behind. We also have the option of trying to re-use any existing Xvfb and not just crashing if it can't launch a second on the same DISPLAY if it's owned by the originating user.
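The re-use option could be decided from a process listing. The sketch below is an assumption-laden illustration: `vfb_owner` and `decide_vfb_action` are hypothetical names, the input is expected in `ps -eo user,pid,args` column order, and the sample line is adapted from a process listing seen on one of the AIX machines.

```shell
#!/bin/sh
# Reads `ps -eo user,pid,args` style lines on stdin and prints the owner of
# the first process that has a -vfb argument and ends with the given display.
vfb_owner() {
    awk -v d="$1" '
        {
            has_vfb = 0
            for (i = 3; i <= NF; i++) if ($i == "-vfb") has_vfb = 1
            if (has_vfb && $NF == d) { print $1; exit }
        }'
}

# Prints "start" if no server is on the display, "reuse" if one exists and
# belongs to the given user, and "conflict" if it belongs to someone else.
decide_vfb_action() {
    owner=$(vfb_owner "$1")
    if [ -z "$owner" ]; then
        echo start
    elif [ "$owner" = "$2" ]; then
        echo reuse
    else
        echo conflict
    fi
}

# Sample data only; on a real machine the input would come from:
#   ps -eo user,pid,args | decide_vfb_action ":0" "$(id -un)"
sample='jenkins 34275466 /usr/bin/X11/X -force -vfb -x abx -x dbe -x GLX :0'
printf '%s\n' "$sample" | decide_vfb_action ":0" jenkins   # prints "reuse"
```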
I'll take a look again. iirc, what I saw is that (usually) the X VFB process stopped itself shortly after the job finished. When it did continue to run it took PID 1 as PPID.
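That PPID observation suggests a simple way to spot leftovers: look for -vfb servers whose parent is PID 1. A rough sketch, assuming input in `ps -eo pid,ppid,args` column order; `orphaned_vfb_pids` is a hypothetical name and the second sample process is made up for contrast.

```shell
#!/bin/sh
# Reads `ps -eo pid,ppid,args` lines on stdin and prints the PID of every
# process whose parent is PID 1 and whose arguments include -vfb.
orphaned_vfb_pids() {
    awk '$2 == 1 {
        for (i = 3; i <= NF; i++) if ($i == "-vfb") { print $1; break }
    }'
}

# Sample data only; on a real machine: ps -eo pid,ppid,args | orphaned_vfb_pids
printf '%s\n' \
    '     PID     PPID COMMAND' \
    '34275466        1 /usr/bin/X11/X -force -vfb -x abx -x dbe -x GLX :0' \
    '    4242     4000 /usr/bin/X11/X -force -vfb -x abx -x dbe -x GLX :1' |
    orphaned_vfb_pids    # prints 34275466
```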
Just adding a comment - the scans done at OSUOSL are still picking up on port 6000 - so regardless of what has been done (or not done) - the issue is still active (as of 11 October 2021)
I'll go back to my PR - and undo the 'generic' code - ie, choosing a port other than 6000 (https://github.com/adoptium/aqa-tests/pull/2831) - and only use the -secIP argument - and hopefully, the issue with the scan is gone (but not the hanging process).
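For context on why the scans report port 6000: by X11 convention, display N listens on TCP port 6000 + N, so a server on :0 shows up on 6000. A small sketch of that arithmetic, assuming the AIX X server follows the usual convention; `x11_tcp_port` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: map a DISPLAY value such as ":0" or "host:1.0"
# to the conventional X11 TCP port (6000 + display number).
x11_tcp_port() {
    n=${1##*:}      # strip the host part
    n=${n%%.*}      # strip the optional screen number
    echo $((6000 + n))
}

x11_tcp_port ":0"   # prints 6000
```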
FYI: about to kill process - but on ojdk05 this has been hanging since October 10th:
root@p9-aix1-ojdk05:[/root]ps auxwww | grep 34275466
jenkins 34275466 0.6 0.0 9596 8188 - A Oct 10 1792:56 /usr/bin/X11/X -force -vfb -x abx -x dbe -x GLX :0
https://ci.adoptopenjdk.net/job/Test_openjdk11_hs_extended.openjdk_ppc64_aix_testList_1/45/ on test-osuosl-aix72-ppc64-1
a) This issue (Ansible request), as is, can be closed, because the AIX X11 configuration is not the problem. b) Perhaps a new issue needs to be opened to 'triple' verify there are no other X11 vfb processes running, and/or to adopt my earlier PR that randomizes the port number so that in principle multiple runs could be performed.
In any case - this is not related to ansible playbooks and the issue cannot be resolved via a playbook change.
The processCheck job should pick up on incidents of the server process being left around so we should try and keep an eye on that to see if it occurs. I haven't heard of any issues with this recently though.
Closing due to the lack of problems being highlighted recently.