
Ansible request for <AIX> x11 setup

Open sophia-guo opened this issue 4 years ago • 22 comments

Please put the name of the software product (and affected platforms if relevant) in the title of this issue

  • [ ] x11 setup

Details: java/beans/XMLEncoder/* failed on AIX jdk16 with java.awt.AWTError: Can't connect to X11 window server using ':0' as the value of the DISPLAY variable

Details https://github.com/adoptium/aqa-tests/issues/2810

sophia-guo avatar Aug 18 '21 13:08 sophia-guo

@sxa

sophia-guo avatar Aug 18 '21 13:08 sophia-guo

As far as I can see, the log that showed this, at https://ci.adoptopenjdk.net/job/Test_openjdk16_hs_extended.openjdk_ppc64_aix_testList_1/9/consoleFull, says:

23:59:25  + nohup /usr/bin/X11/X -force -vfb -x abx -x dbe -x GLX :0
23:59:25  
23:59:25  Fatal server error:
23:59:25  Cannot establish any listening sockets - Make sure an X server isn't already running

I have logged onto the machine that showed the error, aix71-2, and there is nothing stopping that process from starting up properly. Has it been seen anywhere else, i.e. is it reproducible? Or could this have been a case where the machine had a leftover process, possibly from a previously terminated job, that was stopping it from starting up properly? I seem to be able to start an X -vfb server on that machine without problems.

sxa avatar Aug 18 '21 15:08 sxa

This is a consistent issue and I believe it happens on all AIX machines. test-ibm-aix71-ppc64-1: https://ci.adoptopenjdk.net/job/Test_openjdk16_hs_extended.openjdk_ppc64_aix_testList_0/9/#showFailuresLink

test-osuosl-aix72-ppc64-2: https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/269/

sophia-guo avatar Aug 18 '21 18:08 sophia-guo

> This is a consistent issue and I believe it happens on all AIX machines. test-ibm-aix71-ppc64-1: https://ci.adoptopenjdk.net/job/Test_openjdk16_hs_extended.openjdk_ppc64_aix_testList_0/9/#showFailuresLink

That is the one I mentioned above from four weeks ago. I was interested to know if it had been seen at any other time.

> test-osuosl-aix72-ppc64-2: https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/269/

That was a run from your branch where you explicitly put in an override to set DISPLAY to an incorrect value. You can see from the line above your change that the virtual X server is started on :0, and you're setting your tests to run against a nonexistent :1.

sxa avatar Aug 20 '21 15:08 sxa

aix72-1 had a leftover process from August 6th which was stopping it from starting a new one. That has also now been cleared but we need the test suite modified to be able to handle this situation - it is NOT an infrastructure request for an installation on the machine :-)

sxa avatar Aug 25 '21 12:08 sxa

Rerun with:
test-ibm-aix71-ppc64-1: https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/283/
test-osuosl-aix72-ppc64-2: https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/282/

sophia-guo avatar Aug 26 '21 20:08 sophia-guo

I can see the failure with test-osuosl-aix72-ppc64-2 since July 4th https://ci.adoptopenjdk.net/job/Test_openjdk16_hs_extended.openjdk_ppc64_aix_testList_0/6/testReport/junit/java_beans_XMLEncoder_Test4652928/java/Test4652928/.

build-osuosl-aix71-ppc64-2 passed on https://ci.adoptopenjdk.net/job/Test_openjdk16_hs_extended.openjdk_ppc64_aix_testList_0/8/testReport/java_beans_XMLEncoder_Test4631471/ and failed on https://ci.adoptopenjdk.net/job/Test_openjdk16_hs_extended.openjdk_ppc64_aix_testList_1/9/testReport/junit/java_beans_XMLEncoder_Test4631471/java/Test4631471/

Rerun on build-osuosl-aix71-ppc64-2 https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/284/

sophia-guo avatar Aug 26 '21 21:08 sophia-guo

The issue has occurred on different machines, so it is definitely reproducible. What is the leftover process on aix72-1? Could you confirm whether it is a leftover process created by the openjdk tests? As for Jenkins, DISPLAY is reset when the Jenkins job is done: https://github.com/adoptium/aqa-tests/pull/1835/files

sophia-guo avatar Aug 26 '21 21:08 sophia-guo

Reran test java/beans/XMLEncoder/ on test-ibm-aix71-ppc64-1 and test-osuosl-aix72-ppc64-2; both passed. A second rerun passed too, which means that if there is a leftover process, it is not created by test java/beans/XMLEncoder/. We probably need to find out how the leftover process is created. https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/285/ https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/286/

sophia-guo avatar Aug 27 '21 15:08 sophia-guo

> What is the leftover process on aix72-1? Could you confirm whether it is a leftover process created by the openjdk tests?

It'll be the X -vfb process that the test suite starts up before running anything.

sxa avatar Aug 31 '21 17:08 sxa

@smlambert Has this been discussed in the AQAvit meetings? We'll need to find a way to ensure the X server is terminated at the end of the job, which it may not be at present. Do we have a post-test clean-up phase that we could add this to?

@Haroon-Khel It's possible this was introduced as a result of https://github.com/adoptium/aqa-tests/pull/1835 although that was from over a year ago now, so I wonder if it's possible that the nohup is preventing this from being terminated once the jenkins job ends.

While adding a different port number would probably work around this issue, it will result in process leaks, so I'd be reluctant to implement the changes proposed in https://github.com/adoptium/aqa-tests/pull/2831 for this.

sxa avatar Sep 23 '21 10:09 sxa

In the 'post' stage of a test pipeline, for platforms that use the xvfb plugin (all Linux platforms), the plugin closes/cleans up the process. For AIX, that plugin does not work, so Xvfb is launched manually. I presume https://github.com/adoptium/aqa-tests/pull/2831 is meant both to address the security scan issue of the process running and to clean up the process in the post stage for that platform.

smlambert avatar Sep 23 '21 14:09 smlambert

So to be clear, https://github.com/adoptium/aqa-tests/blob/dce1f080f4e7fb1b69b429982aa62e71f54d2a9d/buildenv/jenkins/JenkinsfileBase#L602 is definitely NOT used on Linux because it's started via the Jenkins plugin?

sxa avatar Sep 23 '21 14:09 sxa

This is the line that invokes/starts the Jenkins xvfb plugin: https://github.com/adoptium/aqa-tests/blob/dce1f080f4e7fb1b69b429982aa62e71f54d2a9d/buildenv/jenkins/JenkinsfileBase#L604

smlambert avatar Sep 23 '21 14:09 smlambert

Gotcha - I hadn't read that syntax as being an invocation of stuff from the plugin. I don't believe that 2831 does anything to address the cleanup; it only attempts to cycle the port number so it doesn't hit any leftover one (which is solving the wrong problem IMHO!). If that post section you reference is executed after each test, would that be a valid place to attempt to kill off the X -vfb process on AIX?

sxa avatar Sep 23 '21 15:09 sxa
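
One way the post-stage cleanup asked about above could look, as a minimal sketch rather than what aqa-tests actually does: the launch step records the server's PID in a file, and a later step reads that file and kills the process. The function names `start_server` and `stop_server` and the `PID_FILE` location are hypothetical; only the X command line comes from the log earlier in this thread.

```shell
# Hypothetical location for the PID file; not taken from aqa-tests.
PID_FILE="${PID_FILE:-${TMPDIR:-/tmp}/x_vfb.pid}"

# Start the given command in the background and remember its PID.
# nohup execs the command, so $! is the PID of the command itself.
start_server() {
    nohup "$@" >/dev/null 2>&1 &
    echo "$!" > "$PID_FILE"
}

# Kill the recorded process, if any, and remove the PID file.
# Safe to call when nothing was started or the process already exited.
stop_server() {
    [ -f "$PID_FILE" ] || return 0
    pid=$(cat "$PID_FILE")
    kill "$pid" 2>/dev/null || true
    rm -f "$PID_FILE"
}
```

Under these assumptions, the test step would call `start_server /usr/bin/X11/X -force -vfb -x abx -x dbe -x GLX :0` and the post stage would call `stop_server`. A PID file is used instead of a shell variable because the two steps may run in separate shells.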

Possible solution in https://github.com/adoptium/aqa-tests/pull/2892, but I think we need to determine if the current code is always leaving the process around or not

sxa avatar Sep 23 '21 16:09 sxa

Hmmm even without that change an aborted job still cleaned up the Xvfb process. I'm tempted to leave this, keep a regular eye on it, and try and see which jobs are causing any such processes to be left behind. We also have the option of trying to re-use any existing Xvfb and not just crashing if it can't launch a second on the same DISPLAY if it's owned by the originating user.

sxa avatar Sep 23 '21 16:09 sxa
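
The re-use option mentioned above could be sketched roughly as follows. This is an illustration under assumptions, not existing aqa-tests code: the hypothetical function `x_server_action` reads process lines in the form `PID PPID USER COMMAND ARGS...` (as produced by `ps -e -o pid,ppid,user,args`) and reports whether a server on the requested display can be re-used, belongs to someone else, or is absent. It assumes the display is the last argument on the X command line, as in the log in this thread.

```shell
# Decide what to do about an X server on a given display.
# Usage: ps -e -o pid,ppid,user,args | x_server_action :0 jenkins
# Prints "reuse PID", "conflict PID", or "start".
x_server_action() {
    want_display="$1"
    want_user="$2"
    awk -v d="$want_display" -v u="$want_user" '
        # Field 4 is the executable; match a binary named X whose
        # last argument is the requested display.
        $4 ~ /(^|\/)X$/ && $NF == d {
            found = 1
            if ($3 == u) print "reuse " $1; else print "conflict " $1
            exit
        }
        END { if (!found) print "start" }
    '
}
```

A caller would only launch a new server on "start", keep the current DISPLAY on "reuse", and fail with a clear message on "conflict" instead of hitting the "Cannot establish any listening sockets" error.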

I'll take a look again. iirc, what I saw is that (usually) the X VFB process stopped itself shortly after the job finished. When it did continue to run it took PID 1 as PPID.

aixtools avatar Oct 05 '21 09:10 aixtools
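
The PPID observation above suggests a simple way to spot leftovers: a virtual X server whose parent is PID 1 has outlived the job that started it. A minimal sketch, assuming input lines in the form `PID PPID USER COMMAND ARGS...`; the function name `orphaned_x_servers` is hypothetical.

```shell
# Print the PIDs of X -vfb servers that have been re-parented to PID 1.
# Usage: ps -e -o pid,ppid,user,args | orphaned_x_servers
orphaned_x_servers() {
    awk '$2 == 1 && $4 ~ /(^|\/)X$/ && /-vfb/ { print $1 }'
}
```

Note that a server started with nohup in the background may be re-parented to PID 1 while its job is still running, so a match is a hint to investigate rather than proof of a leak.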

Just adding a comment - the scans done at OSUOSL are still picking up on port 6000 - so regardless of what has been done (or not done) - the issue is still active (as of 11 October 2021)

I'll go back to my PR and undo the 'generic' code, i.e. choosing a port other than 6000 (https://github.com/adoptium/aqa-tests/pull/2831), and only use the -secIP argument. Hopefully the issue with the scan will then be gone (but not the hanging process).

aixtools avatar Oct 19 '21 06:10 aixtools
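
For context on why the scans report port 6000: an X server listening on TCP uses port 6000 plus its display number, so display :0 maps to 6000 and :1 to 6001. A small sketch of that mapping; the function name `x11_tcp_port` is made up for illustration.

```shell
# Print the TCP port that corresponds to an X display string such as
# ":0", ":1" or "localhost:10.0".
x11_tcp_port() {
    n="${1##*:}"    # drop the optional host part and the colon
    n="${n%%.*}"    # drop the optional ".screen" suffix
    echo $((6000 + $n))
}
```

This is why picking a different display number changes the port a scanner sees, without doing anything about a server process that is left running.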

FYI: about to kill process - but on ojdk05 this has been hanging since October 10th:

root@p9-aix1-ojdk05:[/root]ps auxwww | grep 34275466
jenkins  34275466  0.6  0.0 9596 8188      - A      Oct 10 1792:56 /usr/bin/X11/X -force -vfb -x abx -x dbe -x GLX :0

aixtools avatar Oct 19 '21 15:10 aixtools

https://ci.adoptopenjdk.net/job/Test_openjdk11_hs_extended.openjdk_ppc64_aix_testList_1/45/ on test-osuosl-aix72-ppc64-1

sophia-guo avatar Jul 07 '22 20:07 sophia-guo

a) This issue (Ansible request), as is, can be closed, because the AIX X11 configuration is not the problem.
b) Perhaps a new issue needs to be opened to 'triple' verify there are no other X11 vfb processes running, and/or to adopt my earlier PR that randomizes the port number so that, in principle, multiple runs could be performed.

In any case, this is not related to Ansible playbooks and the issue cannot be resolved via a playbook change.

aixtools avatar Jul 19 '22 10:07 aixtools

The processCheck job should pick up on incidents of the server process being left around, so we should keep an eye on that to see if it occurs. I haven't heard of any issues with this recently though.

sxa avatar Jan 27 '23 13:01 sxa

Closing due to the lack of problems being highlighted recently.

sxa avatar Feb 06 '23 13:02 sxa