Some s390x machines failing net tests with NoRouteToHostException

Open smlambert opened this issue 2 years ago • 21 comments

As described in https://github.com/adoptium/aqa-tests/issues/4039#issuecomment-1286415061

To make it easy for the infrastructure team to repeat and diagnose, please answer the following questions:

  • test suite/name (e.g., BUILD_LIST=openjdk, TARGET=jdk_net and other similar ones, resulting in 70+ testcases failing)?
  • a link into recent Test_ job on https://ci.adoptopenjdk.net which showed the failure: https://ci.adoptopenjdk.net/job/Grinder/5897/
  • Hyperlink to re-run in Grinder: rerun link
  • Is there an existing issue elsewhere covering this? No, it is being seen as part of October release triage in https://github.com/adoptium/aqa-tests/issues/4039
  • Which machine(s) does it work on?
  • Which machine(s) does it fail on? https://ci.adoptopenjdk.net/computer/test-marist-sles15-s390x-2

Any other details:

smlambert avatar Nov 02 '22 13:11 smlambert

Yes and I have some reruns on various other machines going now as part of triage efforts, and will update the issue once results are in.

NoRouteToHostExceptions also seen on test-marist-sles15-s390x-2 - see https://ci.adoptopenjdk.net/job/Grinder/6086/

Those types of exceptions are not seen on test-marist-ubuntu2204-s390x-1, but other problems on that machine... issues appear to be mainly related to tests using multicast addresses. https://ci.adoptopenjdk.net/job/Grinder/6087/testReport/

smlambert avatar Nov 02 '22 16:11 smlambert

issues appear to be mainly related to tests using multicast addresses. https://ci.adoptopenjdk.net/job/Grinder/6087/testReport/

I have a firewall configuration fix I can try on there. Based on past experience, this only occurs on the new Marist machines we've got, and the rule below allows multicast to work. I've applied it on the ubuntu2204-s390x-1 machine referred to above and am regrinding at https://ci.adoptopenjdk.net/job/Grinder/6103/ to test:

iptables -I INPUT -m pkttype --pkt-type multicast -j ACCEPT

[EDIT: This has resolved the problem - everything in java_net passed, although https://ci.adoptopenjdk.net/job/Grinder/6103/testReport/tools_jlink_JLinkReproducibleTest/java/JLinkReproducibleTest/ failed which is not likely to be related to this issue]
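
For anyone wanting to check a machine without running the whole java_net target, here is a minimal sketch of a multicast self-test: it joins a group, sends one datagram to it, and waits to receive it back. The class name, the group address 239.255.0.1 and the timeout are assumptions chosen for illustration; they are not taken from the failing tests, and it is only an assumption that a dropped multicast packet shows up here the same way it does in those tests.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.NetworkInterface;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Enumeration;

// Hypothetical helper, not part of the test suite.
class MulticastLoopbackCheck {

    /** Sends one datagram to the group and waits for it to come back; never throws. */
    static String check(String group, int timeoutMillis) {
        try {
            InetAddress groupAddress = InetAddress.getByName(group);

            // Pick the first interface that is up, multicast-capable and not loopback.
            NetworkInterface nif = null;
            Enumeration<NetworkInterface> all = NetworkInterface.getNetworkInterfaces();
            if (all != null) {
                for (NetworkInterface candidate : Collections.list(all)) {
                    if (candidate.isUp() && candidate.supportsMulticast() && !candidate.isLoopback()) {
                        nif = candidate;
                        break;
                    }
                }
            }
            if (nif == null) {
                return "SKIPPED: no multicast-capable interface found";
            }

            // Port 0 lets the OS choose a free port.
            try (MulticastSocket socket = new MulticastSocket(0)) {
                socket.setNetworkInterface(nif);
                socket.setSoTimeout(timeoutMillis);
                InetSocketAddress target = new InetSocketAddress(groupAddress, socket.getLocalPort());
                socket.joinGroup(target, nif);

                byte[] payload = "ping".getBytes(StandardCharsets.US_ASCII);
                socket.send(new DatagramPacket(payload, payload.length, target));

                DatagramPacket received = new DatagramPacket(new byte[16], 16);
                socket.receive(received); // times out if the packet is dropped
                socket.leaveGroup(target, nif);
                return "OK: multicast datagram received on " + nif.getName();
            }
        } catch (SocketTimeoutException e) {
            return "TIMEOUT: nothing received (a firewall may be dropping multicast)";
        } catch (IOException e) {
            return "FAILED: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(check("239.255.0.1", 3000));
    }
}
```

Running it before and after adding the iptables rule should show whether the rule changes the outcome on a given machine. A TIMEOUT result is only a hint, since other causes (interface choice, IPv4/IPv6 preference) can produce it too.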

sxa avatar Nov 02 '22 23:11 sxa

Yes and I have some reruns on various other machines going now as part of triage efforts, and will update the issue once results are in.

NoRouteToHostExceptions also seen on test-marist-sles15-s390x-2 - see https://ci.adoptopenjdk.net/job/Grinder/6086/

Those types of exceptions are not seen on test-marist-ubuntu2204-s390x-1, but other problems on that machine... issues appear to be mainly related to tests using multicast addresses. https://ci.adoptopenjdk.net/job/Grinder/6087/testReport/

I'm going to re-grind that one after removing a somewhat rogue entry in /etc/hosts - I don't /think/ it will have made all those fail, but we'll see - depends exactly what the tests are doing in terms of reverse host lookups ... https://ci.adoptopenjdk.net/job/Grinder/6104/ - If not, it's going to need someone to do some more low-level debugging.

[EDIT: As expected no real change - [https://ci.adoptopenjdk.net/job/Grinder/6086/testReport/java_net_httpclient_http2_TLSConnection/java/TLSConnection/] passed in the new run, but that may have just been luck]
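
To see what a rogue /etc/hosts entry does from the JVM's point of view, a small sketch like the following can help. It prints what the local host name resolves to and whether that address is actually assigned to a local interface. The class and method names are hypothetical, and this only approximates the lookups the tests may perform.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;

// Hypothetical helper, not part of the test suite.
class LocalHostResolutionCheck {

    /** Forward lookup of the local host name plus a check that the address is local; never throws. */
    static String describe() {
        try {
            InetAddress local = InetAddress.getLocalHost();
            // Returns null when no local interface carries this address,
            // which is what a stale or wrong hosts entry would produce.
            NetworkInterface nif = NetworkInterface.getByInetAddress(local);
            String bound = (nif == null)
                    ? "NOT assigned to any local interface"
                    : "assigned to " + nif.getName();
            return local.getHostName() + " -> " + local.getHostAddress() + " (" + bound + ")";
        } catch (UnknownHostException e) {
            return "local host name does not resolve: " + e.getMessage();
        } catch (SocketException e) {
            return "could not inspect network interfaces: " + e.getMessage();
        }
    }

    /** Reverse lookup of the local address; may be slow if DNS is unreachable. */
    static String reverseName() {
        try {
            return InetAddress.getLocalHost().getCanonicalHostName();
        } catch (UnknownHostException e) {
            return "unresolved (" + e.getMessage() + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe());
        System.out.println("reverse lookup: " + reverseName());
    }
}
```

If the address is reported as not assigned to any local interface, any test that connects to InetAddress.getLocalHost() would be talking to some other host, or to nothing at all.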

sxa avatar Nov 02 '22 23:11 sxa

Yes and I have some reruns on various other machines going now as part of triage efforts, and will update the issue once results are in. NoRouteToHostExceptions also seen on test-marist-sles15-s390x-2 - see https://ci.adoptopenjdk.net/job/Grinder/6086/ Those types of exceptions are not seen on test-marist-ubuntu2204-s390x-1, but other problems on that machine... issues appear to be mainly related to tests using multicast addresses. https://ci.adoptopenjdk.net/job/Grinder/6087/testReport/

I'm going to re-grind that one after removing a somewhat rogue entry in /etc/hosts - I don't /think/ it will have made all those fail, but we'll see - depends exactly what the tests are doing in terms of reverse host lookups ... https://ci.adoptopenjdk.net/job/Grinder/6104/ - If not, it's going to need someone to do some more low-level debugging.

seems still failing on java_net

zdtsw avatar Nov 03 '22 09:11 zdtsw

| Node | Grinder link | Predominant type of failure |
| --- | --- | --- |
| test-marist-rhel7-s390x-2 | Grinder/6108 | NoRouteToHostException |
| test-marist-rhel8-s390x-2 | Grinder/6110 | NoRouteToHostException |
| test-marist-sles12-s390x-2 | Grinder/6112 | NoRouteToHostException |
| test-marist-sles15-s390x-1 | -- | offline |
| test-marist-sles15-s390x-2 | Grinder/6113 | NoRouteToHostException |
| test-marist-ubuntu1604-s390x-1 | Grinder/6102 | offline |
| test-marist-ubuntu1804-s390x-1 | -- | offline |
| test-marist-ubuntu1804-s390x-2 | -- | offline |
| test-marist-ubuntu1804-s390x-3 | -- | offline |
| test-marist-ubuntu1804-s390x-4 | Grinder/6101 | offline |
| test-marist-ubuntu2004-s390x-1 | Grinder/6111 | |
| test-marist-ubuntu2204-s390x-1 | Grinder/6103 | after fixes applied re: https://github.com/adoptium/infrastructure/issues/2807#issuecomment-1301484486, only JLinkReproducibleTest fails, which is a problematic testcase that should get excluded (JDK-8217166) |

smlambert avatar Nov 06 '22 14:11 smlambert

The above analysis suggests that we can resolve a lot of the issues on the RHEL/SLES systems by performing a similar firewall fix to assist the multicast packets to get through. It will be interesting to see how many other problems remain after doing that.

Bear in mind that many of the offline machines are the older ones which were replaced during September as part of the Marist machine migration which we have done, so that is expected (They've been offline in jenkins for a while, but now need to be fully removed)

sxa avatar Nov 07 '22 11:11 sxa

Of note is that they do not appear as "offline"; https://ci.adoptopenjdk.net/label/hw.arch.s390x&&ci.role.test/ shows: [screenshot: Screen Shot 2022-11-07 at 11 03 17 AM]

where I would have expected to see the red X as with some other offline nodes: [screenshot: Screen Shot 2022-11-07 at 11 03 58 AM]

smlambert avatar Nov 07 '22 16:11 smlambert

Re-runs on RHEL/SLES systems after adding the same iptables rule:

| Node | Grinder link | Predominant type of failure |
| --- | --- | --- |
| test-marist-rhel7-s390x-2 | Grinder/6118 | 110 failures NoRouteToHost/Timeouts |
| test-marist-rhel8-s390x-2 | Grinder/6122 | 1 failure - only JLinkReproducibleTest |
| test-marist-sles12-s390x-2 | Grinder/6117 | 117 failures |
| test-marist-sles15-s390x-2 | Grinder/6120 | 114 failures |

sxa avatar Nov 08 '22 15:11 sxa

(Comment removed as it was supposed to be in https://github.com/adoptium/infrastructure/issues/2820)

sxa avatar Nov 15 '22 15:11 sxa

Tests running on test-marist-sles12-s390x-2 ( https://ci.adoptium.net/job/Grinder/7123/ ) with above IPTables fix. Tests do not pass, machine temporarily disabled.

steelhead31 avatar Apr 11 '23 14:04 steelhead31

test-marist-sles12-s390x-2 still a problem: https://ci.adoptium.net/job/Test_openjdk17_hs_extended.openjdk_s390x_linux_testList_0/86/

andrew-m-leonard avatar Apr 13 '23 08:04 andrew-m-leonard

@andrew-m-leonard 

test-marist-sles12-s390x-2 still a problem: https://ci.adoptium.net/job/Test_openjdk17_hs_extended.openjdk_s390x_linux_testList_0/86/

I had a look at this machine earlier this week, and disabled it... the network issues are fairly extensive, and the workaround doesn't appear to work, so it needs some more investigation and work.

steelhead31 avatar Apr 13 '23 09:04 steelhead31

Tests running on test-marist-sles12-s390x-2 ( https://ci.adoptium.net/job/Grinder/7123/ ) with above IPTables fix. Tests do not pass, machine temporarily disabled.

Labelling with systemdown on this basis

sxa avatar Apr 13 '23 09:04 sxa

@steelhead31 Is this something that will be fixed by the hostname changes you're putting in place or is it a separate routing problem?

sxa avatar Jun 01 '23 13:06 sxa

This is a separate routing problem. It's unlikely my PR will fix this, but if we manage to find a suitable fix, it should persist thanks to my changes.

steelhead31 avatar Jun 01 '23 14:06 steelhead31

@steelhead31 Can you take a look at this and see what would be suitable as some "next steps" to move this one forward please? It does look from the table in an earlier comment as though we may have failures almost everywhere (although I wonder what's happening in the docker containers...)

sxa avatar Jul 12 '23 11:07 sxa

Have narrowed down the error to this piece of code..

   try {
     csoc = new Socket(InetAddress.getLocalHost(), port);
   } catch(Exception e) {
     System.err.println("Failed. Unexpected exception:" + e);
     throw e;
   }
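
A standalone reproduction along these lines might make it easier to test a machine directly. This is a minimal sketch, assuming the failure comes from connecting to whatever InetAddress.getLocalHost() returns; the class name, the listening ServerSocket and the timeout are additions for illustration and are not taken from the failing test.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NoRouteToHostException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.UnknownHostException;

// Hypothetical helper, not part of the test suite.
class LocalHostConnectCheck {

    /** Tries to connect to the local host address on a freshly opened port; never throws. */
    static String check(int timeoutMillis) {
        InetAddress local;
        try {
            // Same lookup as in the snippet above.
            local = InetAddress.getLocalHost();
        } catch (UnknownHostException e) {
            return "UNRESOLVED: " + e.getMessage();
        }

        // Port 0 asks the OS for a free port; the server listens on all local addresses.
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket()) {
            client.connect(new InetSocketAddress(local, server.getLocalPort()), timeoutMillis);
            return "OK: connected to " + local;
        } catch (NoRouteToHostException e) {
            return "NO_ROUTE: " + local + " (" + e.getMessage() + ")";
        } catch (IOException e) {
            return "FAILED: " + local + " (" + e + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(check(5000));
    }
}
```

A NO_ROUTE result on an affected machine would confirm that the problem is reproducible outside the test harness, which should make it easier to compare hosts and docker containers.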

steelhead31 avatar Jul 17 '23 10:07 steelhead31

Update: This issue (or something like it) is still seen.

https://ci.adoptium.net/job/Test_openjdk11_hs_extended.openjdk_s390x_linux/140/

e.g. on https://ci.adoptium.net/computer/test-marist-sles12-s390x-2

[2023-10-08T07:59:44.878Z] Running test jdk_rmi_1 ...
...
[2023-10-08T07:30:30.158Z] java.lang.RuntimeException: java.rmi.ConnectIOException: Exception creating connection to: 148.100.74.193; nested exception is: 
[2023-10-08T07:30:30.158Z] 	java.net.NoRouteToHostException: No route to host (Host unreachable)

After a number of NoRouteToHostExceptions in other targets, the jdk_jfr_1 target appears to cause the entire job to fail, and I'm guessing it's related to this issue.

Have the other jobs associated with this issue failed as well? As in non-"unsafe" failed. Jenkins red job failed.

adamfarley avatar Oct 17 '23 15:10 adamfarley

Let's check if the outstanding problems are only on the SLES12 systems and whether they also occur in the docker SLES12 images that we have.

sxa avatar Nov 02 '23 10:11 sxa

March JDK22 release activities: in Grinder/9226, 50 compiler testcases fail with "no route to host" issues on test-marist-sles12-s390x-2.

FYI @steelhead31

smlambert avatar Mar 21 '24 18:03 smlambert