
IPv6 neighbor failure in name resolution crashes IRI

Open MaKla89 opened this issue 5 years ago • 11 comments

Bug description

I'm operating an IPv6 IRI node with a total of 4 neighbors that usually work fine (or at least have been working, and do so for some hours/days after a reboot).

After starting the node and waiting for some hours/days, the journal output turns to 0-values only, like this:

08/25 11:08:09.409 [Tips Requester] INFO TipsRequesterImpl:86 - toProcess = 0 , toBroadcast = 0 , toRequest = 0 , toReply = 0 / totalTransactions = 4821933

Output of "nbctl -l" (which just makes a call to the IRI API for getNeighbors) during this crash-state shows 3 correct entries and for the last neighbor this:

{ "connected": false, "connectionType": "tcp", "numberOfAllTransactions": 0, "numberOfDroppedSentPackets": 0, "numberOfInvalidTransactions": 0, "numberOfNewTransactions": 0, "numberOfRandomTransactionRequests": 0, "numberOfSentTransactions": 0, "numberOfStaleTransactions": 0 }

When I try to ping the faulty neighbor, this happens:

ping [2001:xxx:xxxx:xxx:xxx:xxx]
ping: [2001:xxx:xxxx:xxx:xxx:xxx]: Temporary failure in name resolution

IRI version

1.8.1 (dockerized playbook)

Hardware Spec

Intel NUC - Celeron j3455 4-core, 8GB RAM, SSD

Steps To Reproduce

  1. Add an IPv6 neighbor with a likely non-static IPv6 address
  2. Wait
  3. Check when "journalctl -fu iri" shows only 0-values for the tx counts

Expected behaviour

IRI should continue operation for the 3 working neighbors and just ignore the neighbor with name resolution issues.

Actual behaviour

According to the journal, all IRI activity drops to 0 TX. Restarting the node resolves the issue for some time (likely until the next IPv6 update from the "faulty" neighbor?).

Errors

"Temporary failure in name resolution"

MaKla89 avatar Aug 25 '19 11:08 MaKla89

Just to add: it seems that getNeighbors in the above case returns a neighbor with a missing address field. I guess this would also be the case when name resolution fails for IPv4?

I think it would be very useful to always display all the neighbors and add a new field called display_name. This should be the name with which a user has added the neighbor. The address and domain fields can remain, and whether a connection is established is shown in the connected field anyway.
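For illustration, a minimal sketch of how such a neighbor entry might be modeled; the class name NeighborDto and the field displayName are hypothetical and not part of the current IRI API:

```java
// Hypothetical sketch of a getNeighbors entry as proposed above.
// NeighborDto and displayName are illustrative names, not IRI's actual API.
public class NeighborDto {
    // The URI exactly as the operator passed it to addNeighbors, so the entry
    // stays identifiable even while DNS resolution is failing.
    public String displayName;
    // Resolved IP address; may be absent while resolution fails.
    public String address;
    // Domain name, if one was configured.
    public String domain;
    // Connection state is still reported here, as today.
    public boolean connected;
    public long numberOfAllTransactions;
    public long numberOfNewTransactions;
}
```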

nuriel77 avatar Aug 25 '19 11:08 nuriel77

The error state was triggered again yesterday evening and has been consistent until today (since I did not restart so far). I've taken screenshots from the reversed IRI log file: IRI-Log

The important part seems to be "Exception in thread "Neighbor Router" java.lang.IllegalArgumentException: hostname can't be null"
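That message matches what java.net.InetSocketAddress throws when it is handed a null hostname. A standalone sketch (not IRI code) that reproduces the same exception:

```java
import java.net.InetSocketAddress;

public class NullHostnameRepro {
    public static void main(String[] args) {
        // A failed DNS/address lookup could leave the hostname as null.
        String hostname = null;
        // The (String, int) constructor calls checkHost(), which throws
        // java.lang.IllegalArgumentException: hostname can't be null
        new InetSocketAddress(hostname, 15600);
    }
}
```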

And a screenshot of the results of "nbctl -l". Note the missing address field: IRI nbctl

Bonus info: as of NOW, the IPv6 address that seems to have caused the problem is ping-able again (I was not around at the time the error occurred, though, so I could not check back then).

MaKla89 avatar Sep 03 '19 06:09 MaKla89

I am one of MaKla89's IPv6 neighbors. In my output of nbctl -l, the address field was missing from MaKla's node. After a reset things went back to normal. Some debug info taken before the reset: GetNeighbors, MaKla's node:

{
      "connected": false,
      "connectionType": "tcp",
      "numberOfAllTransactions": 0,
      "numberOfDroppedSentPackets": 0,
      "numberOfInvalidTransactions": 0,
      "numberOfNewTransactions": 0,
      "numberOfRandomTransactionRequests": 0,
      "numberOfSentTransactions": 0,
      "numberOfStaleTransactions": 0
    }

Last log lines before the Java crash:

09/02 21:21:27.324 [Latest Milestone Tracker] INFO  LatestMilestoneTrackerImpl:166 - Latest milestone has changed from #1161954 to #1161955
09/02 21:21:30.176 [Latest Solid Milestone Tracker] INFO  LatestSolidMilestoneTrackerImpl:392 - Latest SOLID milestone index changed from #1161954 to #1161955
terminate called without an active exception
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f910a3678f0, pid=1, tid=0x00007f9087059700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 1.8.0_181-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libc.so.6+0x408f0]  abort+0x230

GetNodeInfo:

{
  "appName": "IRI",
  "appVersion": "1.8.1",
  "jreAvailableProcessors": 4,
  "jreFreeMemory": 2734416400,
  "jreVersion": "1.8.0_181",
  "jreMaxMemory": 4963958784,
  "jreTotalMemory": 4963958784,
  "latestMilestone": "XBFQCPFGPKJDPLWIHQUAQTIQEXUJOSWBQAPILAGUWZZFZAA9Y9UWYIZFRLONFVBWX9LBKKIDZTAG99999",
  "latestMilestoneIndex": 1162405,
  "latestSolidSubtangleMilestone": "XBFQCPFGPKJDPLWIHQUAQTIQEXUJOSWBQAPILAGUWZZFZAA9Y9UWYIZFRLONFVBWX9LBKKIDZTAG99999",
  "latestSolidSubtangleMilestoneIndex": 1162405,
  "milestoneStartIndex": 1143949,
  "lastSnapshottedMilestoneIndex": 1162301,
  "neighbors": 2,
  "packetsQueueSize": 0,
  "time": 1567607814806,
  "tips": 308,
  "transactionsToRequest": 0,
  "features": [
    "snapshotPruning",
    "dnsRefresher",
    "RemotePOW"
  ],
  "coordinatorAddress": "EQSAUZXULTTYZCLNJNTXQTQHOMOFZERHTCGTXOLTVAHKSA9OGAZDEKECURBRIXIJWNPFCQIOVFVVXJVD9",
  "duration": 1
}

Iri startup line in docker: java -XX:+UnlockExperimentalVMOptions -XX:+DisableAttachMechanism -XX:InitiatingHeapOccupancyPercent=60 -XX:G1MaxNewSizePercent=75 -XX:MaxGCPauseMillis=10000 -XX:+UseG1GC -Xms1638m -Xmx4734m -Djava.net.preferIPv4Stack=false -jar /iri/target/iri-1.8.1.jar --remote true --remote-limit-api removeNeighbors, addNeighbors, getNeighbors --neighbors tcp://ip4nbr1:15600 tcp://ip4nbr2:15600 tcp://ip4nbr3:15600 tcp://ip4nbr4:15600 tcp://ip4nbr5:15600 tcp://[ip6nbr1]:15600 --port 14265 --neighboring-socket-address 0.0.0.0 --neighboring-socket-port 15600 -c /iri/conf/iri.ini

TeeVeeEss avatar Sep 04 '19 15:09 TeeVeeEss

We also found out that this error can - more or less - be reproduced by simply restarting either my node or TeeVeeEss' IRI. The OTHER node will then get the mentioned error and its IRI service will crash. This seems to happen every other time, meaning like this:

  1. I start IRI
  2. T. starts his IRI
  3. At some random time ONE of us performs a restart (for whatever reason)
  4. The other's node will crash right then
  5. After also restarting the crashed node, everything works fine until the next restart by either one of us

MaKla89 avatar Sep 04 '19 19:09 MaKla89

@MaKla89 @TeeVeeEss Anyone still experiencing this? I set up 2 VMs with Ubuntu and couldn't reproduce according to the above.

> Add an IPv6 neighbor with a likely non-static IPv6 address

I had a static IPv6, though.

acha-bill avatar Mar 30 '20 13:03 acha-bill

MaKla89 and I are still neighbors, but only over IPv4. I will try to contact him to see if we can arrange the IPv6 connection again and check whether we can reproduce the issue.

TeeVeeEss avatar Apr 04 '20 11:04 TeeVeeEss

We managed to create an IPv6 connection again. It has been up for a few hours; let's see how stable it is. During the setup of my IPv6 node (testing ports, firewall and port forwarding) I did see the same error as MaKla reported on Sep 3, 2019. It came from my other IPv6 node:

Apr 13 00:44:47 iota IRI[5186]: 04/13 00:44:47.396 [Neighbor Router] INFO  NeighborRouterImpl:738 - removed neighbor (hidden:)3b4e:15601 from connected neighbors
Apr 13 00:44:47 iota IRI[5186]: 04/13 00:44:47.398 [Neighbor Router] INFO  NeighborRouterImpl:458 - adding tcp://(hidden:)3b4e:15601 to the reconnect pool
Apr 13 00:44:48 iota IRI[5186]: 04/13 00:44:48.574 [Tips Requester] INFO  TipsRequesterImpl:86 - toProcess = 0 , toBroadcast = 0 , toRequest = 0 , toReply = 0 / totalTransactions = 13901406
Apr 13 00:44:49 iota IRI[5186]: 04/13 00:44:49.387 [Local Snapshots Monitor] INFO  SnapshotServiceImpl:392 - Taking local snapshot [analyzing old solid entry points]: 2.00% ...
Apr 13 00:44:50 iota IRI[5186]: 04/13 00:44:50.742 [XNIO-1 task-17] ERROR API:291 - Unexpected API Exception: hostname can't be null
Apr 13 00:44:54 iota IRI[5186]: 04/13 00:44:54.587 [XNIO-1 task-19] ERROR API:291 - Unexpected API Exception: hostname can't be null
Apr 13 00:44:58 iota IRI[5186]: 04/13 00:44:58.584 [Tips Requester] INFO  TipsRequesterImpl:86 - toProcess = 0 , toBroadcast = 0 , toRequest = 0 , toReply = 0 / totalTransactions = 13901494
Apr 13 00:44:59 iota IRI[5186]: 04/13 00:44:59.294 [pool-10-thread-1] INFO  BatchedBCTCurl:103 - batching saturation ratio 3.0% (processed 323 / max possible 9856), cycles 154, avg. cycle time 130.0ms
Apr 13 00:44:59 iota IRI[5186]: 04/13 00:44:59.698 [Local Snapshots Monitor] INFO  SnapshotServiceImpl:392 - Taking local snapshot [analyzing old solid entry points]: 2.01% ...
Apr 13 00:45:00 iota IRI[5186]: 04/13 00:45:00.742 [XNIO-1 task-20] ERROR API:291 - Unexpected API Exception: hostname can't be null
Apr 13 00:46:30 iota IRI[5186]: 04/13 00:46:30.831 [XNIO-1 task-6] ERROR API:291 - Unexpected API Exception: hostname can't be null
Apr 13 00:46:38 iota IRI[5186]: 04/13 00:46:38.280 [Neighbor Router] INFO  NeighborRouterImpl:612 - establishing connections to 1 wanted neighbors [tcp://(hidden:)3b4e:15601]
Apr 13 00:46:38 iota IRI[5186]: 04/13 00:46:38.282 [Neighbor Router] INFO  NeighborRouterImpl:238 - neighbor router stopped
Apr 13 00:46:38 iota IRI[5186]: Exception in thread "Neighbor Router" java.lang.IllegalArgumentException: hostname can't be null
Apr 13 00:46:38 iota IRI[5186]:         at java.base/java.net.InetSocketAddress.checkHost(InetSocketAddress.java:149)
Apr 13 00:46:38 iota IRI[5186]:         at java.base/java.net.InetSocketAddress.<init>(InetSocketAddress.java:216)
Apr 13 00:46:38 iota IRI[5186]:         at com.iota.iri.network.NeighborRouterImpl.lambda$connectToWantedNeighbors$2(NeighborRouterImpl.java:614)
Apr 13 00:46:38 iota IRI[5186]:         at java.base/java.util.concurrent.CopyOnWriteArrayList.forEach(CopyOnWriteArrayList.java:803)
Apr 13 00:46:38 iota IRI[5186]:         at java.base/java.util.concurrent.CopyOnWriteArraySet.forEach(CopyOnWriteArraySet.java:425)
Apr 13 00:46:38 iota IRI[5186]:         at com.iota.iri.network.NeighborRouterImpl.connectToWantedNeighbors(NeighborRouterImpl.java:613)
Apr 13 00:46:38 iota IRI[5186]:         at com.iota.iri.network.NeighborRouterImpl.route(NeighborRouterImpl.java:178)
Apr 13 00:46:38 iota IRI[5186]:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Apr 13 00:46:38 iota IRI[5186]:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Apr 13 00:46:38 iota IRI[5186]:         at java.base/java.lang.Thread.run(Thread.java:834)

The Neighbor Router did not start again; I had to restart the node to get back to a normally working node. It happened when port 15601 was closed. It looks like something went wrong with using DNS to find the hostname of the IPv6 address. As this IPv6 address has no hostname, it could be that the resulting null is not caught in the code.
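Based on the stack trace above (NeighborRouterImpl.connectToWantedNeighbors building an InetSocketAddress), a defensive guard along the following lines would let the router skip such a neighbor instead of dying; this is only a sketch with assumed names (ReconnectPoolSketch, connectToWantedNeighbor), not the actual IRI implementation:

```java
import java.net.InetSocketAddress;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch: skip neighbors whose hostname cannot currently be
// resolved instead of letting an IllegalArgumentException kill the
// "Neighbor Router" thread.
class ReconnectPoolSketch {
    private static final Logger log = LoggerFactory.getLogger(ReconnectPoolSketch.class);

    void connectToWantedNeighbor(String hostname, int port) {
        if (hostname == null || hostname.isEmpty()) {
            // new InetSocketAddress(null, port) would throw
            // IllegalArgumentException("hostname can't be null")
            log.warn("skipping neighbor on port {}: hostname not resolvable right now", port);
            return; // keep it in the reconnect pool and retry later
        }
        InetSocketAddress address = new InetSocketAddress(hostname, port);
        // ... proceed with the non-blocking connect attempt ...
    }
}
```

Catching the IllegalArgumentException around the per-neighbor connect attempt would have a similar effect; either way the router thread would survive a neighbor whose address cannot currently be resolved.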

TeeVeeEss avatar Apr 13 '20 17:04 TeeVeeEss

After a day or so we both see the error hostname can't be null. A curl for getNeighbors gives:

{
  "exception": "hostname can't be null",
  "duration": 1
}

Shutting down my node gave a hs_err_pidxxx.log, copied to https://ybin.me/p/f4f54a5bdae7aa91#Pl3rCg7cW1IecwplBVlMmUACT3NwgiyqzUOd52M9Cy8=

TeeVeeEss avatar Apr 14 '20 18:04 TeeVeeEss

It might be relevant to add that FIRST one node gets that message and crashes, and only after I restarted mine did I see TeeVee's node crashing, I think. It creates some sort of circle of crashing.

MaKla89 avatar Apr 14 '20 18:04 MaKla89

@acha-bill We stopped the connection between our IPv6 nodes; the pattern of the IPv6 DNS name resolving to null leads to non-working nodes.

TeeVeeEss avatar Apr 17 '20 20:04 TeeVeeEss

Ok @TeeVeeEss. Thanks for confirming that this is still an issue.

acha-bill avatar Apr 18 '20 10:04 acha-bill