iri
iri copied to clipboard
IPv6 neighbor failure in name resolution crashes IRI
Bug description
I'm operating an ipv6-IRI-Node with a total of 4 neighbors that usually work fine (or at least have been working and do so for some hours / days after a reboot).
After starting the node and waiting for some hours / days, journal outputs turns to 0-values only, like this:
08/25 11:08:09.409 [Tips Requester] INFO TipsRequesterImpl:86 - toProcess = 0 , toBroadcast = 0 , toRequest = 0 , toReply = 0 / totalTransactions = 4821933
Output of "nbctl -l" (which just makes a call to the IRI API for getNeighbors) during this crash-state shows 3 correct entries and for the last neighbor this:
{ "connected": false, "connectionType": "tcp", "numberOfAllTransactions": 0, "numberOfDroppedSentPackets": 0, "numberOfInvalidTransactions": 0, "numberOfNewTransactions": 0, "numberOfRandomTransactionRequests": 0, "numberOfSentTransactions": 0, "numberOfStaleTransactions": 0 }
When I try to ping the faulty neighbor, this happens :
ping [2001:xxx:xxxx:xxx:xxx:xxx]
ping: [2001:xxx:xxxx:xxx:xxx:xxx]: Temporary failure in name resolution
IRI version
1.8.1 (dockerized playbook)
Hardware Spec
Intel NUC - Celeron j3455 4-core, 8GB RAM, SSD
Steps To Reproduce
- Add ipv6-neighbor with likely non-static IPv6-adress
- wait
- check when "journalctl -fu iri" does only show 0-values for tx-counts
Expected behaviour
IRI should continue operation for the 3 working neighbors and just ignore the neighbor with name resolution issues.
Actual behaviour
According to the journal, all IRI-acitivities drop to 0 TX. Restarting the node resolves the issue for some time (likely until next ipv6-Update from "faulty" neighbor?)
Errors
"Temporary failure in name resolution"
Just to add: it seems that getNeighbors in the above case returns a neighbor missing address
field.
I guess this would also be the case when name resolution fails for IPv4?
I think it would be very useful to display all the neighbors, always, and add a new field called display_name
. This should be the name with which a user has added a neighbor.
address
and domain
fields can remain, and whether a connection is established is anyway shown in connected
field.
The error-state was triggered again yesterday evening and is consistent until today (since I did not restart so far). I've taken screenshots from the reverse IRI-Logfile:
The important part seems to be "Exception in thread "Neighbor Router" java.lang.IllegalArgumentException: hostname can't be null"
And a screenshot of the results of "nbctl -l". Mind the missing address-field!:
Bonus info: As of NOW the IPv6-address that seems to have caused the problem is ping-able right now (I was not around at the time the error occured though, thus I could not check back then)
I am one of the ipv6-neighbors of Makla89. In my output of nbctl -l
the address-field was missing from the Makla's node. After a reset things went back to normal. Some debug info taken before the reset:
GetNeighbors, Makla's node:
{
"connected": false,
"connectionType": "tcp",
"numberOfAllTransactions": 0,
"numberOfDroppedSentPackets": 0,
"numberOfInvalidTransactions": 0,
"numberOfNewTransactions": 0,
"numberOfRandomTransactionRequests": 0,
"numberOfSentTransactions": 0,
"numberOfStaleTransactions": 0
}
Last logging before a java crash:
09/02 21:21:27.324 [Latest Milestone Tracker] INFO LatestMilestoneTrackerImpl:166 - Latest milestone has changed from #1161954 to #1161955
09/02 21:21:30.176 [Latest Solid Milestone Tracker] INFO LatestSolidMilestoneTrackerImpl:392 - Latest SOLID milestone index changed from #1161954 to #1161955
terminate called without an active exception
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f910a3678f0, pid=1, tid=0x00007f9087059700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 1.8.0_181-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libc.so.6+0x408f0] abort+0x230
GetNodeInfo:
{
"appName": "IRI",
"appVersion": "1.8.1",
"jreAvailableProcessors": 4,
"jreFreeMemory": 2734416400,
"jreVersion": "1.8.0_181",
"jreMaxMemory": 4963958784,
"jreTotalMemory": 4963958784,
"latestMilestone": "XBFQCPFGPKJDPLWIHQUAQTIQEXUJOSWBQAPILAGUWZZFZAA9Y9UWYIZFRLONFVBWX9LBKKIDZTAG99999",
"latestMilestoneIndex": 1162405,
"latestSolidSubtangleMilestone": "XBFQCPFGPKJDPLWIHQUAQTIQEXUJOSWBQAPILAGUWZZFZAA9Y9UWYIZFRLONFVBWX9LBKKIDZTAG99999",
"latestSolidSubtangleMilestoneIndex": 1162405,
"milestoneStartIndex": 1143949,
"lastSnapshottedMilestoneIndex": 1162301,
"neighbors": 2,
"packetsQueueSize": 0,
"time": 1567607814806,
"tips": 308,
"transactionsToRequest": 0,
"features": [
"snapshotPruning",
"dnsRefresher",
"RemotePOW"
],
"coordinatorAddress": "EQSAUZXULTTYZCLNJNTXQTQHOMOFZERHTCGTXOLTVAHKSA9OGAZDEKECURBRIXIJWNPFCQIOVFVVXJVD9",
"duration": 1
}
Iri startup line in docker:
java -XX:+UnlockExperimentalVMOptions -XX:+DisableAttachMechanism -XX:InitiatingHeapOccupancyPercent=60 -XX:G1MaxNewSizePercent=75 -XX:MaxGCPauseMillis=10000 -XX:+UseG1GC -Xms1638m -Xmx4734m -Djava.net.preferIPv4Stack=false -jar /iri/target/iri-1.8.1.jar --remote true --remote-limit-api removeNeighbors, addNeighbors, getNeighbors --neighbors tcp://ip4nbr1:15600 tcp://ip4nbr2:15600 tcp://ip4nbr3:15600 tcp://ip4nbr4:15600 tcp://ip4nbr5:15600 tcp://[ip6nbr1]:15600 --port 14265 --neighboring-socket-address 0.0.0.0 --neighboring-socket-port 15600 -c /iri/conf/iri.ini
We also found out that this error can - more or less - be reproduced by simply restarting either my node or TeeVeeEss' IRI. The OTHER node will get the mentioned error and its IRI-service will crash. This seems to happen every other time. Meaning like this:
- I start IRI
- T. starts his IRI
- at some random time ONE of us performs a restart (whatever reason)
- The others node will crash right then
- After also restarting the crahsed node, everything works fine until the next restart of any one of us
We also found out that this error can - more or less - be reproduced by simply restarting either my node or TeeVeeEss' IRI. The OTHER node will get the mentioned error and its IRI-service will crash. This seems to happen every other time. Meaning like this:
- I start IRI
- T. starts his IRI
- at some random time ONE of us performs a restart (whatever reason)
- The others node will crash right then
- After also restarting the crahsed node, everything works fine until the next restart of any one of us
@MaKla89 @TeeVeeEss Anyone still experiencing this? I setup 2 vms with ubuntu and coulldn't reproduce according to the above.
Add ipv6-neighbor with likely non-static IPv6-adress
I had a static ipv6 though
Makla89 and I are still neighbors, but only as ipv4. I will try to contact him if we can arrange the ipv6 connection again to check if we can reproduce the issue.
We managed to create an ipv6-connection again. It's up for a few hours, let's see how stable it is. During the setup of my ipv6-node (testing ports, firewall and port forwarding) I did saw the same error as Makla reported on sep 3 2019. It was from my other ipv6-node :
Apr 13 00:44:47 iota IRI[5186]: 04/13 00:44:47.396 [Neighbor Router] INFO NeighborRouterImpl:738 - removed neighbor (hidden:)3b4e:15601 from connected neighbors
Apr 13 00:44:47 iota IRI[5186]: 04/13 00:44:47.398 [Neighbor Router] INFO NeighborRouterImpl:458 - adding tcp://(hidden:)3b4e:15601 to the reconnect pool
Apr 13 00:44:48 iota IRI[5186]: 04/13 00:44:48.574 [Tips Requester] INFO TipsRequesterImpl:86 - toProcess = 0 , toBroadcast = 0 , toRequest = 0 , toReply = 0 / totalTransactions = 13901406
Apr 13 00:44:49 iota IRI[5186]: 04/13 00:44:49.387 [Local Snapshots Monitor] INFO SnapshotServiceImpl:392 - Taking local snapshot [analyzing old solid entry points]: 2.00% ...
Apr 13 00:44:50 iota IRI[5186]: 04/13 00:44:50.742 [XNIO-1 task-17] ERROR API:291 - Unexpected API Exception: hostname can't be null
Apr 13 00:44:54 iota IRI[5186]: 04/13 00:44:54.587 [XNIO-1 task-19] ERROR API:291 - Unexpected API Exception: hostname can't be null
Apr 13 00:44:58 iota IRI[5186]: 04/13 00:44:58.584 [Tips Requester] INFO TipsRequesterImpl:86 - toProcess = 0 , toBroadcast = 0 , toRequest = 0 , toReply = 0 / totalTransactions = 13901494
Apr 13 00:44:59 iota IRI[5186]: 04/13 00:44:59.294 [pool-10-thread-1] INFO BatchedBCTCurl:103 - batching saturation ratio 3.0% (processed 323 / max possible 9856), cycles 154, avg. cycle time 130.0ms
Apr 13 00:44:59 iota IRI[5186]: 04/13 00:44:59.698 [Local Snapshots Monitor] INFO SnapshotServiceImpl:392 - Taking local snapshot [analyzing old solid entry points]: 2.01% ...
Apr 13 00:45:00 iota IRI[5186]: 04/13 00:45:00.742 [XNIO-1 task-20] ERROR API:291 - Unexpected API Exception: hostname can't be null
Apr 13 00:46:30 iota IRI[5186]: 04/13 00:46:30.831 [XNIO-1 task-6] ERROR API:291 - Unexpected API Exception: hostname can't be null
Apr 13 00:46:38 iota IRI[5186]: 04/13 00:46:38.280 [Neighbor Router] INFO NeighborRouterImpl:612 - establishing connections to 1 wanted neighbors [tcp://(hidden:)3b4e:15601]
Apr 13 00:46:38 iota IRI[5186]: 04/13 00:46:38.282 [Neighbor Router] INFO NeighborRouterImpl:238 - neighbor router stopped
Apr 13 00:46:38 iota IRI[5186]: Exception in thread "Neighbor Router" java.lang.IllegalArgumentException: hostname can't be null
Apr 13 00:46:38 iota IRI[5186]: at java.base/java.net.InetSocketAddress.checkHost(InetSocketAddress.java:149)
Apr 13 00:46:38 iota IRI[5186]: at java.base/java.net.InetSocketAddress.<init>(InetSocketAddress.java:216)
Apr 13 00:46:38 iota IRI[5186]: at com.iota.iri.network.NeighborRouterImpl.lambda$connectToWantedNeighbors$2(NeighborRouterImpl.java:614)
Apr 13 00:46:38 iota IRI[5186]: at java.base/java.util.concurrent.CopyOnWriteArrayList.forEach(CopyOnWriteArrayList.java:803)
Apr 13 00:46:38 iota IRI[5186]: at java.base/java.util.concurrent.CopyOnWriteArraySet.forEach(CopyOnWriteArraySet.java:425)
Apr 13 00:46:38 iota IRI[5186]: at com.iota.iri.network.NeighborRouterImpl.connectToWantedNeighbors(NeighborRouterImpl.java:613)
Apr 13 00:46:38 iota IRI[5186]: at com.iota.iri.network.NeighborRouterImpl.route(NeighborRouterImpl.java:178)
Apr 13 00:46:38 iota IRI[5186]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Apr 13 00:46:38 iota IRI[5186]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Apr 13 00:46:38 iota IRI[5186]: at java.base/java.lang.Thread.run(Thread.java:834)
The neighbour router did not start again, had to restart the node to get a normal working node. It happened when port 15601 was closed. Looks like something went wrong with using dns to find the hostname of the ipv6-address. As this ipv6-address has no hostname, it could be that null is not caught in the code.
After a day or so we both see the error hostname can't be null
A curl for getNeighbors
gives:
{
"exception": "hostname can't be null",
"duration": 1
}
Shutting down my node gave a hs_err_pidxxx.log, copied, to https://ybin.me/p/f4f54a5bdae7aa91#Pl3rCg7cW1IecwplBVlMmUACT3NwgiyqzUOd52M9Cy8=
It might be relevant to add that FIRST one node get that message and crashes and only after I restarted mine, I saw TeeVee's node crashing I think. It creates some sort of circle of crashing
@acha-bill We stopped the connection between our ipv6-nodes, the pattern of the ipv6 dns-name resolving to null
is leading to not working nodes.
Ok @TeeVeeEss . Thanks for confirming that this is still an issue