bisq icon indicating copy to clipboard operation
bisq copied to clipboard

No connection, Peer discovery message fills 220Mb of log in short period of time

Open jmacxx opened this issue 2 years ago • 3 comments

Bisq can get into a state where it checks for network peers constantly, filling the logs with the connection attempts. e.g. Peer discovery didn't provide us any more peers, will try again in 10000ms According to the log message, there should be a 10000ms delay between connection attempts, but examples clearly contradict this. The problem with these messages spamming all 20 logfiles is that it prevents other, valid trade issues from being investigated. A high percentage of supplied user logs contain examples of this. I'm hoping this can be fixed in Bisq's fork of BitcoinJ.

Example:

Click to expand
Dec-31 15:04:19.124 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery took 100.0 ns and returned 0 items
Dec-31 15:04:19.124 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery didn't provide us any more peers, will try again in 10000ms.
Dec-31 15:04:19.124 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery took 100.0 ns and returned 0 items
Dec-31 15:04:19.124 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery didn't provide us any more peers, will try again in 10000ms.
Dec-31 15:04:19.124 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery took 0.000 ns and returned 0 items
Dec-31 15:04:19.124 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery didn't provide us any more peers, will try again in 10000ms.
Dec-31 15:04:19.124 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery took 0.000 ns and returned 0 items
Dec-31 15:04:19.125 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery didn't provide us any more peers, will try again in 10001ms.
Dec-31 15:04:19.125 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery took 100.0 ns and returned 0 items
Dec-31 15:04:19.125 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery didn't provide us any more peers, will try again in 10000ms.
Dec-31 15:04:19.125 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery took 0.000 ns and returned 0 items
Dec-31 15:04:19.125 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery didn't provide us any more peers, will try again in 10000ms.
Dec-31 15:04:19.125 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery took 0.000 ns and returned 0 items
Dec-31 15:04:19.125 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery didn't provide us any more peers, will try again in 10000ms.
Dec-31 15:04:19.125 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery took 0.000 ns and returned 0 items
Dec-31 15:04:19.125 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery didn't provide us any more peers, will try again in 10000ms.
Dec-31 15:04:19.125 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery took 0.000 ns and returned 0 items
Dec-31 15:04:19.125 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery didn't provide us any more peers, will try again in 10000ms.
Dec-31 15:04:19.125 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery took 0.000 ns and returned 0 items
Dec-31 15:04:19.125 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery didn't provide us any more peers, will try again in 10000ms.
Dec-31 15:04:19.126 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery took 0.000 ns and returned 0 items
Dec-31 15:04:19.126 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery didn't provide us any more peers, will try again in 10000ms.
Dec-31 15:04:19.126 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery took 0.000 ns and returned 0 items
Dec-31 15:04:19.126 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery didn't provide us any more peers, will try again in 10000ms.
Dec-31 15:04:19.126 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery took 0.000 ns and returned 0 items
Dec-31 15:04:19.126 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery didn't provide us any more peers, will try again in 10000ms.
Dec-31 15:04:19.126 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery took 0.000 ns and returned 0 items
Dec-31 15:04:19.126 [PeerGroup Thread] INFO  o.b.c.PeerGroup: Peer discovery didn't provide us any more peers, will try again in 10000ms.

jmacxx avatar Jan 24 '22 22:01 jmacxx

There is a minor resource leak of callable future tasks in PeerGroup.java.

The leak occurs during a connection group backoff period. A connection group backoff is initiated whenever a connection dies if there is no existing backoff in progress. The group backoff is indicated in the logs by a message like "Waiting 41869 ms before next connect attempt to [address]". During this time, if another connection dies it will leak one callable future task.
In the case of Bisq with maxconnections=7 and no inet availablity, the leak will be 6 futures for every connection group backoff.

The leak grows with each cycle of connection attempts.

triggerConnectionsJob is invoked by each of the leaked resources perpetually (with random delays) until all connections are restored - this is what fills up the logs.

Observed size of the leaked future was 100 after 5 minutes, 200 after 10 minutes, 300 after an hour, 1300 after 10 hours.
It has the ability to completely fill 220Mb of bisq logs if left running with no network connection for several days. (Surprisingly, this happens in more user logs than you would expect -> possibly they configured to use their own node which then went down).

Example: bisq_20.zip from a user; 82000 lines filled by the leaked log messages in half an hour. All 20 of their logs were like this.


To reproduce:

  • Run Bisq with no reachable internet. After an hour the logging will clearly show the problem.
  • Make the internet reachable again. Observe that nodes become connected after a few minutes.

Modifying this logging code enables you to see the leak count increase over time.

log.info("Waiting {} ms before next connect attempt {} leak={}", delay, addrToTry == null ? "" : "to " + addrToTry, scheduledThreadPoolExecutor.getQueue().size());

The problem from a code perspective.

PeerGroup triggerConnectionsJob: basic function is "do the next thing necessary regarding making peer connections", which in many cases can be simply do nothing at this time and try again later. That means it can be called many times without ill-behaviour, (except producing exessive logging, and the resource leak).

Algorithm: The job does one of the following, in order of precedence:

1- check for a local bitcoin node. (only done first time)
2- perform a discovery process to find nodes to connect to.  (Bisq does not use discovery, instead we provide a list of node addresses).
3- if ipv6 connectivity has gone down, remove all ipv6 node addresses from the list of potential connections.
4- decide there are no available nodes to connect to at this time, so schedule a future time to re-do triggerConnectionsJob.
5- decide we have enough active node connections, so do nothing.
6- decide that we have a node to connect to, but the backoff time has not elapsed yet, so log and schedule a future time to re-do triggerConnectionsJob.
7- open a connection to an available node, and schedule an immediate call to triggerConnectionsJob (so that all available connections get started).

When a node connection dies, triggerConnectionsJob is invoked via callable future, which then essentially calls itself via delayed futures bootstrapping until all new connections can be made. When connections have been made, it stops calling itself, relying on the connection death event to re-start the process. If more than one connection dies within 10 seconds (typically or whatever the variable backoff time happens to be), then callable futures are leaked.


Example:

bitcoinj is provided with 4 addresses for connecting to: a,b,c,d.
(in practice it is provided 16 federated Bisq addresses, simplified here for demonstration purposes).

bitcoinj opens connections to a,b,c,d as described above algorithm point 7.
the connection to a dies, bitcoinj triggers the scheduler which creates a future call for +10 seconds to reconnect (resource=future1)
the connection to b dies, bitcoinj triggers the scheduler which creates a future call for +10 seconds to reconnect (resource=future2)
the connection to c dies, bitcoinj triggers the scheduler which creates a future call for +10 seconds to reconnect (resource=future3)
the connection to d dies, bitcoinj triggers the scheduler which creates a future call for +10 seconds to reconnect (resource=future4)

at this point future1 is useful, future 2,3,4 are leaked resources.

10 seconds passes.
the scheduler is invoked by future1, it opens connections to a,b,c,d as described above algorithm point 7.
the scheduler is invoked by future2, it logs peer discovery provided no addresses and will retry in 10 seconds (resource=future5).
the scheduler is invoked by future3, it logs peer discovery provided no addresses and will retry in 10 seconds (resource=future6).
the scheduler is invoked by future4, it logs peer discovery provided no addresses and will retry in 10 seconds (resource=future7).

future 5,6,7 are now the leaked resources

the connection to a dies, bitcoinj triggers the scheduler which creates a future call for +10 seconds to reconnect (resource=future8)
the connection to b dies, bitcoinj triggers the scheduler which creates a future call for +10 seconds to reconnect (resource=future9)
the connection to c dies, bitcoinj triggers the scheduler which creates a future call for +10 seconds to reconnect (resource=future10)
the connection to d dies, bitcoinj triggers the scheduler which creates a future call for +10 seconds to reconnect (resource=future11)

at this point future8 is useful, future 5,6,7,9,10,11 are leaked resources.

the scheduler is invoked by future5, it logs peer discovery provided no addresses and will retry in 10 seconds (resource=future12).
the scheduler is invoked by future6, it logs peer discovery provided no addresses and will retry in 10 seconds (resource=future13).
the scheduler is invoked by future7, it logs peer discovery provided no addresses and will retry in 10 seconds (resource=future14).

the scheduler is invoked by future8, it opens connections to a,b,c,d as described above algorithm point 6.
the scheduler is invoked by future9, it logs peer discovery provided no addresses and will retry in 10 seconds (resource=future15).
the scheduler is invoked by future10, it logs peer discovery provided no addresses and will retry in 10 seconds (resource=future16).
the scheduler is invoked by future11, it logs peer discovery provided no addresses and will retry in 10 seconds (resource=future17).

future 12,13,14,15,16,17 are now the leaked resources

[repeats]

Solution:

The leak and excessive logging can be mitigated by the following change at L580 of PeerGroup.java:

private void triggerConnections() {
+    if (scheduledThreadPoolExecutor.getQueue().size() > 1) {
+        return;
+    }

(scheduledThreadPoolExecutor has to be made accessible via createPrivateExecutor in order to query its queue size).


Conclusion

Overall I suppose the leak is fairly benign, the annoyance is user's logs can fill up easily, preventing any problem diagnosis. The resource leak seems quite small and does clear up once connectivity is restored.

jmacxx avatar Jan 27 '22 04:01 jmacxx

Could you please prepare a PR against our bitcoinj fork and I'll merge and add it to the Bisq repository afterwards?

ripcurlx avatar Jan 27 '22 09:01 ripcurlx

Unfortunately https://github.com/jmacxx/bitcoinj/commit/f70cceff7d2c15d4e71d191e29d1436c6c118d60 breaks some tests in bitcoinj, I'll look into it.

jmacxx avatar Feb 01 '22 03:02 jmacxx