monero-lws icon indicating copy to clipboard operation
monero-lws copied to clipboard

Monero zmq crash takes down `monero-lws-daemon`. (Probably)

Open trasherdk opened this issue 3 years ago • 17 comments

monerod is running, all good.

tcp        0      0 0.0.0.0:38080      0.0.0.0:*               LISTEN      7425/monerod        
tcp        0      0 127.0.0.1:38081    0.0.0.0:*               LISTEN      7425/monerod        
tcp        0      0 1.2.3.4:38082      0.0.0.0:*               LISTEN      7425/monerod        
tcp        0      0 1.2.3.4:38083      0.0.0.0:*               LISTEN      7425/monerod        
tcp        0      0 1.2.3.4:38084      0.0.0.0:*               LISTEN      7425/monerod        

monerod is still running, but zmq is gone.

tcp        0      0 0.0.0.0:38080      0.0.0.0:*               LISTEN      6527/monerod        
tcp        0      0 127.0.0.1:38081    0.0.0.0:*               LISTEN      6527/monerod        
tcp        0      0 1.2.3.4:38084      0.0.0.0:*               LISTEN      6527/monerod        

monero-lws-daemon is running on a different host. After some time, after zmq is gone, this happens.

2022-11-11 21:34:43.847 I Retrieving current active account list
2022-11-11 21:34:43.847 I No active accounts
2022-11-11 21:34:53.848 I Starting blockchain sync with daemon
2022-11-11 21:35:24.315 W Failed to connect to daemon at tcp://1.2.3.4:38082
2022-11-11 21:35:24.315 I Retrieving current active account list
2022-11-11 21:35:24.315 I No active accounts
2022-11-11 21:35:34.356 E thrown at scanner.cpp:800: Too many open files

There's a shitload of Failed to connect to daemon at tcp://1.2.3.4:38082 before the crash, so how long after, I don't know.

trasherdk avatar Nov 12 '22 01:11 trasherdk

I'm starting to think the lws crash was unrelated to zmq server was down.

Without a log-file on monero-lws-daemon it's impossible to say which one went down first.

The monero-lws-daemon does not have that many file-handles open. 430-450 including events.

I shut monerod down at 2022-11-12 01:28:18.875 W Failed to connect to daemon at tcp://1.2.3.4:38082 monero-lws-daemon is still running at 2022-11-12 05:47:02.075 W Failed to connect to daemon at tcp://1.2.3.4:38082

Could it be something it received from zmq ?

This is when zmq crashed on monerod:

2022-11-11 11:25:59.635 [P2P1]  INFO    net.p2p.msg     src/cryptonote_protocol/cryptonote_protocol_handler.inl:550     [188.34.179.54:52902 INC] Received NOTIFY_NEW_FLUFFY_BLOCK <e95a3a4ea3960a337d047fa778a540fd3fe56e89fa46fa4623d07dbe34d8c285> (height 1220696, 0 txes)
2022-11-11 11:25:59.635 [P2P4]  INFO    perf.blockchain src/common/perf_timer.cpp:120   PERF             ----------
2022-11-11 11:25:59.635 [P2P4]  INFO    perf.blockchain src/common/perf_timer.cpp:156   PERF        7      get_long_term_block_weights
2022-11-11 11:25:59.635 [P2P4]  INFO    perf.blockchain src/common/perf_timer.cpp:156   PERF      125    update_next_cumulative_weight_limit
2022-11-11 11:25:59.636 [P2P4]  INFO    blockchain      src/cryptonote_core/blockchain.cpp:4524 +++++ BLOCK SUCCESSFULLY ADDED
2022-11-11 11:25:59.636 [P2P4]  INFO    blockchain      src/cryptonote_core/blockchain.cpp:4524 id:     <e95a3a4ea3960a337d047fa778a540fd3fe56e89fa46fa4623d07dbe34d8c285>
2022-11-11 11:25:59.636 [P2P4]  INFO    blockchain      src/cryptonote_core/blockchain.cpp:4524 PoW:    <ea79e4acc1b0c5289d896a971b107a2a54a2ffe563bf25b90493c84f8a510000>
2022-11-11 11:25:59.636 [P2P4]  INFO    blockchain      src/cryptonote_core/blockchain.cpp:4524 HEIGHT 1220695, difficulty:     111144
2022-11-11 11:25:59.636 [P2P4]  INFO    blockchain      src/cryptonote_core/blockchain.cpp:4524 block reward: 3.538476387438(3.538476387438 + 0.000000000000), coinbase_weight: 85, cumulative weight: 85, 34(0/33)ms
2022-11-11 11:25:59.637 [P2P5]  INFO    net.p2p.traffic contrib/epee/include/net/levin_protocol_handler_async.h:57      [78.46.102.70:38080 OUT] 187 bytes received for category command-2008 initiated by peer
2022-11-11 11:25:59.637 [P2P5]  INFO    net.p2p.msg     src/cryptonote_protocol/cryptonote_protocol_handler.inl:550     [78.46.102.70:38080 OUT] Received NOTIFY_NEW_FLUFFY_BLOCK <e95a3a4ea3960a337d047fa778a540fd3fe56e89fa46fa4623d07dbe34d8c285> (height 1220696, 0 txes)
2022-11-11 11:25:59.638 [P2P4]  ERROR   net.zmq src/rpc/zmq_pub.cpp:538 Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-11 11:25:59.638 [P2P4]  ERROR   net.zmq src/rpc/zmq_pub.cpp:529 Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-11 11:25:59.641 [P2P0]  INFO    net.p2p.traffic contrib/epee/include/net/levin_protocol_handler_async.h:57      [18.206.13.223:54532 INC] 187 bytes received for category command-2008 initiated by peer

trasherdk avatar Nov 12 '22 06:11 trasherdk

Okay, here we go again. monerod

2022-11-12 11:18:47.711        I Found block <ae713803b7f463493a421939c51a6bc7f322db359393260ea219736ba282e1c2> at height 1221412 for difficulty: 106982
2022-11-12 11:26:20.469        I Found block <f93d44351b7fb02f7e8af8a6b5554556fe5defa67f376e950ae894c3baeaeb98> at height 1221418 for difficulty: 106276
2022-11-12 11:47:00.976        I Found block <7612be135b7d7ff85a3caab7e2d286ac5f3b1498821ddfd88390b7249412b8c0> at height 1221434 for difficulty: 107536
2022-11-12 12:10:56.926        E ZMQ RPC Server Error: thrown at zmq_server.cpp:161: Resource temporarily unavailable
2022-11-12 12:11:18.102        I Found block <b50745860f385ff1a2a12c88deec69af22cab6e04951c10aaeda7a31398337d1> at height 1221447 for difficulty: 106169
2022-11-12 12:11:18.152        E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 12:11:18.153        E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 12:11:49.473        E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 12:11:49.473        E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 12:13:19.091        E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 12:13:19.091        E Unable to send ZMQ/Pub - ZMQ server destroyed
...
...
2022-11-12 18:26:02.081        E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 18:26:02.081        E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 18:26:52.542        I Found block <ace0338ea93db5f7d9c24dc7300e0b2a55e2fee48c3fd5f2b28754853adae3b6> at height 1221669 for difficulty: 114610
2022-11-12 18:26:52.583        E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 18:26:52.583        E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 18:27:40.895        E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 18:27:52.347        E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 18:27:52.347        E Unable to send ZMQ/Pub - ZMQ server destroyed

monero-lws

2022-11-12 18:25:37.079 W Failed to connect to daemon at tcp://1.2.3.4:38082
2022-11-12 18:25:37.079 I Retrieving current active account list
2022-11-12 18:25:37.079 I No active accounts
2022-11-12 18:25:47.103 I Starting blockchain sync with daemon
2022-11-12 18:26:17.451 W Failed to connect to daemon at tcp://1.2.3.4:38082
2022-11-12 18:26:17.451 I Retrieving current active account list
2022-11-12 18:26:17.452 I No active accounts
2022-11-12 18:26:27.459 I Starting blockchain sync with daemon
2022-11-12 18:26:57.702 W Failed to connect to daemon at tcp://1.2.3.4:38082
2022-11-12 18:26:57.702 I Retrieving current active account list
2022-11-12 18:26:57.702 I No active accounts
2022-11-12 18:27:07.740 E thrown at scanner.cpp:800: Too many open files

trasherdk avatar Nov 12 '22 21:11 trasherdk

And, here's the testnet crash.

2022-11-12 12:10:56.683     7f48f57fa700        ERROR   net.zmq src/rpc/zmq_server.cpp:174      ZMQ RPC Server Error: thrown at zmq_server.cpp:161: Resource temporarily unavailable
2022-11-12 12:12:25.187 [P2P6]  WARNING cn      src/cryptonote_core/cryptonote_core.cpp:2048    There were 17 blocks in the last 90 minutes, there might be large hash rate changes, or we might be partitioned, cut off from the Monero network or under attack, or your computer's time is off. Or it could be just sheer bad luck.
2022-11-12 12:13:55.304 [P2P8]  WARNING cn      src/cryptonote_core/cryptonote_core.cpp:2048    There were 16 blocks in the last 90 minutes, there might be large hash rate changes, or we might be partitioned, cut off from the Monero network or under attack, or your computer's time is off. Or it could be just sheer bad luck.
2022-11-12 12:15:25.436 [P2P3]  WARNING cn      src/cryptonote_core/cryptonote_core.cpp:2048    There were 16 blocks in the last 90 minutes, there might be large hash rate changes, or we might be partitioned, cut off from the Monero network or under attack, or your computer's time is off. Or it could be just sheer bad luck.
2022-11-12 12:16:55.655 [P2P0]  WARNING cn      src/cryptonote_core/cryptonote_core.cpp:2048    There were 16 blocks in the last 90 minutes, there might be large hash rate changes, or we might be partitioned, cut off from the Monero network or under attack, or your computer's time is off. Or it could be just sheer bad luck.
2022-11-12 12:18:25.831 [P2P4]  WARNING cn      src/cryptonote_core/cryptonote_core.cpp:2048    There were 16 blocks in the last 90 minutes, there might be large hash rate changes, or we might be partitioned, cut off from the Monero network or under attack, or your computer's time is off. Or it could be just sheer bad luck.
2022-11-12 12:19:55.908 [P2P1]  WARNING cn      src/cryptonote_core/cryptonote_core.cpp:2048    There were 15 blocks in the last 90 minutes, there might be large hash rate changes, or we might be partitioned, cut off from the Monero network or under attack, or your computer's time is off. Or it could be just sheer bad luck.
2022-11-12 12:20:41.110 [P2P1]  ERROR   net.zmq src/rpc/zmq_pub.cpp:538 Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 12:20:41.110 [P2P1]  ERROR   net.zmq src/rpc/zmq_pub.cpp:529 Unable to send ZMQ/Pub - ZMQ server destroyed

trasherdk avatar Nov 12 '22 21:11 trasherdk

I got a little more info from the latest crash.

2022-11-13 19:42:00.245        W Failed to connect to daemon at tcp://1.2.3.4:38082
2022-11-13 19:42:00.245        I Retrieving current active account list
2022-11-13 19:42:00.246        I No active accounts
2022-11-13 19:42:10.252        D Problems at ssl shutdown: uninitialized (SSL routines, SSL_shutdown)
2022-11-13 19:42:10.253        D Problems at cancel: Bad file descriptor
2022-11-13 19:42:10.253        D Problems at shutdown: Bad file descriptor
2022-11-13 19:42:10.258        D Destructing connection #0 to 0.0.0.0
2022-11-13 19:42:10.258        E thrown at scanner.cpp:800: Too many open files

Turning log-level up to 4 and restarting....

trasherdk avatar Nov 13 '22 23:11 trasherdk

Restarting monero-lws-daemon got me this:

2022-11-13 23:58:37.170   I Using monerod ZMQ RPC at tcp://1.2.3.4:38082
2022-11-13 23:58:37.170        I Starting blockchain sync with daemon
2022-11-13 23:59:07.715        D Problems at ssl shutdown: uninitialized (SSL routines, SSL_shutdown)
2022-11-13 23:59:07.715        D Problems at cancel: Bad file descriptor
2022-11-13 23:59:07.715        D Problems at shutdown: Bad file descriptor
2022-11-13 23:59:07.716        E Connection failed with daemon

Restarting monerod, ZMQ was down, and restarting monero-lws-daemon.. waiting...

trasherdk avatar Nov 14 '22 00:11 trasherdk

@trasherdk Here is a monerod binary with the vtnerd's ZMQ fix, please test it if it changes something: https://github.com/selsta/monero/suites/9282067270/artifacts/435113240

selsta avatar Nov 14 '22 02:11 selsta

@selsta I need a static build to test.

$ ./monerod --version
./monerod: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by ./monerod)
./monerod: /lib64/libc.so.6: version `GLIBC_2.25' not found (required by ./monerod)

trasherdk avatar Nov 16 '22 00:11 trasherdk

It is a static build but your glibc is too old. Will make another one.

selsta avatar Nov 16 '22 00:11 selsta

@trasherdk new link https://github.com/selsta/monero/suites/9326922542/artifacts/438397648

You need monero-x86_64-linux-gnu-41ca021ee.tar.bz2.

selsta avatar Nov 16 '22 02:11 selsta

@selsta It's up and running on stagenet. Thanks.

trasherdk avatar Nov 16 '22 03:11 trasherdk

Okay. Both v0.18.1.2-release (testnet) and v0.18.1.1-41ca021ee (stagenet) has been running since yesterday, without any zmq noise. There has been quite a bit of traffic on stagenet zmq the last 24 hours.

It must have been some nastiness on the wire causing the zmq crash.

I'll keep @selsta's version running on stagenet for now.

trasherdk avatar Nov 17 '22 03:11 trasherdk

So you mean both didn't crash? Patched and unpatched?

Also seems I compiled v0.18.1.1 instead of v0.18.1.2, just keep that in mind if you get the "wrong version" error message.

selsta avatar Nov 17 '22 04:11 selsta

@trasherdk I'm leaving this open for now, but we should have the monerod side fixed. Hopefully there's no further issues on the monero-lws-daemon side.

vtnerd avatar Nov 07 '23 20:11 vtnerd

For what it's worth, I'm still getting this error relatively constantly - it's just a matter of time after a restart. (Though I'm not clear if the error is actually emanating from the scanner - checking)

paulshapiro avatar Dec 27 '23 04:12 paulshapiro

@paulshapiro just to confirm, did you apply the monerod zmq fixed manually? They aren't merged yet.

selsta avatar Dec 27 '23 05:12 selsta

Yep, I have tried 2-3 of the candidate fixes from vtnerd patched manually onto source for a monerod build and watched them over a handful of the past months but I've still seen the issue from the monero-lws code. And to confirm (apologies for not having checked before I made my prior post) it looks like what I'm seeing is all somehow related to the "get_dynamic_fee_estimate" zmq client .send(). But I can only see that the error that too many files are open is emanating from the endpoint itself currently, due to the fact the log is coming from here. My system settings as far as open file limits go seem ample too.

paulshapiro avatar Dec 27 '23 06:12 paulshapiro

The monerod side is fixed, and early reports for the LWS 0.3 branch are suggesting that the too many open files issue is fixed too.

vtnerd avatar Apr 11 '24 16:04 vtnerd

This thing seems fixed.

trasherdk avatar May 21 '24 07:05 trasherdk