monero-lws
monero-lws copied to clipboard
Monero zmq crash takes down `monero-lws-daemon`. (Probably)
monerod is running, all good.
tcp 0 0 0.0.0.0:38080 0.0.0.0:* LISTEN 7425/monerod
tcp 0 0 127.0.0.1:38081 0.0.0.0:* LISTEN 7425/monerod
tcp 0 0 1.2.3.4:38082 0.0.0.0:* LISTEN 7425/monerod
tcp 0 0 1.2.3.4:38083 0.0.0.0:* LISTEN 7425/monerod
tcp 0 0 1.2.3.4:38084 0.0.0.0:* LISTEN 7425/monerod
monerod is still running, but zmq is gone.
tcp 0 0 0.0.0.0:38080 0.0.0.0:* LISTEN 6527/monerod
tcp 0 0 127.0.0.1:38081 0.0.0.0:* LISTEN 6527/monerod
tcp 0 0 1.2.3.4:38084 0.0.0.0:* LISTEN 6527/monerod
monero-lws-daemon is running on a different host.
After some time, after zmq is gone, this happens.
2022-11-11 21:34:43.847 I Retrieving current active account list
2022-11-11 21:34:43.847 I No active accounts
2022-11-11 21:34:53.848 I Starting blockchain sync with daemon
2022-11-11 21:35:24.315 W Failed to connect to daemon at tcp://1.2.3.4:38082
2022-11-11 21:35:24.315 I Retrieving current active account list
2022-11-11 21:35:24.315 I No active accounts
2022-11-11 21:35:34.356 E thrown at scanner.cpp:800: Too many open files
There's a shitload of Failed to connect to daemon at tcp://1.2.3.4:38082 before the crash, so how long after, I don't know.
I'm starting to think the lws crash was unrelated to zmq server was down.
Without a log-file on monero-lws-daemon it's impossible to say which one went down first.
The monero-lws-daemon does not have that many file-handles open. 430-450 including events.
I shut monerod down
at 2022-11-12 01:28:18.875 W Failed to connect to daemon at tcp://1.2.3.4:38082
monero-lws-daemon is still running
at 2022-11-12 05:47:02.075 W Failed to connect to daemon at tcp://1.2.3.4:38082
Could it be something it received from zmq ?
This is when zmq crashed on monerod:
2022-11-11 11:25:59.635 [P2P1] INFO net.p2p.msg src/cryptonote_protocol/cryptonote_protocol_handler.inl:550 [188.34.179.54:52902 INC] Received NOTIFY_NEW_FLUFFY_BLOCK <e95a3a4ea3960a337d047fa778a540fd3fe56e89fa46fa4623d07dbe34d8c285> (height 1220696, 0 txes)
2022-11-11 11:25:59.635 [P2P4] INFO perf.blockchain src/common/perf_timer.cpp:120 PERF ----------
2022-11-11 11:25:59.635 [P2P4] INFO perf.blockchain src/common/perf_timer.cpp:156 PERF 7 get_long_term_block_weights
2022-11-11 11:25:59.635 [P2P4] INFO perf.blockchain src/common/perf_timer.cpp:156 PERF 125 update_next_cumulative_weight_limit
2022-11-11 11:25:59.636 [P2P4] INFO blockchain src/cryptonote_core/blockchain.cpp:4524 +++++ BLOCK SUCCESSFULLY ADDED
2022-11-11 11:25:59.636 [P2P4] INFO blockchain src/cryptonote_core/blockchain.cpp:4524 id: <e95a3a4ea3960a337d047fa778a540fd3fe56e89fa46fa4623d07dbe34d8c285>
2022-11-11 11:25:59.636 [P2P4] INFO blockchain src/cryptonote_core/blockchain.cpp:4524 PoW: <ea79e4acc1b0c5289d896a971b107a2a54a2ffe563bf25b90493c84f8a510000>
2022-11-11 11:25:59.636 [P2P4] INFO blockchain src/cryptonote_core/blockchain.cpp:4524 HEIGHT 1220695, difficulty: 111144
2022-11-11 11:25:59.636 [P2P4] INFO blockchain src/cryptonote_core/blockchain.cpp:4524 block reward: 3.538476387438(3.538476387438 + 0.000000000000), coinbase_weight: 85, cumulative weight: 85, 34(0/33)ms
2022-11-11 11:25:59.637 [P2P5] INFO net.p2p.traffic contrib/epee/include/net/levin_protocol_handler_async.h:57 [78.46.102.70:38080 OUT] 187 bytes received for category command-2008 initiated by peer
2022-11-11 11:25:59.637 [P2P5] INFO net.p2p.msg src/cryptonote_protocol/cryptonote_protocol_handler.inl:550 [78.46.102.70:38080 OUT] Received NOTIFY_NEW_FLUFFY_BLOCK <e95a3a4ea3960a337d047fa778a540fd3fe56e89fa46fa4623d07dbe34d8c285> (height 1220696, 0 txes)
2022-11-11 11:25:59.638 [P2P4] ERROR net.zmq src/rpc/zmq_pub.cpp:538 Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-11 11:25:59.638 [P2P4] ERROR net.zmq src/rpc/zmq_pub.cpp:529 Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-11 11:25:59.641 [P2P0] INFO net.p2p.traffic contrib/epee/include/net/levin_protocol_handler_async.h:57 [18.206.13.223:54532 INC] 187 bytes received for category command-2008 initiated by peer
Okay, here we go again.
monerod
2022-11-12 11:18:47.711 I Found block <ae713803b7f463493a421939c51a6bc7f322db359393260ea219736ba282e1c2> at height 1221412 for difficulty: 106982
2022-11-12 11:26:20.469 I Found block <f93d44351b7fb02f7e8af8a6b5554556fe5defa67f376e950ae894c3baeaeb98> at height 1221418 for difficulty: 106276
2022-11-12 11:47:00.976 I Found block <7612be135b7d7ff85a3caab7e2d286ac5f3b1498821ddfd88390b7249412b8c0> at height 1221434 for difficulty: 107536
2022-11-12 12:10:56.926 E ZMQ RPC Server Error: thrown at zmq_server.cpp:161: Resource temporarily unavailable
2022-11-12 12:11:18.102 I Found block <b50745860f385ff1a2a12c88deec69af22cab6e04951c10aaeda7a31398337d1> at height 1221447 for difficulty: 106169
2022-11-12 12:11:18.152 E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 12:11:18.153 E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 12:11:49.473 E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 12:11:49.473 E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 12:13:19.091 E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 12:13:19.091 E Unable to send ZMQ/Pub - ZMQ server destroyed
...
...
2022-11-12 18:26:02.081 E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 18:26:02.081 E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 18:26:52.542 I Found block <ace0338ea93db5f7d9c24dc7300e0b2a55e2fee48c3fd5f2b28754853adae3b6> at height 1221669 for difficulty: 114610
2022-11-12 18:26:52.583 E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 18:26:52.583 E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 18:27:40.895 E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 18:27:52.347 E Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 18:27:52.347 E Unable to send ZMQ/Pub - ZMQ server destroyed
monero-lws
2022-11-12 18:25:37.079 W Failed to connect to daemon at tcp://1.2.3.4:38082
2022-11-12 18:25:37.079 I Retrieving current active account list
2022-11-12 18:25:37.079 I No active accounts
2022-11-12 18:25:47.103 I Starting blockchain sync with daemon
2022-11-12 18:26:17.451 W Failed to connect to daemon at tcp://1.2.3.4:38082
2022-11-12 18:26:17.451 I Retrieving current active account list
2022-11-12 18:26:17.452 I No active accounts
2022-11-12 18:26:27.459 I Starting blockchain sync with daemon
2022-11-12 18:26:57.702 W Failed to connect to daemon at tcp://1.2.3.4:38082
2022-11-12 18:26:57.702 I Retrieving current active account list
2022-11-12 18:26:57.702 I No active accounts
2022-11-12 18:27:07.740 E thrown at scanner.cpp:800: Too many open files
And, here's the testnet crash.
2022-11-12 12:10:56.683 7f48f57fa700 ERROR net.zmq src/rpc/zmq_server.cpp:174 ZMQ RPC Server Error: thrown at zmq_server.cpp:161: Resource temporarily unavailable
2022-11-12 12:12:25.187 [P2P6] WARNING cn src/cryptonote_core/cryptonote_core.cpp:2048 There were 17 blocks in the last 90 minutes, there might be large hash rate changes, or we might be partitioned, cut off from the Monero network or under attack, or your computer's time is off. Or it could be just sheer bad luck.
2022-11-12 12:13:55.304 [P2P8] WARNING cn src/cryptonote_core/cryptonote_core.cpp:2048 There were 16 blocks in the last 90 minutes, there might be large hash rate changes, or we might be partitioned, cut off from the Monero network or under attack, or your computer's time is off. Or it could be just sheer bad luck.
2022-11-12 12:15:25.436 [P2P3] WARNING cn src/cryptonote_core/cryptonote_core.cpp:2048 There were 16 blocks in the last 90 minutes, there might be large hash rate changes, or we might be partitioned, cut off from the Monero network or under attack, or your computer's time is off. Or it could be just sheer bad luck.
2022-11-12 12:16:55.655 [P2P0] WARNING cn src/cryptonote_core/cryptonote_core.cpp:2048 There were 16 blocks in the last 90 minutes, there might be large hash rate changes, or we might be partitioned, cut off from the Monero network or under attack, or your computer's time is off. Or it could be just sheer bad luck.
2022-11-12 12:18:25.831 [P2P4] WARNING cn src/cryptonote_core/cryptonote_core.cpp:2048 There were 16 blocks in the last 90 minutes, there might be large hash rate changes, or we might be partitioned, cut off from the Monero network or under attack, or your computer's time is off. Or it could be just sheer bad luck.
2022-11-12 12:19:55.908 [P2P1] WARNING cn src/cryptonote_core/cryptonote_core.cpp:2048 There were 15 blocks in the last 90 minutes, there might be large hash rate changes, or we might be partitioned, cut off from the Monero network or under attack, or your computer's time is off. Or it could be just sheer bad luck.
2022-11-12 12:20:41.110 [P2P1] ERROR net.zmq src/rpc/zmq_pub.cpp:538 Unable to send ZMQ/Pub - ZMQ server destroyed
2022-11-12 12:20:41.110 [P2P1] ERROR net.zmq src/rpc/zmq_pub.cpp:529 Unable to send ZMQ/Pub - ZMQ server destroyed
I got a little more info from the latest crash.
2022-11-13 19:42:00.245 W Failed to connect to daemon at tcp://1.2.3.4:38082
2022-11-13 19:42:00.245 I Retrieving current active account list
2022-11-13 19:42:00.246 I No active accounts
2022-11-13 19:42:10.252 D Problems at ssl shutdown: uninitialized (SSL routines, SSL_shutdown)
2022-11-13 19:42:10.253 D Problems at cancel: Bad file descriptor
2022-11-13 19:42:10.253 D Problems at shutdown: Bad file descriptor
2022-11-13 19:42:10.258 D Destructing connection #0 to 0.0.0.0
2022-11-13 19:42:10.258 E thrown at scanner.cpp:800: Too many open files
Turning log-level up to 4 and restarting....
Restarting monero-lws-daemon got me this:
2022-11-13 23:58:37.170 I Using monerod ZMQ RPC at tcp://1.2.3.4:38082
2022-11-13 23:58:37.170 I Starting blockchain sync with daemon
2022-11-13 23:59:07.715 D Problems at ssl shutdown: uninitialized (SSL routines, SSL_shutdown)
2022-11-13 23:59:07.715 D Problems at cancel: Bad file descriptor
2022-11-13 23:59:07.715 D Problems at shutdown: Bad file descriptor
2022-11-13 23:59:07.716 E Connection failed with daemon
Restarting monerod, ZMQ was down, and restarting monero-lws-daemon.. waiting...
@trasherdk Here is a monerod binary with the vtnerd's ZMQ fix, please test it if it changes something: https://github.com/selsta/monero/suites/9282067270/artifacts/435113240
@selsta I need a static build to test.
$ ./monerod --version
./monerod: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by ./monerod)
./monerod: /lib64/libc.so.6: version `GLIBC_2.25' not found (required by ./monerod)
It is a static build but your glibc is too old. Will make another one.
@trasherdk new link https://github.com/selsta/monero/suites/9326922542/artifacts/438397648
You need monero-x86_64-linux-gnu-41ca021ee.tar.bz2.
@selsta It's up and running on stagenet. Thanks.
Okay. Both v0.18.1.2-release (testnet) and v0.18.1.1-41ca021ee (stagenet) has been running since yesterday, without any zmq noise. There has been quite a bit of traffic on stagenet zmq the last 24 hours.
It must have been some nastiness on the wire causing the zmq crash.
I'll keep @selsta's version running on stagenet for now.
So you mean both didn't crash? Patched and unpatched?
Also seems I compiled v0.18.1.1 instead of v0.18.1.2, just keep that in mind if you get the "wrong version" error message.
@trasherdk I'm leaving this open for now, but we should have the monerod side fixed. Hopefully there's no further issues on the monero-lws-daemon side.
For what it's worth, I'm still getting this error relatively constantly - it's just a matter of time after a restart. (Though I'm not clear if the error is actually emanating from the scanner - checking)
@paulshapiro just to confirm, did you apply the monerod zmq fixed manually? They aren't merged yet.
Yep, I have tried 2-3 of the candidate fixes from vtnerd patched manually onto source for a monerod build and watched them over a handful of the past months but I've still seen the issue from the monero-lws code. And to confirm (apologies for not having checked before I made my prior post) it looks like what I'm seeing is all somehow related to the "get_dynamic_fee_estimate" zmq client .send(). But I can only see that the error that too many files are open is emanating from the endpoint itself currently, due to the fact the log is coming from here. My system settings as far as open file limits go seem ample too.
The monerod side is fixed, and early reports for the LWS 0.3 branch are suggesting that the too many open files issue is fixed too.
This thing seems fixed.