zebra
zebra copied to clipboard
Security: Remove the ErrorSlot from the heartbeat and connection tasks
Motivation
The error handling for the Connection
and heartbeat tasks is really messy. It has been really buggy in the past, and it's hard to understand or modify. The way the code is written can cause hangs or panics.
At the moment, it's causing a "failed servers must set their error slot" panic when Zebra is overloaded.
Scheduling
This is a denial of service risk which might be triggerable remotely.
This is high-risk hang code with multiple mutexes, so it should be prioritised over other potential hang or panic bugs.
Sub-Tasks
High-priority ongoing connection panics:
- [ ] #4770
- [x] #4733
- [ ] #4734
- [ ] #4735
Other panics that are very rare:
- [ ] #4736
Cleanups:
- [x] #4737
This will give us a much stronger guarantee that the connection task will exit when there's an error.
Related Work
- #3254
redistributing issues that can be separately worked from Epic #2311
Hey team! Please add your planning poker estimate with ZenHub @conradoplg @dconnolly @gustavovalverde @jvff @oxarbitrage @teor2345 @upbqdn
This cleanup isn't needed any time soon.
I got a panic in this code, so I'm re-opening this ticket.
I'll add the panic info to the ticket, and the details in a comment.
Here is the panic I saw recently:
Message: failed servers must set their error slot
Location: zebra-network/src/peer/client.rs:496
It looks like it might have been triggered by switching to another chain fork.
Here are the detailed logs:
2022-06-24T14:24:44.018156Z INFO {zebrad="135aedc" net="Main"}: zebrad::components::mempool::gossip: sending mempool transaction broadcast request=AdvertiseTransactionIds({Witnessed(WtxId { id: transaction::Hash("3fa349d6b60b6a70e23af69513faceafc52ce13061a0538de364691cac2a8252"), auth_digest: AuthDigest("9a830fe29ab5db38201f44e662aa85edd146f943f687b467ad5bfd270888c5e5") })})
2022-06-24T14:24:46.446497Z INFO {zebrad="135aedc" net="Main"}: zebrad::components::mempool::gossip: sending mempool transaction broadcast request=AdvertiseTransactionIds({Witnessed(WtxId { id: transaction::Hash("590e22047e713accbbb12062e178a78948da9ca20d05fa881fcd730333010609"), auth_digest: AuthDigest("ec7dd1511d4c99b4d6907880b3097de0d04be8ac22a93c84d110e5449beb6b03") })})
2022-06-24T14:24:56.394244Z INFO {zebrad="135aedc" net="Main"}:{peer=Out("176.34.40.41:8233")}:msg_as_req{msg="inv"}:inbound:download_and_verify{hash=0000000000b8a01c82a9df285251995b67492dc301e14685b362dbf04df41736}: zebrad::components::inbound::downloads: downloaded and verified gossiped block height=Height(1714438)
2022-06-24T14:24:59.349912Z INFO {zebrad="135aedc" net="Main"}: zebrad::commands::start: finished initial sync to chain tip, using gossiped blocks sync_percent=100.000 % current_height=Height(1714438) remaining_sync_blocks=0 time_since_last_state_block=PT0S
2022-06-24T14:25:24.130776Z INFO {zebrad="135aedc" net="Main"}:{peer=Out("157.90.88.178:9058")}:msg_as_req{msg="inv"}:inbound:download_and_verify{hash=0000000000b41efb77d297d134f441cc5e18ef2605f694a90f45d4d89e0dcab6}: zebrad::components::inbound::downloads: downloaded and verified gossiped block height=Height(1714438)
2022-06-24T14:25:24.131484Z INFO {zebrad="135aedc" net="Main"}:{peer=Out("8.209.65.101:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-24T14:25:24.131481Z INFO {zebrad="135aedc" net="Main"}:{peer=Out("52.28.203.21:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-24T14:25:24.131285Z INFO {zebrad="135aedc" net="Main"}: zebrad::components::mempool: resetting mempool: switched best chain, skipped blocks, or activated network upgrade tip_height=Height(1714438)
2022-06-24T14:25:24.131403Z INFO {zebrad="135aedc" net="Main"}:{peer=Out("209.141.47.197:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
(repeated approximately 75 times)
The application panicked (crashed).
Message: failed servers must set their error slot
Location: zebra-network/src/peer/client.rs:496
Metadata:
version: 1.0.0-beta.11+48.g135aedc
Zcash network: Mainnet
state version: 24
branch: min-lookahead-limit
git commit: 135aedc
commit timestamp: 2022-06-22T03:34:32Z
target triple: x86_64-unknown-linux-gnu
build profile: release
0: zebrad::components::inbound::downloads::download_and_verify with hash=0000000000b8a01c82a9df285251995b67492dc301e14685b362dbf04df41736
at zebrad/src/components/inbound/downloads.rs:1792022-06-24T14:25:24.131597Z INFO {zebrad="135aedc" net="Main"}:{peer=Out("3.249.165.235:5001")}:msg_as_req{msg="inv"}: zebra_network::p
eer::connection: inbound service is overloaded, closing connection
2022-06-24T14:25:24.131603Z INFO {zebrad="135aedc" net="Main"}:{peer=Out("84.75.28.247:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-24T14:25:24.131605Z INFO {zebrad="135aedc" net="Main"}:{peer=Out("147.135.11.134:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-24T14:25:24.131611Z INFO {zebrad="135aedc" net="Main"}:{peer=Out("45.43.30.2:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-24T14:25:24.131612Z INFO {zebrad="135aedc" net="Main"}:{peer=Out("52.211.176.240:5001")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-24T14:25:24.131613Z INFO {zebrad="135aedc" net="Main"}:{peer=Out("213.252.244.214:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-24T14:25:24.131637Z INFO {zebrad="135aedc" net="Main"}:{peer=Out("176.31.239.109:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
1: zebrad::components::inbound::inbound
at zebrad/src/components/inbound.rs:289
2: zebra_network::peer::connection::msg_as_req with msg="inv"
at zebra-network/src/peer/connection.rs:979
3: zebra_network::peer::handshake:: with peer=Out("203.96.179.202:8233")
at zebra-network/src/peer/handshake.rs:791
4: zebrad::application:: with zebrad="135aedc" net="Main"
at zebrad/src/application.rs:376
ticket to be split and re-estimated.
I think I might have found another instance of this bug, but it results in a syncer hang, not a panic.
The logs look like:
2022-06-30T10:06:24.137697Z WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 38 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.914% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=38m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T10:07:18.412883Z WARN {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: error downloading and verifying block e=ValidationRequestError { error: Elapsed(()), height: Height(1719631), hash: block::Hash("00000000000db5ffa65d7581bfe54a9aa39e7d257d1920ae76eae2e34974636e") }
2022-06-30T10:07:18.413029Z INFO {zebrad="8ff5146" net="Main"}:sync: zebrad::components::sync: waiting to restart sync timeout=67s state_tip=Some(Height(1719629))
2022-06-30T10:09:59.266078Z INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: starting sync, obtaining new tips state_tip=Some(Height(1719629))
2022-06-30T10:09:59.266146Z WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 41 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.913% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=41m 35s target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T10:09:59.266574Z INFO {zebrad="8ff5146" net="Main"}:{peer=Out("161.53.201.188:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
Here are the full logs from the start of the hang:
2022-06-30T09:24:24.090749Z INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719622) network_upgrade=Nu5 remaining_sync_blocks=1467 time_since_last_state_block=0s
2022-06-30T09:24:42.602806Z INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=9 extra_hashes=463 lookahead_limit=5
2022-06-30T09:25:24.092622Z INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719627) network_upgrade=Nu5 remaining_sync_blocks=1460 time_since_last_state_block=0s
2022-06-30T09:26:24.092922Z INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719627) network_upgrade=Nu5 remaining_sync_blocks=1461 time_since_last_state_block=1m
2022-06-30T09:27:24.094536Z INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719627) network_upgrade=Nu5 remaining_sync_blocks=1461 time_since_last_state_block=2m
2022-06-30T09:27:54.159075Z INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=8 extra_hashes=458 lookahead_limit=5
2022-06-30T09:28:24.095393Z INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1458 time_since_last_state_block=0s
2022-06-30T09:29:24.096079Z INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1459 time_since_last_state_block=1m
2022-06-30T09:30:24.097127Z INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1460 time_since_last_state_block=2m
2022-06-30T09:31:24.098606Z INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1460 time_since_last_state_block=3m
2022-06-30T09:32:24.099312Z INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1461 time_since_last_state_block=4m
2022-06-30T09:32:37.021862Z INFO {zebrad="8ff5146" net="Main"}:crawl_and_dial{new_peer_interval=61s}: zebra_network::peer_set::candidate_set: timeout waiting for peer service readiness or peer responses
2022-06-30T09:33:24.100697Z INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1462 time_since_last_state_block=5m
2022-06-30T09:33:54.462655Z WARN {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: error downloading and verifying block e=Invalid { error: Block(Transaction(InternalDowncastError("downcast to known transaction error type failed, original error: Elapsed(())"))), height: Height(1719636), hash: block::Hash("000000000116e5c27c040e82740852a1eb60365360063203161c1ae8e5689ebf") }
2022-06-30T09:33:54.463531Z INFO {zebrad="8ff5146" net="Main"}:sync: zebrad::components::sync: waiting to restart sync timeout=67s state_tip=Some(Height(1719629))
2022-06-30T09:34:24.102078Z INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1463 time_since_last_state_block=6m
2022-06-30T09:34:39.021738Z INFO {zebrad="8ff5146" net="Main"}:crawl_and_dial{new_peer_interval=61s}: zebra_network::peer_set::candidate_set: timeout waiting for peer service readiness or peer responses
2022-06-30T09:35:01.464711Z INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: starting sync, obtaining new tips state_tip=Some(Height(1719629))
2022-06-30T09:35:03.020050Z INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=5 extra_hashes=494 lookahead_limit=5
2022-06-30T09:35:24.103447Z INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1464 time_since_last_state_block=7m
2022-06-30T09:36:24.104184Z INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1464 time_since_last_state_block=8m
2022-06-30T09:37:24.105729Z INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1465 time_since_last_state_block=9m
2022-06-30T09:38:24.107319Z WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 10 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=10m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T09:39:24.108535Z WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 11 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=11m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T09:40:24.109445Z WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 12 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=12m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
...(similar logs for 25 minutes)...
2022-06-30T10:06:24.137697Z WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 38 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.914% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=38m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T10:07:18.412883Z WARN {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: error downloading and verifying block e=ValidationRequestError { error: Elapsed(()), height: Height(1719631), hash: block::Hash("00000000000db5ffa65d7581bfe54a9aa39e7d257d1920ae76eae2e34974636e") }
2022-06-30T10:07:18.413029Z INFO {zebrad="8ff5146" net="Main"}:sync: zebrad::components::sync: waiting to restart sync timeout=67s state_tip=Some(Height(1719629))
2022-06-30T10:09:59.266078Z INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: starting sync, obtaining new tips state_tip=Some(Height(1719629))
2022-06-30T10:09:59.266146Z WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 41 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.913% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=41m 35s target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T10:09:59.266574Z INFO {zebrad="8ff5146" net="Main"}:{peer=Out("161.53.201.188:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-30T10:09:59.266574Z INFO {zebrad="8ff5146" net="Main"}:{peer=Out("135.181.210.43:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-30T10:09:59.266579Z INFO {zebrad="8ff5146" net="Main"}:{peer=Out("104.59.147.15:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-30T10:09:59.266603Z INFO {zebrad="8ff5146" net="Main"}:{peer=Out("65.108.79.158:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-30T10:09:59.266613Z INFO {zebrad="8ff5146" net="Main"}:{peer=Out("185.189.149.189:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
... (repeated about 75 times) ...
Here is the panic I saw recently:
Message: failed servers must set their error slot Location: zebra-network/src/peer/client.rs:496
I saw another instance of the same panic today.
There was no chain fork, so it seems like it is just caused by inbound service overload.
I'm seeing panics in this code every few days now.