zebra icon indicating copy to clipboard operation
zebra copied to clipboard

Security: Remove the ErrorSlot from the heartbeat and connection tasks

Open teor2345 opened this issue 2 years ago • 9 comments

Motivation

The error handling for the Connection and heartbeat tasks is really messy. It has been really buggy in the past, and it's hard to understand or modify. The way the code is written can cause hangs or panics.

At the moment, it's causing a "failed servers must set their error slot" panic when Zebra is overloaded.

Scheduling

This is a denial of service risk which might be triggerable remotely.

This is high-risk hang code with multiple mutexes, so it should be prioritised over other potential hang or panic bugs.

Sub-Tasks

High-priority ongoing connection panics:

  • [ ] #4770
  • [x] #4733
  • [ ] #4734
  • [ ] #4735

Other panics that are very rare:

  • [ ] #4736

Cleanups:

  • [x] #4737

This will give us a much stronger guarantee that the connection task will exit when there's an error.

Related Work

  • #3254

teor2345 avatar Dec 20 '21 01:12 teor2345

redistributing issues that can be separately worked from Epic #2311

ftm1000 avatar Jan 26 '22 18:01 ftm1000

Hey team! Please add your planning poker estimate with ZenHub @conradoplg @dconnolly @gustavovalverde @jvff @oxarbitrage @teor2345 @upbqdn

ftm1000 avatar Jan 27 '22 16:01 ftm1000

This cleanup isn't needed any time soon.

teor2345 avatar Mar 01 '22 04:03 teor2345

I got a panic in this code, so I'm re-opening this ticket.

I'll add the panic info to the ticket, and the details in a comment.

teor2345 avatar Jun 27 '22 07:06 teor2345

Here is the panic I saw recently:

Message:  failed servers must set their error slot
Location: zebra-network/src/peer/client.rs:496

It looks like it might have been triggered by switching to another chain fork.

Here are the detailed logs:

2022-06-24T14:24:44.018156Z  INFO {zebrad="135aedc" net="Main"}: zebrad::components::mempool::gossip: sending mempool transaction broadcast request=AdvertiseTransactionIds({Witnessed(WtxId { id: transaction::Hash("3fa349d6b60b6a70e23af69513faceafc52ce13061a0538de364691cac2a8252"), auth_digest: AuthDigest("9a830fe29ab5db38201f44e662aa85edd146f943f687b467ad5bfd270888c5e5") })})
2022-06-24T14:24:46.446497Z  INFO {zebrad="135aedc" net="Main"}: zebrad::components::mempool::gossip: sending mempool transaction broadcast request=AdvertiseTransactionIds({Witnessed(WtxId { id: transaction::Hash("590e22047e713accbbb12062e178a78948da9ca20d05fa881fcd730333010609"), auth_digest: AuthDigest("ec7dd1511d4c99b4d6907880b3097de0d04be8ac22a93c84d110e5449beb6b03") })})   
2022-06-24T14:24:56.394244Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("176.34.40.41:8233")}:msg_as_req{msg="inv"}:inbound:download_and_verify{hash=0000000000b8a01c82a9df285251995b67492dc301e14685b362dbf04df41736}: zebrad::components::inbound::downloads: downloaded and verified gossiped block height=Height(1714438)
2022-06-24T14:24:59.349912Z  INFO {zebrad="135aedc" net="Main"}: zebrad::commands::start: finished initial sync to chain tip, using gossiped blocks sync_percent=100.000 % current_height=Height(1714438) remaining_sync_blocks=0 time_since_last_state_block=PT0S                                                                                                                            
2022-06-24T14:25:24.130776Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("157.90.88.178:9058")}:msg_as_req{msg="inv"}:inbound:download_and_verify{hash=0000000000b41efb77d297d134f441cc5e18ef2605f694a90f45d4d89e0dcab6}: zebrad::components::inbound::downloads: downloaded and verified gossiped block height=Height(1714438)
2022-06-24T14:25:24.131484Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("8.209.65.101:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection                                                                                                                                                                                     
2022-06-24T14:25:24.131481Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("52.28.203.21:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-24T14:25:24.131285Z  INFO {zebrad="135aedc" net="Main"}: zebrad::components::mempool: resetting mempool: switched best chain, skipped blocks, or activated network upgrade tip_height=Height(1714438)                               
2022-06-24T14:25:24.131403Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("209.141.47.197:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection                                   
(repeated approximately 75 times)

The application panicked (crashed).
Message:  failed servers must set their error slot
Location: zebra-network/src/peer/client.rs:496

Metadata:
version: 1.0.0-beta.11+48.g135aedc
Zcash network: Mainnet
state version: 24
branch: min-lookahead-limit
git commit: 135aedc
commit timestamp: 2022-06-22T03:34:32Z
target triple: x86_64-unknown-linux-gnu
build profile: release

   0: zebrad::components::inbound::downloads::download_and_verify with hash=0000000000b8a01c82a9df285251995b67492dc301e14685b362dbf04df41736                                                   
      at zebrad/src/components/inbound/downloads.rs:1792022-06-24T14:25:24.131597Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("3.249.165.235:5001")}:msg_as_req{msg="inv"}: zebra_network::p
eer::connection: inbound service is overloaded, closing connection
2022-06-24T14:25:24.131603Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("84.75.28.247:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection                                     
2022-06-24T14:25:24.131605Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("147.135.11.134:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection                                   
2022-06-24T14:25:24.131611Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("45.43.30.2:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection                                       
2022-06-24T14:25:24.131612Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("52.211.176.240:5001")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection                                   
2022-06-24T14:25:24.131613Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("213.252.244.214:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection                                  
2022-06-24T14:25:24.131637Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("176.31.239.109:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection 

   1: zebrad::components::inbound::inbound
      at zebrad/src/components/inbound.rs:289

   2: zebra_network::peer::connection::msg_as_req with msg="inv"
      at zebra-network/src/peer/connection.rs:979

   3: zebra_network::peer::handshake:: with peer=Out("203.96.179.202:8233")
      at zebra-network/src/peer/handshake.rs:791

   4: zebrad::application:: with zebrad="135aedc" net="Main"
      at zebrad/src/application.rs:376

teor2345 avatar Jun 27 '22 07:06 teor2345

ticket to be split and re-estimated.

ftm1000 avatar Jun 27 '22 21:06 ftm1000

I think I might have found another instance of this bug, but it results in a syncer hang, not a panic.

The logs look like:

2022-06-30T10:06:24.137697Z  WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 38 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.914% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=38m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T10:07:18.412883Z  WARN {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: error downloading and verifying block e=ValidationRequestError { error: Elapsed(()), height: Height(1719631), hash: block::Hash("00000000000db5ffa65d7581bfe54a9aa39e7d257d1920ae76eae2e34974636e") }
2022-06-30T10:07:18.413029Z  INFO {zebrad="8ff5146" net="Main"}:sync: zebrad::components::sync: waiting to restart sync timeout=67s state_tip=Some(Height(1719629))
2022-06-30T10:09:59.266078Z  INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: starting sync, obtaining new tips state_tip=Some(Height(1719629))
2022-06-30T10:09:59.266146Z  WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 41 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.913% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=41m 35s target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T10:09:59.266574Z  INFO {zebrad="8ff5146" net="Main"}:{peer=Out("161.53.201.188:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection

Here are the full logs from the start of the hang:

2022-06-30T09:24:24.090749Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719622) network_upgrade=Nu5 remaining_sync_blocks=1467 time_since_last_state_block=0s
2022-06-30T09:24:42.602806Z  INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=9 extra_hashes=463 lookahead_limit=5
2022-06-30T09:25:24.092622Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719627) network_upgrade=Nu5 remaining_sync_blocks=1460 time_since_last_state_block=0s
2022-06-30T09:26:24.092922Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719627) network_upgrade=Nu5 remaining_sync_blocks=1461 time_since_last_state_block=1m
2022-06-30T09:27:24.094536Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719627) network_upgrade=Nu5 remaining_sync_blocks=1461 time_since_last_state_block=2m
2022-06-30T09:27:54.159075Z  INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=8 extra_hashes=458 lookahead_limit=5
2022-06-30T09:28:24.095393Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1458 time_since_last_state_block=0s
2022-06-30T09:29:24.096079Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1459 time_since_last_state_block=1m
2022-06-30T09:30:24.097127Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1460 time_since_last_state_block=2m
2022-06-30T09:31:24.098606Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1460 time_since_last_state_block=3m
2022-06-30T09:32:24.099312Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1461 time_since_last_state_block=4m
2022-06-30T09:32:37.021862Z  INFO {zebrad="8ff5146" net="Main"}:crawl_and_dial{new_peer_interval=61s}: zebra_network::peer_set::candidate_set: timeout waiting for peer service readiness or peer responses
2022-06-30T09:33:24.100697Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1462 time_since_last_state_block=5m
2022-06-30T09:33:54.462655Z  WARN {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: error downloading and verifying block e=Invalid { error: Block(Transaction(InternalDowncastError("downcast to known transaction error type failed, original error: Elapsed(())"))), height: Height(1719636), hash: block::Hash("000000000116e5c27c040e82740852a1eb60365360063203161c1ae8e5689ebf") }
2022-06-30T09:33:54.463531Z  INFO {zebrad="8ff5146" net="Main"}:sync: zebrad::components::sync: waiting to restart sync timeout=67s state_tip=Some(Height(1719629))
2022-06-30T09:34:24.102078Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1463 time_since_last_state_block=6m
2022-06-30T09:34:39.021738Z  INFO {zebrad="8ff5146" net="Main"}:crawl_and_dial{new_peer_interval=61s}: zebra_network::peer_set::candidate_set: timeout waiting for peer service readiness or peer responses
2022-06-30T09:35:01.464711Z  INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: starting sync, obtaining new tips state_tip=Some(Height(1719629))
2022-06-30T09:35:03.020050Z  INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=5 extra_hashes=494 lookahead_limit=5
2022-06-30T09:35:24.103447Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1464 time_since_last_state_block=7m
2022-06-30T09:36:24.104184Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1464 time_since_last_state_block=8m
2022-06-30T09:37:24.105729Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1465 time_since_last_state_block=9m
2022-06-30T09:38:24.107319Z  WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 10 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=10m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T09:39:24.108535Z  WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 11 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=11m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T09:40:24.109445Z  WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 12 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=12m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
...(similar logs for 25 minutes)...
2022-06-30T10:06:24.137697Z  WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 38 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.914% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=38m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T10:07:18.412883Z  WARN {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: error downloading and verifying block e=ValidationRequestError { error: Elapsed(()), height: Height(1719631), hash: block::Hash("00000000000db5ffa65d7581bfe54a9aa39e7d257d1920ae76eae2e34974636e") }
2022-06-30T10:07:18.413029Z  INFO {zebrad="8ff5146" net="Main"}:sync: zebrad::components::sync: waiting to restart sync timeout=67s state_tip=Some(Height(1719629))
2022-06-30T10:09:59.266078Z  INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: starting sync, obtaining new tips state_tip=Some(Height(1719629))
2022-06-30T10:09:59.266146Z  WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 41 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.913% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=41m 35s target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T10:09:59.266574Z  INFO {zebrad="8ff5146" net="Main"}:{peer=Out("161.53.201.188:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-30T10:09:59.266574Z  INFO {zebrad="8ff5146" net="Main"}:{peer=Out("135.181.210.43:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-30T10:09:59.266579Z  INFO {zebrad="8ff5146" net="Main"}:{peer=Out("104.59.147.15:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-30T10:09:59.266603Z  INFO {zebrad="8ff5146" net="Main"}:{peer=Out("65.108.79.158:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-30T10:09:59.266613Z  INFO {zebrad="8ff5146" net="Main"}:{peer=Out("185.189.149.189:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
... (repeated about 75 times) ...

teor2345 avatar Jun 30 '22 21:06 teor2345

Here is the panic I saw recently:

Message:  failed servers must set their error slot
Location: zebra-network/src/peer/client.rs:496

I saw another instance of the same panic today.

There was no chain fork, so it seems like it is just caused by inbound service overload.

teor2345 avatar Jul 04 '22 00:07 teor2345

I'm seeing panics in this code every few days now.

teor2345 avatar Jul 10 '22 21:07 teor2345