
salud IsHealthy using wrong radius

Open ldeffenb opened this issue 1 year ago • 2 comments

Context

v2.1.0 (and earlier)

Summary

Several of my sepolia testnet nodes are not participating in the storage compensation rounds. All of these nodes have storage radius 4 while the remainder of the swarm has increased to radius 5. Radius 4 is CORRECT for these lesser-populated neighborhoods.

The nodes are logging:

"time"="2024-05-29 07:22:14.260110" "level"="info" "logger"="node/storageincentives" "msg"="skipping round because node is unhealhy" "round"=39473

and

"time"="2024-05-29 07:56:05.200700" "level"="warning" "logger"="node/salud" "msg"="node is unhealthy due to storage radius discrepency" "self_radius"=4 "network_radius"=5

Expected behavior

If a node has the same radius as its neighborhood peers, then it should be considered healthy, regardless of what the radius is in other neighborhoods.

Actual behavior

Because other neighborhoods in the swarm have increased to radius 5, the lesser-populated neighborhoods are not participating in the storage compensation.

Steps to reproduce

Just fire up a node in one of the lesser-populated, radius 4 sepolia testnet neighborhoods. Specifically (at this point in time): 0x480, 0xb80, 0xc80, 0xd00, 0xdef, 0xe80

Possible solution

Use the neighborhood's radius for the health calculation, rather than the overall swarm radius, which may be different.

Here's the /status/peers output of one of the affected nodes. 4635-status-peers.txt

ldeffenb avatar May 29 '24 12:05 ldeffenb

Here is the /status output for each of the radius 4 nodes/neighborhoods:

  "peer": "480...",
  "proximity": 0,
  "beeMode": "full",
  "reserveSize": 4193860,
  "reserveSizeWithinRadius": 3426466,
  "pullsyncRate": 0,
  "storageRadius": 4,
  "connectedPeers": 41,
  "neighborhoodSize": 0,
  "batchCommitment": 2757492736,
  "isReachable": true

  "peer": "b80...",
  "proximity": 0,
  "beeMode": "full",
  "reserveSize": 4193515,
  "reserveSizeWithinRadius": 3430038,
  "pullsyncRate": 0,
  "storageRadius": 4,
  "connectedPeers": 41,
  "neighborhoodSize": 0,
  "batchCommitment": 2757492736,
  "isReachable": true

  "peer": "c80...",
  "proximity": 0,
  "beeMode": "full",
  "reserveSize": 4176960,
  "reserveSizeWithinRadius": 3810215,
  "pullsyncRate": 0,
  "storageRadius": 4,
  "connectedPeers": 41,
  "neighborhoodSize": 0,
  "batchCommitment": 2757492736,
  "isReachable": true

  "peer": "d00...",
  "proximity": 0,
  "beeMode": "full",
  "reserveSize": 4093335,
  "reserveSizeWithinRadius": 4043948,
  "pullsyncRate": 0,
  "storageRadius": 4,
  "connectedPeers": 41,
  "neighborhoodSize": 2,
  "batchCommitment": 2757492736,
  "isReachable": true

  "peer": "def...
  "proximity": 0,
  "beeMode": "full",
  "reserveSize": 4159288,
  "reserveSizeWithinRadius": 4043952,
  "pullsyncRate": 0,
  "storageRadius": 4,
  "connectedPeers": 37,
  "neighborhoodSize": 2,
  "batchCommitment": 2757492736,
  "isReachable": true

  "peer": "e80...",
  "proximity": 0,
  "beeMode": "full",
  "reserveSize": 4181723,
  "reserveSizeWithinRadius": 3733244,
  "pullsyncRate": 0,
  "storageRadius": 4,
  "connectedPeers": 41,
  "neighborhoodSize": 0,
  "batchCommitment": 2757492736,
  "isReachable": true

If you compare those reserveSizeWithinRadius values to the radius 5 nodes in the attached /status/peers file, you'll notice that the radius 4 nodes have almost full reserves while the radius 5 nodes are only about half full, consistent with a recent radius increase that didn't land uniformly across the swarm.

If this can happen in testnet and persist for several days (as it has, until I noticed), then it can certainly happen in mainnet and go unnoticed across the 1,024, 2,048, or even 4,096 neighborhoods.

ldeffenb avatar May 29 '24 12:05 ldeffenb

Interestingly, salud allows a peer's radius to be one less than the network radius (scroll right to see the -1): https://github.com/ethersphere/bee/blob/97e7ee699be3b4325a233b1ca2dc177cd88f17e1/pkg/salud/salud.go#L203 But it requires the node itself to be equal to the network radius: https://github.com/ethersphere/bee/blob/97e7ee699be3b4325a233b1ca2dc177cd88f17e1/pkg/salud/salud.go#L225

ldeffenb avatar May 29 '24 12:05 ldeffenb

We made an improvement in PR-4721 that we think could help solve this issue.

Using the neighbourhood radius to calculate a node's radius in that neighbourhood is not reliable. As described in the PR, comparing a node's radius to the neighbourhood radius could lead to issues, and could also open the door to attacks by malicious actors in that neighbourhood.

Also, regarding the peer radius comparison: it is written that way to allow an acceptable tolerance for peers.

martinconic avatar Jul 11 '24 20:07 martinconic