
Gossip list not being cleared

Open · william-swarmbotics opened this issue 5 months ago · 3 comments

Describe the bug

On a Zenoh network doing peer-to-peer multicast scouting with gossip enabled, we experienced a sudden, rapid degradation in Zenoh's ability to communicate, to the point where it could generally not deliver messages at all. We discovered that the OAM message had become very large, tens of kilobytes, with thousands of peer IDs listed. This caused connections to drop: the OAM message was sent blocking, and if it failed to be delivered (or held up other blocking messages), Zenoh would close the connection. Our use case involves some long-running Zenoh sessions, and we believe the gossip subsystem was remembering every Zenoh peer the network had ever seen.

The only mechanism I see for peers to be removed from the gossip subsystem is gossip::Network::remove_link(), which can be called from close_face(), but if I understand correctly, that only removes directly connected peers, not peers heard indirectly. It seems that even with multihop disabled, peers still pass along indirectly heard IDs, just not their locators (based on propagate_locators()). This would explain why the OAM message became so large despite multihop being off.

Possible solutions would be to clear indirectly heard IDs, perhaps with a time-based expiration, or to not gossip anything about indirectly heard peers when multihop is disabled.
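
For illustration only, here is a minimal Rust sketch of the time-based expiration idea: a table of gossiped peer IDs that drops entries not refreshed within a TTL. The names (`GossipPeers`, `heard`, `expire`) are hypothetical and do not correspond to Zenoh's actual gossip::Network internals.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical sketch only: a gossip peer table whose entries expire
/// when they have not been refreshed within `ttl`. Names are illustrative
/// and do not match Zenoh's internal gossip::Network implementation.
struct GossipPeers {
    ttl: Duration,
    /// peer id -> last time it was heard (directly or via gossip)
    last_seen: HashMap<String, Instant>,
}

impl GossipPeers {
    fn new(ttl: Duration) -> Self {
        Self { ttl, last_seen: HashMap::new() }
    }

    /// Record (or refresh) a peer id whenever it shows up in an OAM message.
    fn heard(&mut self, zid: &str) {
        self.last_seen.insert(zid.to_owned(), Instant::now());
    }

    /// Drop entries that were not refreshed within the TTL, so a
    /// long-running session does not remember every peer it ever saw.
    fn expire(&mut self) {
        let now = Instant::now();
        let ttl = self.ttl;
        self.last_seen.retain(|_, seen| now.duration_since(*seen) < ttl);
    }

    /// Number of ids that would currently be advertised in gossip.
    fn len(&self) -> usize {
        self.last_seen.len()
    }
}

fn main() {
    let mut peers = GossipPeers::new(Duration::from_secs(60));
    peers.heard("a1b2c3");
    peers.expire();
    assert_eq!(peers.len(), 1); // still fresh, so it is kept
}
```

Restricting the expiration to indirectly heard IDs would leave directly connected peers to the existing remove_link() path.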

To reproduce

  1. Create two Zenoh nodes, A and B, doing peer-to-peer multicast discovery with gossip enabled and multihop disabled (see the configuration sketch below). Their IDs should be random rather than fixed.
  2. Alternate between restarting A and B, allowing discovery to succeed after each restart.
  3. Monitor the number of peer IDs in the OAM message. It should grow indefinitely.
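
For reference, a minimal configuration sketch for each node, assuming the Zenoh 1.x Rust API (`Config::insert_json5`) and the default config key paths; the function name `repro_config` is just illustrative:

```rust
use zenoh::Config;

// Build the repro configuration for each node (A and B).
// Assumes the Zenoh 1.x Rust API; adjust key paths to your version.
fn repro_config() -> Config {
    let mut config = Config::default();
    config.insert_json5("mode", r#""peer""#).unwrap();
    // Peer-to-peer multicast discovery...
    config.insert_json5("scouting/multicast/enabled", "true").unwrap();
    // ...with gossip enabled but multihop disabled.
    config.insert_json5("scouting/gossip/enabled", "true").unwrap();
    config.insert_json5("scouting/gossip/multihop", "false").unwrap();
    // Do not pin an id; each restart then gets a fresh random Zenoh ID.
    config
}
```

Opening a session on each node with something like `zenoh::open(repro_config()).await`, then restarting A and B in turn, should show the gossiped peer-ID list growing across restarts.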

System info

  • Ubuntu 22.04
  • Zenoh 1.4.0

william-swarmbotics commented on Aug 11 '25 at 21:08

I am only able to reproduce this with linkstate peers (i.e. with routing.peer.mode = "linkstate") and not with peer-to-peer peers. Does this correspond to your use case?
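
(For reference, and assuming the same `insert_json5` API as in the repro sketch above, switching the peers to linkstate routing would look something like this; the function name is illustrative.)

```rust
use zenoh::Config;

// Hypothetical snippet: switch the peers from the default
// peer-to-peer routing to linkstate routing.
fn linkstate_config() -> Config {
    let mut config = Config::default();
    config.insert_json5("mode", r#""peer""#).unwrap();
    config.insert_json5("routing/peer/mode", r#""linkstate""#).unwrap();
    config
}
```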

fuzzypixelz commented on Aug 14 '25 at 12:08

We had routing.peer.mode left at default, which I believe means it was peer-to-peer. I don't actually know what the difference is or why it would only reproduce for linkstate.

We may have occasionally had a router active, if that could have contributed. Most of the time it was just peers though. Heads up, I wrote the "To reproduce" steps as the simplest way to reproduce based on my understanding of the issue, but I did not have time to verify that they worked. We only saw this issue in our larger system.

william-swarmbotics commented on Aug 21 '25 at 20:08

I will try to reproduce with a router in the background. Let us know if you identify a minimal reproducible example.

fuzzypixelz commented on Aug 22 '25 at 09:08