broker
broker copied to clipboard
Better diagnostics when connecting to incompatible peers
Thanks to @ckreibich for bringing this up.
It seems like the new connector does not distinguish between recoverable errors (remote side does not exist) and fatal errors (incompatible peer / version) when trying to connect to a peer and always tries to reconnect for the configured time interval.
@ckreibich also had the idea to put in a different error code. Currently, an incompatible peer (e.g. pre-ALM-merge Zeek) uses the peer_unavailable error code. Using a new error code like peer_incompatible or peer_unsupported would provide a bit more context to the user. However, currently these codes directly map to Zeek events. If we don't want to touch that as a result, an alternative could be not touch the error codes but at least provide a better error description. Currently, Broker says "unable to connect to remote peer" while it actually did connect but found an incompatible peer.
Yeah, better diagnostics make sense.
But I think I wouldn't change the reconnect behavior: in some settings, it could be difficult to restart both sides at the same time. Say, for external clients: maybe the client needs a new Zeek, and Zeek does eventually get updated; if the client weren't trying to reconnect regularly, it would need a manual restart after the Zeek upgrade (maybe mood eventually with WebSocket, but still).
This all sounds good to me, thanks folks. I was just wondering what to do about the zeek.py test and I think we can simply make it a bit more precise: if you pass a 0 sec retry interval to Broker::peer() there is no retry, and I noticed that the Broker::error event does trigger for the one failed attempt. So if we add a handler for this event and make it check the error condition, we can exit the zeek.py test then. (It does eventually time out atm, but that's via cmake, not in the test itself.)