antrea
Agent fails to re-connect to OVS
Describe the bug
Hi,
We noticed an issue in Antrea: if OVS is running under high CPU load, Antrea Agent may fail to re-connect to OVS after a disconnection.
After some offline debugging, I think this is caused by the following sequence of events:
1. OVS closes the first connection because an echo request or echo reply message is not received on the unix domain socket.
2. Agent (ofnet) receives the disconnection event and initiates the 2nd connection.
3. OVS may actively close the 2nd connection if it fails to receive the Hello message (Agent tries to send it, but the message has not yet been consumed from the connection).
4. Since Agent (ofnet) is blocked sending the message, it misses the 2nd disconnection event, so Agent does not initiate a 3rd connection.
The following entries in ovs-vswitchd.log show these events:
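The sequence above can be modelled in a few lines of Go. This is only an illustrative sketch, not the ofnet code: `sendOrAbort`, `reconnectLoop`, and `simulate` are hypothetical names, and the "connection" is simulated with channels and timers. It shows one way a sender could avoid missing a disconnection: run the possibly blocking send in its own goroutine and wait on either the send result or the disconnect signal, so that the caller can always start the next attempt.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDisconnected is a hypothetical error used only in this sketch.
var errDisconnected = errors.New("peer disconnected during send")

// sendOrAbort runs a possibly blocking send in its own goroutine and returns
// early if the disconnect signal fires first.
func sendOrAbort(send func() error, disconnected <-chan struct{}) error {
	done := make(chan error, 1) // buffered so the goroutine can finish after an abort
	go func() { done <- send() }()
	select {
	case err := <-done:
		return err
	case <-disconnected:
		return errDisconnected
	}
}

// reconnectLoop retries attempt until it succeeds or maxAttempts is reached.
// It returns the number of the last attempt made and its error.
func reconnectLoop(attempt func(n int) error, maxAttempts int) (int, error) {
	var err error
	for n := 1; n <= maxAttempts; n++ {
		if err = attempt(n); err == nil {
			return n, nil
		}
	}
	return maxAttempts, err
}

// simulate models one failed attempt followed by a successful one.
func simulate() (int, error) {
	attempt := func(n int) error {
		disconnected := make(chan struct{})
		send := func() error { return nil } // Hello is consumed immediately
		if n == 1 {
			// First attempt: the peer does not read the Hello in time,
			// then drops the connection after 10ms.
			send = func() error { time.Sleep(time.Second); return nil }
			time.AfterFunc(10*time.Millisecond, func() { close(disconnected) })
		}
		return sendOrAbort(send, disconnected)
	}
	return reconnectLoop(attempt, 3)
}

func main() {
	n, err := simulate()
	fmt.Printf("finished on attempt %d, err=%v\n", n, err)
}
```

Note that in this sketch the goroutine running an aborted send keeps running until the send itself returns, so in practice it would need to be paired with a send timeout or with closing the underlying connection.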
2022-07-17T16:52:52.175Z|09261|rconn|ERR|br-int<->unix#2: no response to inactivity probe after 60 seconds, disconnecting
2022-07-17T16:53:37.429Z|09262|rconn|INFO|br-int<->unix#3: connection timed out
To Reproduce
The OVS logs were collected from a setup with high CPU load. I cannot reproduce such a high OVS CPU load on a K8s Node. However, the same OVS logs can be produced by hacking the ofnet code so that ofnet does not send echo request/reply messages on the first connection and does not send the Hello message on the second connection.
The messages are sent over the unix domain socket, and the existing ofnet code does not set any timeout on sends. So we can assume that OVS did not receive the messages because ofnet did not succeed in sending them.
Expected
Agent is expected to re-connect to OVS whenever it finds that the connection is broken.
Actual behavior
Agent does not re-connect to OVS, even after OVS eventually recovers and works normally.
Versions:
Antrea: v0.13+
Additional context