Failed to connect to bluechi-controller due to error "Connection reset by peer".
Describe the bug
Sometimes, when bluechi-agent disconnects from bluechi-controller and bluechi-agent tries to connect to another bluechi-controller, bluechi-agent may not be able to register to bluechi-controller.
In the case, bluechi-controller prints the following error:
Failed to start peer bus server: Connection reset by peer
Also, when this happens, there will be a lot of CLOSE_WAIT state on port 842 of the host of the newly connected bluechi-controller.
root@vc-s32g2:~# netstat -anv | grep 842
tcp 0 1 192.168.16.114:36072 192.168.16.114:842 SYN_SENT
tcp6 4097 0 :::842 :::* LISTEN
tcp6 30 0 192.168.16.114:842 192.168.16.105:51876 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.106:46946 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.103:34736 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.111:53606 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.114:54694 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.103:34426 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.106:50000 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.106:36524 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.106:33140 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.111:52734 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.106:51806 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.103:34468 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.105:51866 CLOSE_WAIT
...
To Reproduce
It's not easy to reproduce because this case occurs rarely.
Prerequisite
- Bluechi version is 0.9.0.
- There are many hosts (about 3 ~ 7) on local network. All hosts are running bluechi-controller and bluechi-agent.
Step
- The bluechi-agent of all hosts have connected to bluechi-controller of one host.
- Shutdown bluechi-controller of the connected host.```systemctl stop bluechi-controller
systemctl stop bluechi-controller
- Change controller address of bluechi-agent of all hosts to another host.
dbus-send --system --dest=org.eclipse.bluechi.Agent --print-reply --type=method_call /org/eclipse/bluechi org.eclipse.bluechi.Agent.SwitchController string:'tcp:host=192.168.16.111,port=842'
- Make sure that the bluechi-agent on all hosts is connected to the bluechi-controller on another host.
- Repeat the above steps until this error case happens.
Logs when the error case occurs,
- Host A cannot connect to bluechi-controller of another host.
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Disconnected from controller
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 1)
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 2)
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 3)
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:31 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:31 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 4)
Feb 03 11:02:31 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Disconnected from controller
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.111,port=842
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 5)
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.111,port=842
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 6)
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.111,port=842
Feb 03 11:02:33 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:33 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 7)
Feb 03 11:02:33 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.111,port=842
...
- Host B can connect to bluechi-controller of new host.
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Disconnected from controller
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 1)
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 2)
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 3)
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 4)
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 5)
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:32 hpc-ad-qc bluechi-agent[1173]: Disconnected from controller
Feb 03 11:02:32 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.111,port=842
Feb 03 11:02:32 hpc-ad-qc bluechi-agent[1173]: Register call response received
Feb 03 11:02:32 hpc-ad-qc bluechi-agent[1173]: Connected to controller as 'hpc-ad-qc2'
- Host B bluechi-controller doesn't permit that bluechi-agent of host A connects
root@hpc-ad-qc:~# journalctl -f -u bluechi-controller
Jul 01 09:00:03 hpc-ad-qc systemd[1]: Started BlueChi Controller systemd service.
Jul 01 09:00:03 hpc-ad-qc bluechi-controller[998]: Starting bluechi-controller 0.9.0-1
Feb 03 11:02:32 hpc-ad-qc bluechi-controller[998]: Registered managed node from fd 10 as 'hpc-ad-qc2'
Feb 03 11:02:32 hpc-ad-qc bluechi-controller[998]: Failed to start peer bus server: Connection reset by peer
Expected behavior
When bluechi-agent disconnects from bluechi-controller and bluechi-agent tries to connect to another bluechi-controller, the bluechi-agent of all hosts should be connected to bluechi-controller of other host without any error.
Hi @dofmind,
Based on your description it seems that the bluechi-agent is actively disconnecting due to the heartbeat timeout. The smaller the heartbeat values (interval and threshold), the more brittle the system gets in terms of timing. Could you post the used configuration of the bluechi-controller and bluechi-agent when this happened? I'll try to reproduce the issue.
It also seems that we don't clean up the TCP socket properly in this edge case. I'll have a closer look at this.
I used HeartbeatInterval and NodeHeartbeatThreshold configuration for the controller and agent.
[bluechi-controller]
HeartbeatInterval=500
NodeHeartbeatThreshold=2500
[bluechi-agent]
HeartbeatInterval=500
ControllerHeartbeatThreshold=2500
After reverting the following commit as a workaround, this issue has not occurred yet.
- The commit 71be67d4cb15 ("Make Agent Register request asynchronous)