bluechi icon indicating copy to clipboard operation
bluechi copied to clipboard

Failed to connect to bluechi-controller due to error "Connection reset by peer".

Open dofmind opened this issue 11 months ago • 2 comments

Describe the bug

Sometimes, when bluechi-agent disconnects from bluechi-controller and bluechi-agent tries to connect to another bluechi-controller, bluechi-agent may not be able to register to bluechi-controller.

In the case, bluechi-controller prints the following error:

Failed to start peer bus server: Connection reset by peer

Also, when this happens, there will be a lot of CLOSE_WAIT state on port 842 of the host of the newly connected bluechi-controller.

root@vc-s32g2:~# netstat -anv | grep 842             
tcp        0      1 192.168.16.114:36072    192.168.16.114:842      SYN_SENT   
tcp6    4097      0 :::842                  :::*                    LISTEN     
tcp6      30      0 192.168.16.114:842      192.168.16.105:51876    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.106:46946    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.103:34736    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.111:53606    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.114:54694    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.103:34426    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.106:50000    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.106:36524    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.106:33140    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.111:52734    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.106:51806    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.103:34468    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.105:51866    CLOSE_WAIT 
...

To Reproduce

It's not easy to reproduce because this case occurs rarely.

Prerequisite

  • Bluechi version is 0.9.0.
  • There are many hosts (about 3 ~ 7) on local network. All hosts are running bluechi-controller and bluechi-agent.

Step

  1. The bluechi-agent of all hosts have connected to bluechi-controller of one host.
  2. Shutdown bluechi-controller of the connected host.```systemctl stop bluechi-controller
systemctl stop bluechi-controller
  1. Change controller address of bluechi-agent of all hosts to another host.
dbus-send --system --dest=org.eclipse.bluechi.Agent --print-reply --type=method_call /org/eclipse/bluechi org.eclipse.bluechi.Agent.SwitchController string:'tcp:host=192.168.16.111,port=842'
  1. Make sure that the bluechi-agent on all hosts is connected to the bluechi-controller on another host.
  2. Repeat the above steps until this error case happens.

Logs when the error case occurs,

  • Host A cannot connect to bluechi-controller of another host.
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Disconnected from controller
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 1)
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 2)
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 3)
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:31 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:31 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 4)
Feb 03 11:02:31 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Disconnected from controller
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.111,port=842
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 5)
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.111,port=842
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 6)
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.111,port=842
Feb 03 11:02:33 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:33 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 7)
Feb 03 11:02:33 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.111,port=842
...
  • Host B can connect to bluechi-controller of new host.
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Disconnected from controller
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 1)
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 2)
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 3)
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 4)
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 5)
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:32 hpc-ad-qc bluechi-agent[1173]: Disconnected from controller
Feb 03 11:02:32 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.111,port=842
Feb 03 11:02:32 hpc-ad-qc bluechi-agent[1173]: Register call response received
Feb 03 11:02:32 hpc-ad-qc bluechi-agent[1173]: Connected to controller as 'hpc-ad-qc2'
  • Host B bluechi-controller doesn't permit that bluechi-agent of host A connects
root@hpc-ad-qc:~# journalctl -f -u bluechi-controller
Jul 01 09:00:03 hpc-ad-qc systemd[1]: Started BlueChi Controller systemd service.
Jul 01 09:00:03 hpc-ad-qc bluechi-controller[998]: Starting bluechi-controller 0.9.0-1
Feb 03 11:02:32 hpc-ad-qc bluechi-controller[998]: Registered managed node from fd 10 as 'hpc-ad-qc2'
Feb 03 11:02:32 hpc-ad-qc bluechi-controller[998]: Failed to start peer bus server: Connection reset by peer

Expected behavior

When bluechi-agent disconnects from bluechi-controller and bluechi-agent tries to connect to another bluechi-controller, the bluechi-agent of all hosts should be connected to bluechi-controller of other host without any error.

dofmind avatar Feb 04 '25 01:02 dofmind

Hi @dofmind,

Based on your description it seems that the bluechi-agent is actively disconnecting due to the heartbeat timeout. The smaller the heartbeat values (interval and threshold), the more brittle the system gets in terms of timing. Could you post the used configuration of the bluechi-controller and bluechi-agent when this happened? I'll try to reproduce the issue.

It also seems that we don't clean up the TCP socket properly in this edge case. I'll have a closer look at this.

engelmi avatar Feb 11 '25 09:02 engelmi

I used HeartbeatInterval and NodeHeartbeatThreshold configuration for the controller and agent.

[bluechi-controller]         
HeartbeatInterval=500            
NodeHeartbeatThreshold=2500

[bluechi-agent]
HeartbeatInterval=500
ControllerHeartbeatThreshold=2500

After reverting the following commit as a workaround, this issue has not occurred yet.

  • The commit 71be67d4cb15 ("Make Agent Register request asynchronous)

dofmind avatar Feb 13 '25 06:02 dofmind