linstor-server
Loss of connection on overloaded nodes
Hi, I already mentioned this problem in https://github.com/LINBIT/linstor-server/issues/141#issuecomment-661453358. We have many nodes in a single cluster, and sometimes some of them get overloaded.
They flap between the Online and OFFLINE states, and some of them may stay OFFLINE until linstor-controller is restarted. This causes odd problems like https://github.com/LINBIT/linstor-server/issues/186 and https://github.com/piraeusdatastore/linstor-csi/issues/89
After a linstor-controller restart, all the nodes come back Online
and stay in this state for a while.
Same issue here. k8s 1.18.9 drbd 9.0.25
There are known problems in the reconnector that prevent the controller from reconnecting automatically. Once the affected parts are redesigned and reimplemented, it should be able to reliably reconnect automatically. However, that will not solve the flapping on overloaded systems. If a system is so overloaded that it cannot answer requests in time, it is considered lost, which causes the connection to be dropped.
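To illustrate why an overloaded node flaps, here is a minimal sketch of timeout-based liveness detection. It is not LINSTOR's actual code: the class name `PeerLiveness`, the method names, and the 5-second timeout are assumptions made up for this example.

```python
import time


class PeerLiveness:
    """Toy liveness tracker: a peer counts as lost once it misses its reply deadline."""

    def __init__(self, timeout_s=5.0, clock=time.monotonic):
        self.timeout_s = timeout_s  # hypothetical value, not a LINSTOR setting
        self.clock = clock          # injectable clock, handy for testing
        self.last_reply = clock()

    def on_reply(self):
        # Record the time of the latest answer from the peer.
        self.last_reply = self.clock()

    def state(self):
        # A late answer and no answer look the same once the deadline has passed.
        if self.clock() - self.last_reply > self.timeout_s:
            return "OFFLINE"
        return "Online"
```

With a scheme like this, a node that answers just slower than the deadline is reported OFFLINE and flips back to Online on its next timely reply, which matches the flapping described in this issue.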
Today we ran into this problem again: many resources were blocked because they were trying to reach a node that was marked as OFFLINE:
╭────────────────────────────────────────────────────────╮
┊ Node ┊ NodeType ┊ Addresses ┊ State ┊
╞════════════════════════════════════════════════════════╡
┊ m8c24 ┊ SATELLITE ┊ 10.36.129.114:3367 (SSL) ┊ OFFLINE ┊
╰────────────────────────────────────────────────────────╯
╭──────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-9423-disk-0 ┊ m11c7 ┊ 55986 ┊ Unused ┊ Ok ┊ UpToDate ┊ ┊
┊ one-vm-9423-disk-0 ┊ m13c8 ┊ 55986 ┊ Unused ┊ Connecting(m8c24) ┊ Diskless ┊ 2020-11-13 08:03:35 ┊
┊ one-vm-9423-disk-0 ┊ m8c24 ┊ 55986 ┊ ┊ ┊ Unknown ┊ ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-9217-disk-0 ┊ m15c19 ┊ 55799 ┊ Unused ┊ Connecting(m8c24) ┊ Diskless ┊ 2020-11-13 08:01:47 ┊
┊ one-vm-9217-disk-0 ┊ m15c22 ┊ 55799 ┊ Unused ┊ Ok ┊ UpToDate ┊ ┊
┊ one-vm-9217-disk-0 ┊ m8c24 ┊ 55799 ┊ ┊ ┊ Unknown ┊ ┊
┊ one-vm-9217-disk-0 ┊ m8c9 ┊ 55799 ┊ Unused ┊ Ok ┊ Diskless ┊ 2020-11-13 07:20:26 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯
This node had just been rebooted.
ErrorReports.tar.gz linstor-controller.log linstor-satellite.log
The last message in the satellite log:
07:46:51.973 [SSLNetComService] ERROR LINSTOR/Satellite - SYSTEM - Unhandled IllegalStateException [Report number 5FAE330B-93455-000176]
That was me testing the connection with:
telnet 10.36.129.114 3367
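The same reachability test can be scripted. This is a minimal sketch using only the Python standard library; the function name `port_reachable` and the 3-second default timeout are assumptions for illustration.

```python
import socket


def port_reachable(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout_s."""
    try:
        # Open a plain TCP connection and close it again immediately.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

For the node above, the call would be `port_reachable("10.36.129.114", 3367)`. Like telnet, this opens a plain TCP connection without a TLS handshake, so it may leave a similar error in the satellite log when the port expects SSL.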
Restarting linstor-controller brought the node back online, but not for long; now it is OFFLINE
again.
ErrorReports2.tar.gz linstor-controller2.log linstor-satellite2.log
If I restart just the satellite, I see that linstor-controller does not even try to reconnect to it:
Ah, my bad, it seems that in the last two cases we really did have connectivity issues with the node.
Today we had exactly the same situation: the node ran out of RAM, and after the reboot the controller didn't try to reconnect to the node until the controller itself was restarted.
╭───────────────────────────────────────────────────────╮
┊ Node ┊ NodeType ┊ Addresses ┊ State ┊
╞═══════════════════════════════════════════════════════╡
┊ m7c29 ┊ SATELLITE ┊ 10.36.129.74:3367 (SSL) ┊ OFFLINE ┊
╰───────────────────────────────────────────────────────╯
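Until the reconnector is fixed, OFFLINE nodes can at least be detected from a script. This is a minimal sketch that parses the text table exactly as printed above; the function name `offline_nodes` is hypothetical, and it assumes the first `┊`-delimited row is the header and contains `Node` and `State` columns.

```python
def offline_nodes(table_text):
    """Return the names of nodes whose State column reads OFFLINE."""
    header = None
    offline = []
    for line in table_text.splitlines():
        if "┊" not in line:
            continue  # skip the border lines of the table
        # Drop the outer delimiters, then split the row into trimmed cells.
        cells = [c.strip() for c in line.strip().strip("┊").split("┊")]
        if header is None:
            header = cells  # first delimited row holds the column names
            continue
        row = dict(zip(header, cells))
        if row.get("State") == "OFFLINE":
            offline.append(row.get("Node", ""))
    return offline
```

Fed the table above, it returns `['m7c29']`. A monitoring job could alert on a non-empty result instead of waiting for blocked resources to reveal the problem.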
@kvaps have you tried setting --kube-reserved on the kubelet? It is supposed to help avoid overloading the node.
@AntonSmolkov Yes, it might solve the problem with resources on the node, but it will not solve the fact that the controller does not try to reconnect to disconnected satellites.
It looks like a controller issue. I see the following picture quite often: many nodes change their state to 'Connected', and only a controller restart makes it work again.