piraeus-operator

Automatic restore of evicted nodes

Open kvaps opened this issue 3 years ago • 7 comments

If a node goes offline for 2 hours and then comes back, it stays EVICTED even after linstor-satellite has started successfully. Shouldn't piraeus-operator recover such nodes automatically?

kvaps avatar Apr 13 '22 11:04 kvaps

Yeah, you are right. I think restore got added a little later after the initial evict, so this was just missed.

WanzenBug avatar Apr 14 '22 14:04 WanzenBug

Automatic restoration still does not work for me. @WanzenBug, did you succeed in testing this?

kvaps avatar May 11 '22 11:05 kvaps

Yes, I did test that successfully. But let me try again....

WanzenBug avatar May 11 '22 13:05 WanzenBug

@WanzenBug sorry to disturb you, have you checked this yet? We have some users complaining about this issue.

kvaps avatar Jul 11 '22 09:07 kvaps

Sorry, I haven't checked it yet. Feel free to ping me again if I don't respond by next week.

WanzenBug avatar Jul 14 '22 07:07 WanzenBug

@WanzenBug

Hello, I set up a test environment and started dropping the network (turning the interface off and on for 30s). I got strange behaviour: the node does not return to Online by itself:

╭──────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node                                   ┊ NodeType   ┊ Addresses                ┊ State   ┊
╞══════════════════════════════════════════════════════════════════════════════════════════╡
┊ node-drbd-1                            ┊ SATELLITE  ┊ 192.168.0.151:3367 (SSL) ┊ Online  ┊
┊ node-drbd-2                            ┊ SATELLITE  ┊ 192.168.0.152:3367 (SSL) ┊ OFFLINE ┊
┊ node-drbd-3                            ┊ SATELLITE  ┊ 192.168.0.153:3367 (SSL) ┊ Online  ┊
┊ piraeus-cs-controller-5cf89dd5d4-dlktz ┊ CONTROLLER ┊ 10.42.2.13:3367 (SSL)    ┊ OFFLINE ┊
┊ piraeus-cs-controller-5cf89dd5d4-lwzrk ┊ CONTROLLER ┊ 10.42.0.21:3367 (SSL)    ┊ Online  ┊
╰──────────────────────────────────────────────────────────────────────────────────────────╯

Last message on the controller:

07:51:14.930 [SslConnector] INFO  LINSTOR/Controller - SYSTEM - Remote satellite peer /192.168.0.152:3367 has closed the connection.

The node is alive and the satellite is running on it. It goes back online if a reconnect is triggered manually.
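To make the manual workaround concrete, here is a minimal Python sketch that reads a node table like the one above and suggests a command per affected satellite. It is an illustration only, not what piraeus-operator does. It assumes the LINSTOR client exposes the reconnect and restore actions mentioned in this thread as `linstor node reconnect <node>` and `linstor node restore <node>`, and that an evicted node shows `EVICTED` in the State column. The helper names `parse_node_table` and `suggest_commands` are hypothetical.

```python
# Column separator used by the table output pasted in this thread.
SEP = "┊"

# Table copied from the comment above, used as sample input.
SAMPLE = """
╭──────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node                                   ┊ NodeType   ┊ Addresses                ┊ State   ┊
╞══════════════════════════════════════════════════════════════════════════════════════════╡
┊ node-drbd-1                            ┊ SATELLITE  ┊ 192.168.0.151:3367 (SSL) ┊ Online  ┊
┊ node-drbd-2                            ┊ SATELLITE  ┊ 192.168.0.152:3367 (SSL) ┊ OFFLINE ┊
┊ node-drbd-3                            ┊ SATELLITE  ┊ 192.168.0.153:3367 (SSL) ┊ Online  ┊
┊ piraeus-cs-controller-5cf89dd5d4-dlktz ┊ CONTROLLER ┊ 10.42.2.13:3367 (SSL)    ┊ OFFLINE ┊
┊ piraeus-cs-controller-5cf89dd5d4-lwzrk ┊ CONTROLLER ┊ 10.42.0.21:3367 (SSL)    ┊ Online  ┊
╰──────────────────────────────────────────────────────────────────────────────────────────╯
"""


def parse_node_table(text):
    """Turn the table text into a list of dicts keyed by the header cells."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(SEP):
            continue  # skip border lines and blank lines
        cells = [cell.strip() for cell in line.strip(SEP).split(SEP)]
        if header is None:
            header = cells  # first data-like line is the header row
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def suggest_commands(rows):
    """Suggest one command per satellite that is not online.

    Assumption: OFFLINE satellites need a reconnect and EVICTED
    satellites need a restore. Controller rows are ignored.
    """
    commands = []
    for row in rows:
        if row["NodeType"] != "SATELLITE":
            continue
        state = row["State"].upper()
        if state.startswith("OFFLINE"):
            commands.append(["linstor", "node", "reconnect", row["Node"]])
        elif state.startswith("EVICTED"):
            commands.append(["linstor", "node", "restore", row["Node"]])
    return commands


if __name__ == "__main__":
    # Only print the suggestions; running them is left to the operator.
    for command in suggest_commands(parse_node_table(SAMPLE)):
        print(" ".join(command))
```

The sketch only prints suggestions instead of executing them, since a restore should happen only once the satellite is confirmed healthy.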

sergeimonakhov avatar Jul 20 '22 08:07 sergeimonakhov

@D1abloRUS This does sound like an issue that was recently fixed in LINSTOR, where the controller "forgot" to reconnect if the connection was dropped during the initial handshake. See https://github.com/LINBIT/linstor-server/commit/bbc27839ad69025b437696cd28598f1db3b80e77

The fixed version was just released; we just need to release the new images for piraeus.

WanzenBug avatar Jul 20 '22 08:07 WanzenBug