rabbitmq-server
`rabbitmq-upgrade drain` exits with exit code 69 when it shouldn't
We keep seeing `FailedPreStopHook` events when 3.9 LRE nodes get rolled, and we were able to track them down to the `rabbitmq-upgrade drain` command, specifically:
Exec lifecycle hook ([/bin/bash -c if [ ! -z "$(cat /etc/pod-info/skipPreStopChecks)" ]; then exit 0; fi; rabbitmq-upgrade await_online_quorum_plus_one -t 604800; rabbitmq-upgrade await_online_synchronized_mirror -t 604800; rabbitmq-upgrade drain -t 604800]) for Container "rabbitmq" in Pod "rabbitmq-server-2_lre-3-9(338b4baf-0dea-4260-94a3-11a4dd2c613a)" failed - error: command '/bin/bash -c if [ ! -z "$(cat /etc/pod-info/skipPreStopChecks)" ]; then exit 0; fi; rabbitmq-upgrade await_online_quorum_plus_one -t 604800; rabbitmq-upgrade await_online_synchronized_mirror -t 604800; rabbitmq-upgrade drain -t 604800' exited with 69:
15:16:59.517 [warn] This node is being put into maintenance (drain) mode
15:16:59.521 [warn] Suspended all listeners and will no longer accept client connections
Error:
{:noproc, {:gen_server, :call, [#PID<11735.5319.0>, {:shutdown, 'Node was put into maintenance mode'}, :infinity]}}
We are not sure why this happens, but even running drain on a completely healthy node results in the same exit code (69), although the underlying error is different (`:channel_termination_timeout` instead of `:noproc`):
rabbitmq@rabbitmq-server-2:/$ rabbitmq-upgrade drain ; echo $?
Will put node [email protected] into maintenance mode. The node will no longer serve any client traffic!
15:35:00.320 [warn] This node is being put into maintenance (drain) mode
15:35:00.323 [warn] Suspended all listeners and will no longer accept client connections
Error:
{:channel_termination_timeout, {:gen_server, :call, [#PID<11702.1403.0>, {:shutdown, 'Node was put into maintenance mode'}, :infinity]}}
69
Running the same command the second time seems to work correctly:
rabbitmq@rabbitmq-server-2:/$ rabbitmq-upgrade drain ; echo $?
Will put node [email protected] into maintenance mode. The node will no longer serve any client traffic!
15:35:37.564 [warn] This node is being put into maintenance (drain) mode
15:35:37.565 [warn] Suspended all listeners and will no longer accept client connections
15:35:39.500 [warn] Closed 4 local client connections
0
I am not sure whether there is an issue with the drain command itself, but this command clearly triggers failures while stopping a node, which defeats its purpose: it is meant to stop nodes gracefully and reliably.
I think we should handle situations like the ones above better, because otherwise automation cannot rely on this command to stop nodes gracefully.
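As a stopgap rather than a fix, and only because the second run succeeded in the session above, automation could re-run the drain when it exits non-zero. The sketch below is a minimal illustration of that idea: `retry` is a hypothetical helper, not part of RabbitMQ or of any existing preStop hook, and the attempt count and delay are assumptions. It has not been verified that retrying is safe for the `:noproc` case.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RabbitMQ): re-run a command until it
# exits 0 or the attempt budget is used up.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1" delay_seconds="$2"
  shift 2
  local attempt=1 status=0
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    else
      # In the else branch, $? still holds the exit status of the command.
      status=$?
    fi
    echo "attempt ${attempt}/${max_attempts} exited with ${status}: $*" >&2
    attempt=$((attempt + 1))
    # Sleep only if another attempt will follow.
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay_seconds"
    fi
  done
  # Propagate the exit status of the last failed attempt (e.g. 69).
  return "$status"
}

# Assumed usage, with the drain invocation taken from the hook above:
#   retry 3 5 rabbitmq-upgrade drain -t 604800
```

With this, a single transient failure like the one shown above would no longer fail the hook, while a persistent failure still surfaces the original exit code.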
Originally posted by @gerhard in https://github.com/rabbitmq/opportunities/issues/99