
Ezml 1809 upgrade notifications on cluster/role levels

Open Kosta91 opened this issue 2 years ago • 1 comment

Kosta91 avatar Sep 22 '22 11:09 Kosta91

This is really good! The only thing I'm still dubious about here is the error case in RunConfigScript when a pod member is down. As the code currently stands, the pod will lose that notification forever in the case of role- and cluster-level notifications.

For those role/cluster notifications we may need to make use of the "notification queueing" mechanism to ensure that those notifications don't get lost. I.e. instead of immediately sending them to the pod, we can append them to the member's PendingNotifyCmds. You can grep the code to see how those are handled, but the gist is that currently notifications are placed on this queue using queueNotify in members.go, and then the queued notifies are processed by syncMemberNotifies in members.go. syncMemberNotifies is called on every cluster reconciliation pass, so if a particular notification doesn't get successfully taken off the queue (because a member is down) it will get retried later.
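
A minimal sketch of that queue-and-drain pattern, using simplified stand-in types (the real queueNotify/syncMemberNotifies in members.go work against the cluster CR status and are more involved):

```go
package main

import "fmt"

// Simplified stand-ins for the per-member status kept in the cluster CR.
type NotificationDesc struct {
	Arguments []string
}

type MemberStateDetail struct {
	PendingNotifyCmds []*NotificationDesc
}

type Member struct {
	Name        string
	Running     bool
	StateDetail MemberStateDetail
}

// Queue a notification instead of running the configure script right away.
func queueUpgradeNotify(m *Member, args []string) {
	m.StateDetail.PendingNotifyCmds = append(
		m.StateDetail.PendingNotifyCmds,
		&NotificationDesc{Arguments: args},
	)
}

// Drain the queue on each reconciliation pass. Anything that can't be
// delivered (member down) stays queued and is retried on the next pass.
func syncNotifies(m *Member) {
	var remaining []*NotificationDesc
	for _, notify := range m.StateDetail.PendingNotifyCmds {
		if !m.Running {
			remaining = append(remaining, notify)
			continue
		}
		// In the real operator this would exec the app's configure script
		// in the pod (RunConfigScript) and keep the notify on failure.
		fmt.Printf("notifying %s: %v\n", m.Name, notify.Arguments)
	}
	m.StateDetail.PendingNotifyCmds = remaining
}

func main() {
	m := &Member{Name: "controller-0", Running: false}
	queueUpgradeNotify(m, []string{"--upgrade", "cluster"})

	syncNotifies(m)                                    // member down: notify stays queued
	fmt.Println(len(m.StateDetail.PendingNotifyCmds))  // 1

	m.Running = true
	syncNotifies(m)                                    // retried and delivered on a later pass
	fmt.Println(len(m.StateDetail.PendingNotifyCmds))  // 0
}
```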

A few considerations if we use that mechanism:

  • Currently queueNotify assumes that it is only used for addnodes/delnodes events. To make it useful for other kinds of events it would need to be modified (or, quite possibly, a different similar function created -- whichever is cleaner).
  • syncMemberNotifies currently takes care of advancing LastSetupGeneration to match LastConfigDataGeneration when it sends a member a notify, because up until now it has only been dealing with notifies that are reactions to cluster reconfigurations. In this case that should be a NOP, i.e. LastSetupGeneration == LastConfigDataGeneration already, so it's probably fine.
  • I'm wondering how this affects the state cleanup in handleClusterUpgrade. It does seem like we should not finish the upgrade/rollback (i.e. removing the app reference and clearing the upgradeinfo) until we know all the notifications have been sent, i.e. all the notify queues are empty; a rough check is sketched after this list. Otherwise someone could start reconfiguring the cluster while there are still upgrade notifies that need to be sent.
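
Continuing the stand-in types from the sketch above, the gating check from that last bullet might look roughly like this (hypothetical names; the real cleanup lives in handleClusterUpgrade):

```go
// Only consider the upgrade/rollback finished once every member's notify
// queue has drained.
func allNotifyQueuesEmpty(members []*Member) bool {
	for _, m := range members {
		if len(m.StateDetail.PendingNotifyCmds) != 0 {
			return false
		}
	}
	return true
}

func maybeFinishUpgrade(members []*Member) {
	if !allNotifyQueuesEmpty(members) {
		// Upgrade notifies are still pending; leave the app reference and
		// upgradeinfo in place so a new reconfigure can't start under them.
		return
	}
	// Safe to remove the old app reference and clear the upgradeinfo here.
}
```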

Let me know your thoughts about all that.

joel-bluedata avatar Sep 22 '22 21:09 joel-bluedata

You mentioned that queued notifies weren't getting sent.

I think you're being bitten by this problem: https://www.darrenlester.com/blog/go-range-gotcha

This happens at the end of updateRoleUpgradeStatus, for example, and in handleClusterUpgrade. Specifically, when you get "member" from the range and then modify some field of member, you're modifying a copy rather than the original member object.
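
The gotcha in miniature (illustrative types, not the actual KubeDirector structs):

```go
package main

import "fmt"

type memberStatus struct {
	notified bool
}

func main() {
	members := []memberStatus{{}, {}}

	// Gotcha: "m" is a copy of each slice element, so this write is lost.
	for _, m := range members {
		m.notified = true
	}
	fmt.Println(members) // [{false} {false}]

	// Fix: index into the slice (or iterate over pointers) so the write
	// lands on the original element.
	for i := range members {
		members[i].notified = true
	}
	fmt.Println(members) // [{true} {true}]
}
```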

joel-bluedata avatar Oct 27 '22 18:10 joel-bluedata

Sure, I already got it.

Kosta91 avatar Oct 31 '22 12:10 Kosta91

You still have that loop at the end of QueueNotify though. I think the start of this discussion was how to avoid doing that loop by passing in the member state detail as an argument.
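
Reusing the stand-in types from the earlier sketch, the shape of that change might look something like this (hypothetical signatures, not the actual ones in members.go):

```go
// Before (illustrative): the caller passes a name and QueueNotify loops
// over the members to find the matching state detail.
func QueueNotifyByName(members []*Member, name string, notify *NotificationDesc) {
	for _, m := range members {
		if m.Name == name {
			m.StateDetail.PendingNotifyCmds = append(
				m.StateDetail.PendingNotifyCmds, notify)
			return
		}
	}
}

// After (illustrative): the caller already holds the member's state detail,
// so pass it in directly and skip the loop.
func QueueNotify(state *MemberStateDetail, notify *NotificationDesc) {
	state.PendingNotifyCmds = append(state.PendingNotifyCmds, notify)
}
```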

joel-bluedata avatar Nov 18 '22 00:11 joel-bluedata

@joel-bluedata, fixed.

Kosta91 avatar Nov 28 '22 22:11 Kosta91