cloud-provider

kube-controller-manager -> cloud-controller-manager HA migration: KEP + alpha implementation

Open andrewsykim opened this issue 6 years ago • 14 comments

We need a KEP outlining how we intend to migrate existing clusters from using the kube-controller-manager to the cloud-controller-manager for the cloud provider specific parts of Kubernetes.

At KubeCon NA 2018, we discussed grouping the existing cloud controllers under a single leader election that is shared by the kube-controller-manager and the cloud-controller-manager. For single-node control planes this is not needed, but for HA control planes we need a mechanism to ensure that no more than one kube-controller-manager or cloud-controller-manager is running the set of cloud controllers in a cluster.
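
To make the shared-election idea concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the Kubernetes leader-election API: `Lease`, `try_acquire`, `cloud_controller_runners`, the 15-second TTL, and the instance names are all hypothetical and chosen only for illustration. It assumes both binaries contend for one lease that guards the whole set of cloud controllers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Lease:
    """Toy in-memory stand-in for a single shared leader-election lock (hypothetical)."""
    holder: Optional[str] = None
    expires_at: float = 0.0

    def try_acquire(self, candidate: str, now: float, ttl: float = 15.0) -> bool:
        """Acquire or renew the lease; fail while another holder's lease is still live."""
        if self.holder not in (None, candidate) and now < self.expires_at:
            return False
        self.holder = candidate
        self.expires_at = now + ttl
        return True


def cloud_controller_runners(lease: Lease, candidates: List[str], now: float) -> List[str]:
    """Return the candidates allowed to run the cloud controllers at time `now`."""
    return [c for c in candidates if lease.try_acquire(c, now)]
```

Because every kube-controller-manager and cloud-controller-manager instance contends for the same lease, at most one of them can be returned for any given `now`, regardless of which binary it is.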

andrewsykim avatar Feb 19 '19 20:02 andrewsykim

/assign @mcrute

@mcrute is working on the initial design for this.

andrewsykim avatar Feb 19 '19 20:02 andrewsykim

cc @cheftako

andrewsykim avatar Feb 19 '19 20:02 andrewsykim

Here's a first draft. There's plenty more to be done but getting this out there for discussion.

mcrute avatar Mar 01 '19 18:03 mcrute

Thanks @mcrute!

andrewsykim avatar Mar 01 '19 18:03 andrewsykim

Thanks @mcrute! As part of this, I would like us to also discuss how we can do a better job of running controllers in HA environments. Currently we do not utilize HA well. If we could get rid of killing the process when leader election is lost, then we could get much better utilization in HA. The problem has been that controllers tend to kick off goroutines (and similar asynchronous processing), and controller actions may not be idempotent. So we end up with mutations from something other than the main controller thread which did not get shut down (or at least did not get shut down in a timely manner). One thought for this could be to attach an election token (or similar) to mutations. If the mutator is no longer leader, then the write is refused and the mutator is notified that it is no longer the leader (and should stop). While I believe this is more than we need for the KCM->CCM migration, I would like us to consider it as where we are going. It would be good for us to make sure we are generally heading in that direction.
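
A minimal sketch of the election-token idea, in Python for brevity. This is a toy in-memory model and not an existing Kubernetes or API server mechanism: `FencedStore`, `elect`, `write`, and `StaleLeaderError` are hypothetical names, and it assumes the store itself knows the token of the current leader.

```python
class StaleLeaderError(Exception):
    """Raised when a write carries a token from a past leadership term."""


class FencedStore:
    """Toy store that refuses mutations from anyone but the current leader."""

    def __init__(self) -> None:
        self.current_token = 0
        self.data = {}

    def elect(self) -> int:
        """Start a new leadership term and return its token."""
        self.current_token += 1
        return self.current_token

    def write(self, token: int, key: str, value: str) -> None:
        """Apply the mutation only if the caller still holds the current token."""
        if token != self.current_token:
            # The refusal doubles as the notification that the caller should stop.
            raise StaleLeaderError(
                f"token {token} is stale; current token is {self.current_token}"
            )
        self.data[key] = value
```

With this shape, a goroutine left over from a previous leader does not have to be killed promptly: its next write fails with `StaleLeaderError`, which both blocks the mutation and tells it to stop.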

cheftako avatar Mar 14 '19 18:03 cheftako

/milestone v1.15
/priority critical-urgent

andrewsykim avatar Mar 20 '19 20:03 andrewsykim

/assign

andrewsykim avatar Apr 03 '19 19:04 andrewsykim

This is going to slip into the next release since we couldn't get the KEP reviewed in time for the KEP deadline. Further discussion is happening in https://github.com/kubernetes/enhancements/pull/979 and https://github.com/kubernetes/kubernetes/pull/77878; we are hoping to have an implementable KEP in time for v1.16.

andrewsykim avatar May 15 '19 21:05 andrewsykim

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot avatar Sep 10 '19 20:09 fejta-bot

/remove-lifecycle stale

cheftako avatar Sep 10 '19 22:09 cheftako

/assign @yastij

andrewsykim avatar Oct 02 '19 20:10 andrewsykim

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot avatar Dec 31 '19 21:12 fejta-bot

/remove-lifecycle stale

cheftako avatar Jan 02 '20 23:01 cheftako

/lifecycle frozen

cheftako avatar Jan 02 '20 23:01 cheftako