cf-for-k8s
As an operator I would like my apps to stay online during Kubernetes upgrades
Is your feature request related to a problem? Please describe. During Kubernetes upgrades, nodes are drained, which shuts down the pods running CF applications before starting them up on a new node. This could mean an application goes up and down for the duration of the upgrade.
Describe the solution you'd like
Some way for developers who push apps to configure a PodDisruptionBudget so that their application can stay online. It should only be possible to configure this budget when there is more than 1 instance so that upgrades can complete.
At first I was thinking CF-for-k8s could just add a budget with minimum available of 1, but on larger deployments that see a lot of traffic, that would probably cause 503's if the service got hammered on a single replica.
You should only be able to configure this if your replica count is more than 1. Configuring a minimum available when you only have a single replica will mean that upgrades cannot happen at all.
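As a sketch, the kind of budget being described could look like the following (the name and selector label here are hypothetical, not what cf-for-k8s actually generates):

```yaml
apiVersion: policy/v1        # policy/v1beta1 on clusters before Kubernetes 1.21
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb                           # hypothetical name
spec:
  minAvailable: 1                            # keep at least one instance up during drains
  selector:
    matchLabels:
      cloudfoundry.org/app_guid: my-app-guid # hypothetical label
```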
Describe alternatives you've considered There are no alternatives. You can scale your CF apps and "hope for the best" that you won't have all of them unscheduled at the same time.
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/176243873
The labels on this github issue will be updated when the story is started.
Oh interesting, I'd expect that for production, HA (highly available) apps:
- the default number of instances (as set via CF cli) would be 2 [set by the CF user]
- and that the apps should already be fairly resilient to K8s cluster upgrades through standard K8s cluster draining and the nature of having 2+ app instances
@Birdrock @davewalter @acosta11 I'd be curious to hear y'all's input here.
I think your second point is mostly true. However, there are a few cases where apps would still go offline even with 2+ instances.
- Imagine I have a 5 node cluster in Azure with 2 instances of a production app and I need to upgrade Kubernetes. I want the upgrade to go quickly, so my AKS surge capacity is set to 2 nodes. If AKS chooses to drain the two nodes that contain my app, I will experience downtime, and Kubernetes gives me little choice or guarantee that the app will remain online during the drain.
- Imagine I have a 5 node cluster near capacity with 2 instances of a production app and I need to upgrade Kubernetes. Kubernetes makes no guarantees about pod scheduling but does a "best effort" to spread the distribution of pods. It is possible that a cluster near resource constraints, or with complex node affinities for other apps, will have the 2 instances created on the same node. Then, no matter my surge capacity, the Kubernetes upgrade may take my production app offline during draining.
You can see how these scenarios can happen no matter the number of instances I choose in CF, and they may be more likely depending on the surge capacity. For instance, scenario 1 is very unlikely if you were to create 5 instances (since all 5 instances would have to be scheduled on just 2 nodes), but it is still possible, and it would be a waste of resources just to achieve HA. The second scenario is not very likely for the majority of workloads, but the first is.
cf-for-k8s specifically makes this downtime more likely because it uses StatefulSets to define CF apps. This means that a pod must be completely terminated before another can be scheduled. In other deployments like ReplicaSets this is not the case. In a ReplicaSet if a pod is in a terminating state another is scheduled immediately.
The following has some useful documentation that details best practices around this:
- https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
- https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
Curious what others perhaps with more K8S experience think though.
@braunsonm Thank you for the explanation - that makes sense. I'll take a deeper look into Eirini and see how it matches against Kubernetes recommendations. My feeling is that Kubernetes already has the machinery to handle situations like this, but we aren't properly taking advantage of it.
No problem @Birdrock, I actually just made a small edit which might be relevant to your investigation of Eirini.
cf-for-k8s specifically makes this downtime more likely because it uses StatefulSets to define CF apps. This means that a pod must be completely terminated before another can be scheduled. In other deployments like ReplicaSets this is not the case. In a ReplicaSet if a pod is in a terminating state another is scheduled immediately.
Could we solve this without exposing PDBs to the end-user? I imagine that Eirini could automatically set up a PDB with a minAvailable of 1, or 50% of the desired instance count.
Absolutely agree on taking yet another look at the StatefulSet vs Deployment/ReplicaSet discussion. IIRC the only reason to make it StatefulSets was to allow for addressing individual instances. /cc @herrjulz and also cross-referencing https://github.com/cloudfoundry-incubator/eirini/issues/94
If we could deprecate routing to individual instances, I guess using Deployments could be a thing?
Big +1 from my side on @voelzmo's comment. I recall a not-so-recent call with @herrjulz where I believe I recall that indeed single instance routing was the primary reason to stick with StatefulSets.
That's true, if we deprecate routing to individual instances we could switch to Deployments instead of using Statefulsets. @bkrannich @voelzmo
@herrjulz I guess this is still parallel to the question of PDBs, isn't it?
@loewenstein yes, it is parallel. As PDBs are on a per-app basis, it could make sense for Eirini to create a PDB for every app that has more than one instance.
Could we solve this without exposing PDBs to the end-user?
50% could work but it makes assumptions about the load that an app can handle. I would prefer it being user configurable.
@braunsonm one thought I had was that this setting might rather be a setting of the foundation instead of the individual app. Like, if you run an app in production on 20 instances you'll probably have some reason and likely fail to keep it available if you drop below idk 15 instances. If you didn't, why would you run 20 instances in the first place.
This might be different for staging, QA, or playground systems though. In short, if you want app HA, you probably want min available >50%; if you don't want HA, you might be fine without any PDB, or at least with a different minimum percentage.
WDYT?
@loewenstein Hmm I'm confused by your reply. It makes perfect sense what you said but that's exactly why I was thinking it's better being an individual app setting vs foundational. Because I don't care about some app in my dev space going offline during an upgrade (I don't need a PDB), but my prod app I would want control over the PDB. For exactly the reason you said, some apps might be under different loads and need more available at anytime.
@braunsonm Good point. I was seeing dev foundation vs. prod foundation. With dev spaces and prod spaces in the same foundation, this is of course looking different.
I'd still prefer not to expose PDBs to app developers. They shouldn't need to know anything about Pods or the details of Kubernetes node updates. How's this handled with Diego, BTW?
@loewenstein Ah yeah, I still don't think that would be preferred. For instance, I might have an app with two instances but still not really worry about downtime (perhaps it's a consumer for a rabbit queue and not user-facing). I wouldn't want to make assumptions about what availability it needs just because some other app needs 75%.
I'd prefer not to expose PDBs either. Not sure how Diego handles this. The only thing I could think of would be a new manifest property for minAvailable or something that supports a number or percent like the PDB. Eirini then makes the PDB for us?
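For illustration, such a manifest property might look like the sketch below (the `minAvailable` key is hypothetical, not part of the CF app manifest schema today; the idea is that Eirini would translate it into a PDB):

```yaml
applications:
- name: my-prod-app
  instances: 4
  minAvailable: 75%   # hypothetical property; Eirini would turn this into a PodDisruptionBudget
```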
@loewenstein In Diego-land, an app developer doesn't need to specify anything more than instances: 2 in their app manifest (or cf push --instances 2). My understanding is that when a Diego cell needs to be shut down for maintenance/upgrades, if all the instances of the app are on that one cell, they will be terminated sequentially and started back up on a different cell (or cells), using the health-check defined for that app to know when the new instances have successfully migrated. Essentially, one instance of a given app is always guaranteed to be running as a result.
FYI, I'm a colleague of @braunsonm and a CF-for-VMs operator, in case you're wondering where I'm coming from. 😄 In our CF-for-VMs foundations I don't think we've ever seen an app suffer total failure during a CF upgrade if it was running 2+ instances; I think this kind of behaviour is ultimately the goal that @braunsonm (and I, by extension!) are looking to have with cf-for-k8s.
Reading earlier comments in this thread, I came here to say what @9numbernine9 has meanwhile said: I believe that a good (IMHO: the best) default is to do what Diego does because we'll see people moving over from CF-for-VMs not expecting a behavior change when it comes to CF app scaling. If we later on want to add additional flexibility by considering something like dev vs. prod foundations this is fine, but I'd advocate for keeping the status quo first.
Re-reading @9numbernine9's comment, I'm not sure if it is suggesting to keep the exact same Diego behavior or if the suggestion is to at least keep an app instance up-and-running to be able to serve requests. As mentioned above, my strong preference would be to retain Diego behavior.
As mentioned above, my strong preference would be to retain Diego behavior.
That's why I've added the question about Diego behavior. My guess would be Diego drain makes sure all apps are properly evacuated to other cells. Getting the exact behavior could get complicated, though.
Adding in @PlamenDoychev, both for visibility but also to add comments around Diego draining behavior in case they have not been covered here already.
@bkrannich Sorry, I should've expressed myself more clearly! I don't necessarily think that emulating Diego's draining behaviour exactly should be a requirement, but providing a behaviour that keeps a subset of app instances alive during an upgrade probably should be.
In my experience, Diego's behaviour seems quite reasonable to me. If an app is deployed with instances: 4 in the manifest (e.g.), it means that the person deploying it probably wants at least one instance at all times, but they might also never want more than 4 either. (Consider scenarios where the app is constrained by outside resources, e.g. a data store that might perform significantly worse if there were more than 4 instances of the app at any time, or a database connection provided by a cloud database provider that limits the maximum number of connections that could be used at any time.)
Personally, Diego's current drain behaviour makes existing infrastructure upgrades (e.g. a CF upgrade or upgrading all the stemcells) the kind of activity that can be done during working hours without disruption to clients, whereas if we didn't have this behaviour we would need to be scheduling upgrades during off-hours - and I hate working weekends. 😆
At least one instance could cause issues under load. Or are we discussing having a PDB with minAvailable = instances-1?
Depending on the ratio of app instances to k8s worker nodes, a PDB with minAvailable of instances-1 is likely to block worker node updates, I think. When the instance count is double the number of workers, an optimal spread would mean there's no worker that can be shut down.
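The arithmetic behind that concern can be sketched as follows (a simplified model of PDB eviction semantics; the function name is mine, not a Kubernetes API):

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """PodDisruptionBudget semantics: voluntary evictions are permitted
    only while the number of healthy pods exceeds minAvailable."""
    return max(0, healthy_pods - min_available)

# minAvailable = instances - 1 tolerates only one pod down at a time,
# so a node hosting several pods of the app can only be drained one
# eviction at a time, stalling until each replacement becomes healthy.
print(disruptions_allowed(healthy_pods=10, min_available=9))  # 1
print(disruptions_allowed(healthy_pods=9, min_available=9))   # 0
```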
@9numbernine9:
If an app is deployed with instances: 4 in the manifest (e.g.), it means that the person deploying it probably wants at least one instance at all times, but they might also never want more than 4 either.
So far, we have educated our users around the current Diego behavior which is that if they specify instances: 4 they express that they want 4 instances at least (and they will do the sizing of backing services accordingly). Our users (and this might be different for people who are not CF users today, have a K8s background and thus operate cf-for-k8s themselves or for people that both operate CF as well as their own apps) do not even know or care if an update is running - they care about their apps only and they expect updates to be a non-event for them. This includes the ability to keep processing the expected load they have forecasted.
As mentioned, I believe today's Diego behavior should be the default for cf-for-k8s as well (or at least there should be a system-wide option to retain this behavior) because we want people to upgrade to cf-for-k8s without too many changes (otherwise, why upgrade to cf-for-k8s and not alternatives for running stateless apps with buildpacks).
I think part of the discussion here is different, namely: from a coding perspective, what options does K8s offer to achieve one or the other behavior once we have settled on a default? But I'd suggest making this a second step, informed by answering the question of "what do our users want?"
@loewenstein @bkrannich @9numbernine9 @braunsonm I just discussed this topic with the team and we realised that we already set PDBs, but with minAvailable hardcoded to 1 when the instance count is >1.
We should find an appropriate default setting for the PDBs we create in Eirini + we should expose the parameter in Eirini to be configurable by the end-user.
@herrjulz @bkrannich
I agree that a default behaviour that matches Diego would be good, but I don't think we should stop there. As @herrjulz said, I would like to see it configurable per app. A guarantee of only a single instance being available would take production workloads offline because of the increased load.
Also interesting that you already default to a PDB keeping at least 1 instance! I didn't notice that before. In that case the default behaviour already is what Diego has, and this issue is more about improving on that, since currently only having a single instance available during an upgrade would result in downtime for higher-traffic apps.
With the current approach, we have one single setting for PDBs for every app. If we want to make the setting more individual (eg for every single app), the cloud controller would need to provide this information to Eirini such that Eirini can set the PDB for every app individually. This would require some work on the CC side.
Hi folks 👋, a bit off-topic but I think that to decide what's best it's good to better understand how Diego maintains zero downtime for apps during cf/infra updates. So I decided to put in a few words regarding it.
Note - To understand the details below it's good to have a basic understanding of how Diego works. It's enough to read the first few lines in this doc.
During an update bosh stops, updates and then starts all the jobs on a subset of the diego-cells. The stop procedure contains a drain step which starts an evacuation in the `rep` and waits until it finishes. Once triggered:
- the `rep` waits for all actual LRPs on the cell to get deleted.
- the `rep` marks the cell as evacuating so that the `auctioneer` won't schedule instances on it, and tells the `bbs` to evacuate the app instances on the cell.
- once an app instance is marked as evacuating, the `auctioneer` schedules a replica instance on another cell.
- for every instance, the `rep` polls the `bbs` to check whether a replica has already been created for it, and if there is one, the evacuating instance gets deleted.
- if the process times out, there is a force cleanup.
In short, the behaviour is to kill an evacuating app instance only after it is replicated on another diego-cell or replicating it times out. By doing so it is "guaranteed" that even single-instance apps are kept available during updates.
Hope this puts some insight into how Diego handles updates. 😄
Another note - Diego always tries to schedule instances of a single app across different cells to maintain HA.
PS0 - I'm not the author of the evacuation feature but I'm familiar with the code base since I had to research it once. PS1 - I only linked examples of evacuating running app instances since they're easier to grasp.
Hi all,
we started to work on exposing the Eirini PDB setting in the configmap and making the default (keep one instance up if the instance count is >1) configurable. While working on it we realised that this can lead to un-evictable pods in some cases. For example, if the default minAvailable in the PDB is changed to 5, all apps would need to be deployed with at least 5 instances.
Also, we were not able to come up with a use-case where it makes sense to change the current default for the PDBs Eirini creates. We think a per-app setting makes the most sense, but that would require an end-to-end change (CLI, cloud controller, Eirini) and a broader discussion. Part of that discussion would be how such an end-to-end change affects Diego.
Another thing to point out is that we have to deal with the fact that, depending on the backend (Diego or K8s), a user has more or fewer features available.
Hi everyone!
One thing I'd like to point out is that, actually, according to @IvanHristov98's comment, our current behaviour (minAvailable: 1) is not equivalent to Diego's, which never runs fewer instances than the ones requested by the app (but sometimes will run more, e.g. during evacuations).
In Kubernetes this would be equivalent to maxUnavailable: 0, which unfortunately doesn't seem to be supported:
If you set `maxUnavailable` to 0% or 0, or you set `minAvailable` to 100% or the number of replicas, you are requiring zero voluntary evictions. When you set zero voluntary evictions for a workload object such as ReplicaSet, then you cannot successfully drain a Node running one of those Pods. If you try to drain a Node where an unevictable Pod is running, the drain never completes. This is permitted as per the semantics of `PodDisruptionBudget`.
We're going to run a spike on this to see if there's a way to achieve Diego's behaviour in Kubernetes. If we can't find any, we might settle for `maxUnavailable: 1` for apps with more than 1 instance, and no PodDisruptionBudget at all for apps with one instance (very similar to what we do now, but using `maxUnavailable: 1` instead of `minAvailable: 1`).
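For illustration, the budget such a fallback might generate could look like the sketch below (the name and selector label are hypothetical; the actual labels Eirini applies may differ):

```yaml
apiVersion: policy/v1        # policy/v1beta1 on clusters before Kubernetes 1.21
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb                           # hypothetical name
spec:
  maxUnavailable: 1                          # at most one instance down at a time
  selector:
    matchLabels:
      cloudfoundry.org/app_guid: my-app-guid # hypothetical label
```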