EPIC: create mechanism to update slaves with old k8sm executor

Open sttts opened this issue 10 years ago • 24 comments

As Dan, I want to upgrade a DCOS cluster without impacting any Kubernetes application SLAs, so that I can run tier-1 production containerized applications on DCOS/Kubernetes.

Idea:

  • the apiserver, scheduler and controller manager are restarted in a new version
  • the nodes keep running old executors (old = incompatible = different ExecutorId)
  • there shall be a way to do a rolling executor update (destructive for the pods running on that node)

Prior art:

  • https://github.com/kubernetes/kubernetes/blob/af4788116a6e854e1f944b3257658f1cb416450e/cluster/gce/upgrade.sh

Upgrade Orchestration Idea

  • create an ExecutorUpgradeController (euc) in the controller manager (a sketch follows this list)
  • euc will watch nodes which are marked "incompatible" by the scheduler
  • euc iterates over old executor nodes (using a ListWatch filtering by the "incompatible"-annotation):
    • for all pods which are owned by a rc:
      • if the rc is healthy (e.g. more than 90% of replicas are up), delete the pod
    • if no pods with an unhealthy rc are left, delete the remaining non-rc pods (possibly implement different aggressiveness modes: delete immediately, delete after a timeout measured from the "incompatible"-annotation timestamp, delete never)
  • the scheduler watches old nodes without pods and sends them a kamikaze message; 3 possibilities:
    • send debug handler HTTP message
    • send framework message (are they reliable?) – only possible by scheduler
    • use KillExecutor Mesos scheduler driver method (does it exist?) – only possible by scheduler
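
A minimal Go sketch of how one reconcile pass of such a controller could look, just to make the flow above concrete. The types (Node, Pod, RC) and the Cluster interface are hypothetical stand-ins for the ListWatch/delete calls against the apiserver, not the actual kubernetes-mesos code:

```go
// Sketch of the ExecutorUpgradeController (euc) reconcile pass.
package euc

import "fmt"

// Simplified, hypothetical views of the API objects the controller cares about.
type Node struct {
	Name        string
	Annotations map[string]string // carries the "incompatible" annotation + timestamp
}

type Pod struct {
	Name   string
	RCName string // empty when the pod is not owned by a replication controller
}

type RC struct {
	DesiredReplicas int
	ReadyReplicas   int
}

// Cluster stands in for the ListWatch/delete calls the real controller would
// issue against the apiserver (filtered by the "incompatible" annotation).
type Cluster interface {
	IncompatibleNodes() []Node // nodes annotated as incompatible by the scheduler
	PodsOnNode(node string) []Pod
	LookupRC(name string) (RC, bool)
	DeletePod(name string) error
}

// Reconcile drains rc-owned pods from incompatible nodes while their rc is
// healthy (>90% of replicas ready) and deletes leftover non-rc pods only once
// no rc-owned pods remain blocked. This is the "delete immediately" policy;
// the timeout/never policies would gate the second pass on the annotation
// timestamp instead.
func Reconcile(c Cluster) {
	for _, node := range c.IncompatibleNodes() {
		pods := c.PodsOnNode(node.Name)
		blocked := false

		for _, pod := range pods {
			if pod.RCName == "" {
				continue // non-rc pods are handled in the second pass
			}
			rc, ok := c.LookupRC(pod.RCName)
			if ok && float64(rc.ReadyReplicas) > 0.9*float64(rc.DesiredReplicas) {
				if err := c.DeletePod(pod.Name); err != nil {
					fmt.Printf("node %s: deleting pod %s failed: %v\n", node.Name, pod.Name, err)
				}
			} else {
				blocked = true // wait for the rc to recover; retry on a later pass
			}
		}

		if !blocked {
			for _, pod := range pods {
				if pod.RCName == "" {
					_ = c.DeletePod(pod.Name) // aggressive: delete non-rc pods right away
				}
			}
		}
	}
}
```

In the controller manager this pass would be re-run on every relevant node/pod/rc event and on periodic resync; since nothing but the annotations is used for coordination, the controller can simply run forever.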

TODOs:

  • [ ] scheduler: annotate nodes with old executors as "incompatible" (MVP)
    • [ ] mark incompatible executors with the key k8s.mesosphere.io/incompatible-executor and the timestamp of the annotation as the value (see the sketch after this list)
  • [ ] send a kamikaze message to "incompatible" executors without scheduled pods
  • [ ] investigate the 3 message mechanisms above (MVP)
  • [ ] implement chosen mechanism (MVP)
  • [ ] controller mgr: implement ExecutorUpgradeController
    • [x] implement aggressive delete policy without any advanced intelligence (MVP) https://github.com/mesosphere/kubernetes-mesos/issues/722
    • [ ] implement pod timeout
    • [ ] implement rc health check (MVP)
    • [ ] implement configurability for upgrade aggressiveness
  • [ ] make sure the ExecutorId is different for different versions (or maybe even binary hashes) (MVP)
  • [ ] document update procedure (MVP)
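
For the annotation MVP above, the marking itself could look roughly like this. The key and timestamp-as-value follow the TODO; MarkIncompatible mutating a plain map is a hypothetical stand-in for the node update/patch the scheduler would actually send to the apiserver:

```go
package scheduler

import "time"

// IncompatibleExecutorAnnotation marks a node whose running executor does not
// match the ExecutorId the current scheduler would launch.
const IncompatibleExecutorAnnotation = "k8s.mesosphere.io/incompatible-executor"

// MarkIncompatible records the detection time on the node's annotations.
// The timestamp is kept once set so that timeout-based delete policies have a
// stable reference point.
func MarkIncompatible(annotations map[string]string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	if _, already := annotations[IncompatibleExecutorAnnotation]; !already {
		annotations[IncompatibleExecutorAnnotation] = time.Now().UTC().Format(time.RFC3339)
	}
	return annotations
}
```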

Notes:

  • the node controller evacuates nodes which have no status update for some time. Check whether this evacuation is reusable.
  • compare the TEARDOWN command at http://mesos.apache.org/documentation/latest/scheduler-http-api/, which is called on dcos package uninstall according to Jose.
  • [ ] once we have a final plan for implementation, create a second ticket (linked to this one)

sttts avatar Jun 05 '15 14:06 sttts

xref #166

jdef avatar Nov 16 '15 22:11 jdef

FWIW joerg and kensipe were interested in talking about executor failover scenarios, which seems possibly related to this use case. May be worth setting up a meeting with them to discuss. We tried to set up something after tulum, but I was sick and the meeting was never rescheduled.

jdef avatar Nov 26 '15 15:11 jdef

Sounds interesting. Can you ping them again and invite me as well?

sttts avatar Nov 26 '15 15:11 sttts

On the other hand, I fear we will not see anything on the horizon early enough.

sttts avatar Nov 26 '15 15:11 sttts

@jdef please take a look at our "Upgrade Orchestration Idea"

sttts avatar Nov 27 '15 09:11 sttts

I like the idea. Some complexities:

  • command line arguments, environment variables, and static pod configuration. I'm assuming that pushing all that config state into etcd is a prerequisite for this?
  • what about non-static, non-replication-controlled, non-daemon-controlled pods? have we decided to just kill them? or do we abort and warn the user that they need to migrate their one-off pods to something more reliable? or do we allow the user to specify a "--force-upgrade" option that will trash all such pods?
  • what guarantees can we make that we won't leave the cluster in a "split" state, with parts of it upgraded and other parts not?

... let's flesh out this idea some more. maybe in a google doc or .md proposal?

jdef avatar Nov 27 '15 17:11 jdef

About your questions:

  • etcd config is not really a prerequisite. As long as the ExecutorId changes when that config changes, the controller is able to update a cluster. Of course, as long as we don't have dynamic config updates, this means that every config change might trigger executor restarts.

    We can extend the ExecutorId hash to include static pods for that matter (see the sketch after this list).

  • we kill those pods, yes. The assumption is that non-rc pods may be killed. We have nothing in our hands to restart them. If pods should be reliable, use a rc. That's the Kubernetes contract with the user, after all.

    Note though that we only kill those non-rc pods once all rc-pods have been killed. For the latter we check the health of the rc.

  • this state is not a bad thing. We want to notify the admin with events when nodes cannot be updated by killing their pods and the executor.
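
On the ExecutorId point, a rough sketch of how the ID could be derived from everything that affects executor compatibility: binary hash, flags, and (as suggested above) the static pod manifests. The function name, prefix, and exact input set are assumptions for illustration, not the current k8sm scheme:

```go
package executorid

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"strings"
)

// ExecutorID hashes all compatibility-relevant inputs into a stable ID.
// Any change in the binary, its flags, or the static pod manifests yields a
// new ID, which the scheduler then treats as incompatible with running
// executors.
func ExecutorID(binaryHash string, flags map[string]string, staticPodManifests []string) string {
	h := sha256.New()
	h.Write([]byte(binaryHash + "\n"))

	// Hash the flags in a deterministic order.
	keys := make([]string, 0, len(flags))
	for k := range flags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		h.Write([]byte(k + "=" + flags[k] + "\n"))
	}

	// Include static pod manifests so that changing them also rolls executors.
	manifests := append([]string(nil), staticPodManifests...)
	sort.Strings(manifests)
	h.Write([]byte(strings.Join(manifests, "\n")))

	return "k8sm-executor-" + hex.EncodeToString(h.Sum(nil))[:16]
}
```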

sttts avatar Nov 27 '15 17:11 sttts

Thinking about non-rc pods, one could leave them running and force the admin to decide. Depending on the environment this could even be configurable. We could even add some label/annotation to mark certain nodes/pods for that.

sttts avatar Nov 27 '15 17:11 sttts

Also note that the controller keeps running, i.e. keeps watching the nodes and pods with old executor ids. This means the admin can manually kill non-rc pods and the controller will just continue its work then. Feels pretty usable and easily explainable. The events are the main user interface for the admin to control this process.

sttts avatar Nov 27 '15 17:11 sttts

when does the upgrade controller shut down? or is it always running?

jdef avatar Nov 27 '15 19:11 jdef

There is no communication other than the annotations done by the scheduler. So the controller keeps running, watching the nodes for that annotation. No need to shut it down.

sttts avatar Nov 27 '15 23:11 sttts

In fact it's pretty similar to the node controller. The node controller watches nodes and checks the status updates. If there is none for some time, it will kill all the pods and eventually delete the node.

sttts avatar Nov 27 '15 23:11 sttts

configuration (command line flags) is a major source of potential incompatibility, xref #516

jdef avatar Jan 04 '16 14:01 jdef

@jdef how do you see the config objects in relation to this ticket?

As far as I can judge with my limited knowledge of the config object road ahead, some config object changes might trigger a kubelet restart, while others won't and are applied in-place. The latter don't lead to incompatibility. With our minion we could support kubelet/executor restarts I guess, without the need to mark the executor as incompatible. Does this make sense?

My naive assumption is therefore that the number of command line arguments will be greatly reduced at some point, giving much less reasons to mark executors as incompatible.

sttts avatar Jan 04 '16 14:01 sttts

I think that moving config items from command line into objects shifts the responsibility away from mesos master and to the framework (us) in terms of checking for specific things that would break compatibility between upgrades. The current "all flags must be the same" approach that mesos master enforces is less flexible but makes our job pretty easy: if we detect diff flags we know right away that there's incompat. Once the flags are mostly gone then our responsibilities grow: compare these two config maps (and software versions, and ...) and determine if there's incompat. We can make this as naive (for simplicity) or as smart (complex) as we want to. I think we should aim for simplicity first and then address more complicated cases as needs arise.
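
A deliberately naive first cut of that comparison could be as small as the sketch below; the flattened key/value representation (flags today, configmap data later) and the "any difference means incompatible" rule are assumptions, with smarter per-key rules layered on later as needed:

```go
package compat

// Incompatible reports whether two executor configurations, flattened to
// key/value maps, differ in any key or value. Naive on purpose: every
// difference is treated as an incompatibility that forces an executor roll.
func Incompatible(oldCfg, newCfg map[string]string) bool {
	if len(oldCfg) != len(newCfg) {
		return true
	}
	for k, v := range oldCfg {
		if nv, ok := newCfg[k]; !ok || nv != v {
			return true
		}
	}
	return false
}
```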

With respect to a kubelet restart trigger ... I haven't seen any code that does this yet, but I also haven't searched for it. Needless to say, we need to be aware of such shenanigans given how our minion controller would react to them.

jdef avatar Jan 04 '16 14:01 jdef

Have you seen the upstream plan for how the kubelet will react to config changes? Will it just shut down or will it try to apply changes in-place?

sttts avatar Jan 04 '16 14:01 sttts

I haven't seen anything like that proposed or implemented for the kubelet. There is a new 'runtime config' flag for the kubelet that takes a list of k=v pairs. Maybe someone is planning an external deployment manager process that will watch configmap changes and trigger kubelet restarts as appropriate?

jdef avatar Jan 04 '16 15:01 jdef

If we can read the config object from the scheduler and pass it down to the executor and kubelet, we would be able to determine compatibility. A look at the kubelet config object PR might tell us whether we can go this route.

sttts avatar Jan 04 '16 15:01 sttts

Hey folks - just curious, but what's the ETA for being able to upgrade k8s-mesos in place? Without it, it's essentially not possible to use it in production, right?

timperrett avatar Jan 21 '16 16:01 timperrett

Hard to provide an estimate for a delivery date at this point. Lots of other things going on. Conservative estimate? The upgrade feature probably isn't going to be ready before June 2016.

jdef avatar Jan 21 '16 17:01 jdef

Having some estimate is better than nothing, so I'll take the conservative one, thanks :-) I'd really like to run k8s on top of mesos, but clearing all the pods for an upgrade would cause a total platform outage, so without this I can't even begin to test it with production workloads etc, sadly.

timperrett avatar Jan 21 '16 17:01 timperrett

Just curious if there is a timeline on this. I realize this isn't a small task, but it is reducing my confidence in using k8sm in production. Has work been done on this? Or, even in the short term, is there a workaround that would allow me to upgrade the kubernetes scheduler (or change its command-line parameters) so it can still schedule pods/objects on mesos?

haneefkassam avatar Sep 28 '16 22:09 haneefkassam

To be clear: at no point was k8sm advertised as production ready. AFAIK no one is working on this yet; it's an opportunity in waiting.

jdef avatar Sep 28 '16 22:09 jdef

Ah, ok, I was not aware of that, but this does lean me away from using kubernetes and mesos together and investing in solutions involving the two working in concert.

I do appreciate the quick reply!

haneefkassam avatar Sep 29 '16 15:09 haneefkassam