kubernetes-mesos framework can fail to register because the mesos master thinks it has already completed

The framework stores an ID in etcd that it can use for re-registration upon failover. During dev/testing I've experienced scenarios in which the master thinks the framework has completed, but etcd still has a framework ID stored -- this results in the framework attempting to register with that ID and the master rejecting it.

To remove the ID from etcd (and get a new one from the master):

$ curl http://$servicehost:4001/v1/keys/mesos/k8sm/frameworkid -XDELETE

I've encountered this error at times that have surprised me, so there may be something else buggy going on.

Mar 10 '15 15:03 jdef

I've also added code in the scheduler service that deletes the framework ID from etcd when failoverTimeout == 0

Mar 11 '15 15:03 jdef

probably related to a bug in the pure bindings:

https://github.com/mesos/mesos-go/issues/109

Apr 09 '15 22:04 jdef

see https://github.com/mesosphere/etcd-mesos/pull/1/files#diff-9667cfa1ad6b8b445794f7c2469f069fR342

Jun 18 '15 12:06 jdef

this complicates testing when using HA etcd because reinstalling the framework on DCOS after a prior uninstall breaks horribly

Jan 29 '16 01:01 jdef

perhaps when we detect this condition we can pause the scheduler for a bit instead of flapping quickly/constantly

Jan 29 '16 02:01 jdef

perhaps we should store the frameworkID in ZK (or at least have the option to) since on DCOS we have an exhibitor UI that makes it simpler to delete an old frameworkID after package uninstall.

Jan 30 '16 16:01 jdef

with the merge of the changes related to #759 we now support storing framework-id in ZK, which can ease the uninstall pain since users can navigate the exhibitor UI directly and delete the k8sm key

Feb 16 '16 18:02 jdef