kubernetes-mesos

framework can fail to register because the mesos master thinks it has already completed

Open jdef opened this issue 10 years ago • 7 comments

The framework stores an ID in etcd that it can use for re-registration upon failover. During dev/testing I've experienced scenarios in which the master thinks the framework has completed, but etcd still has a framework ID stored -- this results in the framework attempting to register with that ID and the master rejecting it.

To remove the ID from etcd (and get a new one from the master):

$ curl http://$servicehost:4001/v1/keys/mesos/k8sm/frameworkid -XDELETE

I've encountered this error at times that have surprised me, so there may be something else buggy going on.

jdef, Mar 10 '15 15:03

I've also added code in the scheduler service that deletes the framework ID from etcd when failoverTimeout == 0

jdef, Mar 11 '15 15:03

probably related to a bug in the pure bindings:

  • https://github.com/mesos/mesos-go/issues/109

jdef, Apr 09 '15 22:04

see https://github.com/mesosphere/etcd-mesos/pull/1/files#diff-9667cfa1ad6b8b445794f7c2469f069fR342

jdef, Jun 18 '15 12:06

this complicates testing when using HA etcd because reinstalling the framework on DCOS after a prior uninstall breaks horribly

jdef, Jan 29 '16 01:01

perhaps when we detect this condition we can pause the scheduler for a bit instead of flapping quickly/constantly

jdef, Jan 29 '16 02:01

perhaps we should store the frameworkID in ZK (or at least have the option to) since on DCOS we have an exhibitor UI that makes it simpler to delete an old frameworkID after package uninstall.

jdef, Jan 30 '16 16:01

with the merge of changes related to #759 we now support storing framework-id in ZK, which can ease the uninstall pain since users can navigate the exhibitor UI directly and delete the k8sm key

jdef, Feb 16 '16 18:02