framework can fail to register because the Mesos master thinks it has already completed
The framework stores an ID in etcd that it can use for re-registration upon failover. During dev/testing I've experienced scenarios in which the master thinks the framework has completed, but etcd still has a framework ID stored -- this results in the framework attempting to register with that ID and the master rejecting it.
To remove the ID from etcd (and get a new one from the master):
$ curl http://$servicehost:4001/v1/keys/mesos/k8sm/frameworkid -XDELETE
I've encountered this error at times that have surprised me, so there may be something else buggy going on.
I've also added code in the scheduler service that deletes the framework ID from etcd when failoverTimeout == 0.
probably related to a bug in the pure bindings:
- https://github.com/mesos/mesos-go/issues/109
see https://github.com/mesosphere/etcd-mesos/pull/1/files#diff-9667cfa1ad6b8b445794f7c2469f069fR342
this complicates testing when using HA etcd because reinstalling the framework on DCOS after a prior uninstall breaks horribly
perhaps when we detect this condition we can pause the scheduler for a bit instead of flapping quickly/constantly
perhaps we should store the frameworkID in ZK (or at least have the option to) since on DCOS we have an exhibitor UI that makes it simpler to delete an old frameworkID after package uninstall.
with the merges of changes related to #759 we now support storing framework-id in ZK, which can ease the uninstall pain since users can navigate the exhibitor UI directly and delete the k8sm key