Singularity

Document inactivate/freeze/decommission procedures

Open bmerry opened this issue 5 years ago • 2 comments

I may just not be using things right, but I can't find any documentation that explains the differences between inactivating, freezing and decommissioning a host.

If I go through the following steps:

  1. Mark a host inactive (via POST /api/inactive).
  2. Stop the mesos-agent on it.
  3. Start a new instance of mesos-agent on it (I'm using a Docker container to run Mesos, so I think it gets a new slave ID, but I'm not 100% sure).
  4. Mark the host active again (via DELETE /api/inactive).

Then the slave remains in the decommissioned state and won't run any tasks.
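For reference, steps 1 and 4 above can be sketched roughly as follows. This is only a sketch: the base URL and hostname are placeholders, and passing the hostname as a `host` query parameter is an assumption about the API rather than something stated in this thread. The function only builds the request so the method and URL can be inspected before anything is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL; replace with the address of your own Singularity instance.
BASE_URL = "http://singularity.example.com/singularity"


def inactive_request(host: str, active: bool) -> Request:
    """Build (but do not send) the request that marks a host inactive or active.

    Assumption: the hostname is passed as a `host` query parameter.
    """
    url = f"{BASE_URL}/api/inactive?{urlencode({'host': host})}"
    # POST adds the host to the inactive list; DELETE removes it again.
    return Request(url, method="DELETE" if active else "POST")


if __name__ == "__main__":
    for active in (False, True):
        req = inactive_request("agent1.example.com", active)
        print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`, plus whatever authentication the deployment needs.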

My goal is to be able to prevent new tasks running on a slave (so that once existing tasks die we can reboot/do maintenance on it - we use only on-demand tasks with finite lifetime), and later allow tasks to run on it again (possibly after doing maintenance on it). I've been using "inactive" rather than "freeze" because the former API works on hostnames, which means it can be set even if the mesos-agent isn't running at the time. But let me know what you advise for that.

bmerry, Jul 20 '20 14:07

So, inactive was something we created to deal with some EC2 impairment cases. We would frequently have cases where a host went impaired, came back, went impaired again, and kept cycling like that. The inactive marker was meant to make it so that anything coming in with that hostname is automatically marked as decommissioned, to keep tasks from being launched on an impaired/cycling host like that. Reactivating here essentially just removes the host from a 'blocked' list of hosts.

Other definitions:

  • Freeze - don't launch new tasks on a host, but leave any that are already running alone
  • Decommission - don't launch new tasks on a host, and also move any that are currently running on the host elsewhere
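As a rough illustration of how these three concepts relate, here is a toy model. It is not Singularity's actual implementation; the names `HostState`, `Behaviour` and `effective_state` are made up for this sketch, and it only encodes the behaviour described in this comment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Set


class HostState(Enum):
    ACTIVE = "active"
    FROZEN = "frozen"
    DECOMMISSIONED = "decommissioned"


@dataclass(frozen=True)
class Behaviour:
    launches_new_tasks: bool   # may the scheduler place new tasks here?
    keeps_running_tasks: bool  # are already-running tasks left alone?


BEHAVIOUR = {
    HostState.ACTIVE: Behaviour(launches_new_tasks=True, keeps_running_tasks=True),
    HostState.FROZEN: Behaviour(launches_new_tasks=False, keeps_running_tasks=True),
    HostState.DECOMMISSIONED: Behaviour(launches_new_tasks=False, keeps_running_tasks=False),
}


def effective_state(hostname: str, inactive_hosts: Set[str], state: HostState) -> HostState:
    """Any agent arriving with a hostname on the inactive list is treated as decommissioned."""
    if hostname in inactive_hosts:
        return HostState.DECOMMISSIONED
    return state


if __name__ == "__main__":
    inactive = {"flaky-host"}
    for host in ("flaky-host", "healthy-host"):
        state = effective_state(host, inactive, HostState.ACTIVE)
        print(host, state.value, BEHAVIOUR[state])
```

The point of the model is that inactive is keyed on hostname and applied when an agent joins, while freeze and decommission are states of an individual agent.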

If you just use decommission, since it is done by slave ID, the new agent coming into the cluster with a new ID will be in the active state. To clean up any that are in that inactive + decommissioned state you mentioned, you can remove them from the inactive list first, then 'reactivate' in the UI. We can update the docs to make this clearer.
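The ordering of that cleanup could be sketched like this. The two callbacks are hypothetical stand-ins for whatever actually performs each step (an API call, or a manual action in the UI), so nothing here assumes a particular endpoint.

```python
from typing import Callable, Iterable, List


def clean_up_host(
    host: str,
    stuck_slave_ids: Iterable[str],
    remove_from_inactive: Callable[[str], None],
    reactivate: Callable[[str], None],
) -> List[str]:
    """Run the cleanup in the order described above and return a log of the steps."""
    log: List[str] = []
    # Step 1: take the hostname off the inactive list first, so that agents
    # joining under this hostname are no longer automatically decommissioned.
    remove_from_inactive(host)
    log.append(f"removed {host} from inactive list")
    # Step 2: reactivate each agent that is stuck in the decommissioned state.
    for slave_id in stuck_slave_ids:
        reactivate(slave_id)
        log.append(f"reactivated {slave_id}")
    return log


if __name__ == "__main__":
    # Stub callbacks that do nothing, just to show the order of operations.
    steps = clean_up_host(
        "agent1.example.com", ["slave-a", "slave-b"],
        remove_from_inactive=lambda h: None,
        reactivate=lambda s: None,
    )
    print("\n".join(steps))
```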

ssalinas, Jul 20 '20 14:07

Thanks for the quick response. I've updated the title to indicate that docs should be improved, rather than anything necessarily changed.

> To clean up any that are in that inactive + decommissioned state you mentioned, can remove them from inactive list first, then 'reactivate' in the UI.

I'll give that a try (with the API, since I'm writing a command-line tool).

bmerry, Jul 20 '20 17:07