
Consider a way to automatically re-enable black hole hosts

Open mforsyth opened this issue 9 years ago • 6 comments

It would be nice if manual intervention weren't always required to bring a host back out of its black hole status and get it re-added to the whitelist.

Need to think about a specific strategy for this.

mforsyth avatar Feb 24 '16 21:02 mforsyth

@cmilloy You are well situated to think of how this could work in a way that would be most helpful. Let's use this comment stream as a sounding board for proposals/ideas. @wkf I know you suggested that blackhole exile could simply be temporary; after a configurable period of time, we start sending a black hole host tasks again as a trial run.

mforsyth avatar Feb 25 '16 19:02 mforsyth

I'm not sold on this. I think if a host is blackholed, it should be disabled until someone, or something, addresses the issues. We don't want to overload satellite with concerns.


tnn1t1s avatar Feb 25 '16 20:02 tnn1t1s

After socializing this internally we have come up with a few ideas for implementation to start:

  • As you noted, being able to set an expiration when a host is black holed would be one. It would also be nice to have some way to increment the black hole duration upon recurrence.
  • Adding a test/canary job that can still target the de-whitelisted hosts and will re-whitelist if the job is successful
  • Allow some arbitrary command to be run on the slave after de-whitelisting the host. The command could try to remediate obvious problems, reboot the host, and then re-whitelist after reboot (with a configurable number of retries). Perhaps this is decoupled from de-whitelisting and takes the form of a completely separate comet each node runs to see if it has been de-whitelisted?
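The first bullet (expiration with an incrementing duration on recurrence) amounts to exponential backoff on the exile period. A minimal sketch, in Python rather than Satellite's own Clojure, with all names and durations invented for illustration:

```python
# Hypothetical sketch of "expiration with incrementing duration":
# each time a host is black-holed again, its exile period doubles, up to a cap.
import time

BASE_EXILE_SECONDS = 300        # first offence: 5 minutes (illustrative)
MAX_EXILE_SECONDS = 6 * 3600    # never exile longer than 6 hours (illustrative)

class ExileTracker:
    """Tracks how long each host should stay de-whitelisted."""

    def __init__(self):
        self._strikes = {}   # host -> number of times black-holed
        self._until = {}     # host -> epoch seconds when exile expires

    def blackhole(self, host, now=None):
        """Record a black-hole event and return the exile duration."""
        now = time.time() if now is None else now
        strikes = self._strikes.get(host, 0) + 1
        self._strikes[host] = strikes
        duration = min(BASE_EXILE_SECONDS * 2 ** (strikes - 1), MAX_EXILE_SECONDS)
        self._until[host] = now + duration
        return duration

    def eligible_for_rewhitelist(self, host, now=None):
        """True once the host's exile period has expired."""
        now = time.time() if now is None else now
        return now >= self._until.get(host, 0)
```

A periodic task could then re-whitelist any host for which `eligible_for_rewhitelist` returns true, giving it the "trial run" described above.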

@tnn1t1s We can certainly discuss further. This primarily came from the prediction that some issues which cause mesos task failures will not originate inside the cluster (such as those caused by infrastructure). They will occur and be fixed independently of the support team(s) operating the cluster. The concern is that without automatic re-whitelisting such outages will cause unnecessary work and dependency on the support team(s) who are operating the cluster to resume service to users.

I think it makes sense for satellite to have a facility for automatic re-whitelisting which is configurable enough to apply to multiple use-cases. If we decide not to use it, that's OK too.

cmilloy avatar Feb 26 '16 21:02 cmilloy

@corey - I think black hole host detection only occurs on 'task-lost', not 'task-failed'. If that's correct, massive job failures due to downstream dependencies (e.g. my database is down) should not trigger this. Instead, it would catch events like 'this host can't start jobs' or 'this host has a broken mesos-agent'. If we're worried, the black hole host detector doesn't actually have to take the action of adding to the blacklist. For starters, we can just start notifying on this event.

That said, I'd like to keep this very simple and not overload satellite with concerns. A few of your suggestions require run-on-host semantics that don't exist in a Mesos world. We'd rather not try to invent that here.

As an alternative, I can imagine a configurable callback that is triggered on blacklist events (regardless of how, e.g. the black hole host detector or some other check). This callback could invoke arbitrary code, for example 'reboot host and remove from whitelist'.
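That callback idea could be sketched as follows. This is illustrative Python, not Satellite's actual API; the class, hook, and callback names are all invented:

```python
# Hypothetical sketch of a configurable callback fired on blacklist events,
# regardless of which check triggered the blacklisting.
import subprocess

class BlacklistEvents:
    """Fires registered callbacks whenever a host is blacklisted."""

    def __init__(self):
        self._callbacks = []

    def on_blacklist(self, callback):
        """Register an operator-supplied callback taking the host name."""
        self._callbacks.append(callback)

    def blacklisted(self, host, reason):
        for cb in self._callbacks:
            cb(host)

def reboot_host(host):
    # Example operator-supplied callback: kick off an out-of-band reboot.
    # Satellite itself stays unopinionated about what the callback does.
    subprocess.run(["ssh", host, "sudo", "reboot"], check=False)
```

The point of the design is that Satellite only exposes the hook; remediation logic (reboot, ticket filing, paging) stays with the operators.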

Maybe a process outside of Satellite can monitor the blacklist and try remedial action as per your suggestion, but I wouldn't want to add that to Satellite, and if it were my ops team, I wouldn't use it. This could be left as an opinionated effort by individual ops teams.


tnn1t1s avatar Feb 27 '16 05:02 tnn1t1s

@tnn1t1s the black hole detector does actually care about failed tasks (not lost tasks).

I really like the idea of, as a first step, just having the black hole detector alert admins, rather than removing the host from the whitelist. That has some advantages:

  1. It lets us introduce black hole detection in a way where it can only help, not harm, thus allowing the maintenance team (who are anxious about the idea of it possibly causing more work than it saves) to see when it kicks in before having to trust it to actually take hosts out of the rotation.
  2. It's very clear functionality, and allows us to avoid (for now) what promises to be a lengthy negotiation over the functionality of both this issue and https://github.com/twosigma/satellite/issues/57.
  3. I believe this will basically be a one-line config change, meaning that we don't have to divert resources from Cook right now.
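The alert-only rollout can be pictured as swapping the detector's action, which is why it should be such a small change. A sketch in Python (Satellite's real config is Clojure, and every name here is invented):

```python
# Hypothetical sketch of "alert only": the detector's action is pluggable,
# so switching from blacklisting to alerting is a one-line change.
alerts = []

def alert_admins(host):
    # Alert-only action: record/notify, leave the whitelist untouched.
    alerts.append(f"black hole detected on {host}; no host removed")

def remove_from_whitelist(host):
    # The "real" action being deferred; body elided in this sketch.
    raise NotImplementedError

# The one-line change: point the detector at the alert action.
BLACK_HOLE_ACTION = alert_admins   # was: remove_from_whitelist

def on_black_hole(host):
    BLACK_HOLE_ACTION(host)
```

Running in this mode collects evidence of when detection fires without ever taking a host out of rotation, which is exactly the trust-building step described above.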

mforsyth avatar Feb 27 '16 11:02 mforsyth

This seems like the way to go. We can alert and collect the data, building confidence in and understanding of the behavior before trying to design a solution.


tnn1t1s avatar Feb 27 '16 19:02 tnn1t1s