have the autoscaler alert if it needs more capacity than the max count

Open Rob-Johnson opened this issue 8 years ago • 4 comments

If the autoscaler would normally have scaled up the instance count but couldn't because of the max_instances limit, it would be good to send a Sensu event to tell the user about it. Likewise, if it would have scaled down but couldn't because of min_instances, do the same thing, but maybe only ticket rather than page?
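To make the proposal concrete, here is a minimal sketch of the check, not PaaSTA's actual autoscaler code. `ClampResult`, `clamp_instances`, and `alert_action` are hypothetical names made up for illustration; the page-on-max, ticket-on-min mapping is just the suggestion above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClampResult:
    """Outcome of applying the configured bounds (hypothetical helper type)."""
    instances: int            # instance count that is actually applied
    limit_hit: Optional[str]  # "max_instances", "min_instances", or None


def clamp_instances(desired: int, min_instances: int, max_instances: int) -> ClampResult:
    """Clamp the autoscaler's desired count and record which bound, if any, blocked it."""
    if desired > max_instances:
        return ClampResult(max_instances, "max_instances")
    if desired < min_instances:
        return ClampResult(min_instances, "min_instances")
    return ClampResult(desired, None)


def alert_action(limit_hit: Optional[str]) -> Optional[str]:
    """Map the bound that was hit to the proposed notification type (assumed policy)."""
    return {"max_instances": "page", "min_instances": "ticket"}.get(limit_hit)
```

For example, with min 3 and max 10, a desired count of 12 is clamped to 10 and would map to a page, while a desired count of 1 is clamped to 3 and would only ticket.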

Rob-Johnson avatar May 06 '16 13:05 Rob-Johnson

This sounds like a good idea, but I'm not sure this is always an error condition. For example, if I set min: 3, then I really never want it to go below 3, even if the scaler wants to. I wouldn't want it constantly reminding me of that decision.

Likewise with max, it could be that I really don't want it to go past 10 workers during the day, because that is all I want it to scale up to, for who knows why. (max ri limit, sauceconnect licenses, simply limiting costs, etc.)

I would hope that other alerts will fire first. I understand in real life people may be expecting paasta to notify them if their decisions don't match reality.

solarkennedy avatar May 06 '16 19:05 solarkennedy

It's true that this won't always be an error condition, but I wonder if the 'norm' here would just be people not having the right limits set. How about a long realert_every so that it's not noisy? Or even a silence_autoscaling_alerts (or similarly named) config option?
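A rough sketch of how those two knobs could shape the event, assuming a per-service config dict. `silence_autoscaling_alerts` is the option name floated above; `autoscaling_realert_every`, `DEFAULT_REALERT_EVERY`, and `build_limit_alert` are hypothetical, and the dict only approximates a Sensu-style event rather than matching any real client API.

```python
from typing import Optional

# Hypothetical default: a deliberately long re-alert interval to keep noise down.
# The unit depends on how the Sensu handler interprets realert_every.
DEFAULT_REALERT_EVERY = 1440


def build_limit_alert(service: str, limit_hit: str, config: dict) -> Optional[dict]:
    """Return an event payload for a limit-hit condition, or None if alerts are silenced."""
    if config.get("silence_autoscaling_alerts", False):
        return None  # the service owner opted out of these alerts
    return {
        "name": f"autoscaler_hit_{limit_hit}.{service}",
        "status": 1,  # warning rather than critical, since paasta is doing what it was told
        "output": f"{service} would have scaled past its {limit_hit} limit",
        "realert_every": config.get("autoscaling_realert_every", DEFAULT_REALERT_EVERY),
    }
```

A service that sets `silence_autoscaling_alerts: true` gets no event at all; everyone else gets one that re-alerts rarely by default.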

Rob-Johnson avatar May 09 '16 11:05 Rob-Johnson

So far, the alerts paasta generates are all definitively error conditions where the infrastructure couldn't do what the user asked.

In this case, the alert is more of a suggestion (you could scale down more if you wanted to / I would scale up more if you allowed me to). But in both cases paasta is still doing what it is configured to do. In my opinion this is the right level of alerting for this kind of infrastructure.

I still consider this 1 vote, anyone else want to weigh in?

solarkennedy avatar May 09 '16 19:05 solarkennedy

I raised a dupe of this today internally, I'll give my two penneth:

I think we should alert when a service hits max and can't scale up. This would have helped us twice this week when a service was hitting its max limit and serving errors. You could definitely argue that something else should have paged, but I think if paasta "knows" you are under-provisioned and hitting a limit then the service owner should get a page.

Basically there are two causes:

  • Your service is hitting the wall
  • You've not configured the CPU/mem requirements etc. correctly, so paasta thinks you're hitting the wall but you're bursting enough to keep the service running.

In both cases I think a page would be useful. I don't think we need one on min_instances, or if we do it could just ticket.

mattmb avatar Apr 03 '17 14:04 mattmb