morph
Monitor queue size
We've got a big queue again (280 items) and I didn't know about it. Can we do this in New Relic easily?
@henare which queue are we talking about? I assume a Sidekiq one?
If so, is it addressed by the new graphs?
Yes, the scraper queue. From my comment above, it also sounds like I meant alerting too. Maybe we don't need that?
Is anyone going to respond to the alert? If not, don't set one up – it'll just be noise.
If so, is it addressed by the new graphs?
For future readers, this is a screenshot of the Sidekiq graphs I'm talking about:
Is anyone going to respond to the alert? If not, don't set one up – it'll just be noise.
OP has the rationale:
We've got a big queue again (280 items) and I didn't know about it.
But I guess we now get a Honeybadger notification the first time there are no more slots anyway?
Yep.
Choose one way of being notified, and use that.
If recovery is not time critical, then an email is an OK place for the alert to go and die.
If so, is it addressed by the new graphs?
@auxesis I don't think so, because in the current configuration it's actually retries that we're most interested in. When Henare says "We've got a big queue again (280 items)", that would be 280 jobs in the retries queue.
I think it would be most useful to have one chart for scheduled jobs so we can see if that's not draining properly. And one that shows retries (and maybe busy in there as well?) so we can see if there's a big backlog.
I think it would be most useful to have one chart for scheduled jobs so we can see if that's not draining properly. And one that shows retries (and maybe busy in there as well?) so we can see if there's a big backlog.
Gotcha @equivalentideas. I think the collectd-sidekiq-plugin needs some work so it displays stats from the retry queues properly.
Let's leave this open, but feel free to assign to me @henare / @equivalentideas.
I think the collectd-sidekiq-plugin needs some work so it displays stats from the retry queues properly.
Per @equivalentideas's feedback, I've added support in collectd-sidekiq-plugin to collect retries, scheduled, dead, and number of processes.
This is deployed onto the Morph graphs:
I've also added a panel that calculates the current percentage of jobs that are failed vs succeeded:
The colour thresholding is set to go orange on > 50, and red on > 80.
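For future readers, the thresholding described above can be sketched roughly like this. It is only an illustration: it assumes the percentage is failed / (failed + succeeded), and the function names and the "green" default colour are made up here, not taken from the dashboard config.

```python
def failed_percentage(failed: int, succeeded: int) -> float:
    """Percentage of jobs that failed, out of failed + succeeded (assumed formula)."""
    total = failed + succeeded
    if total == 0:
        return 0.0  # no jobs yet, so nothing has failed
    return 100.0 * failed / total


def panel_colour(percentage: float) -> str:
    """Orange above 50, red above 80; 'green' as the default is an assumption."""
    if percentage > 80:
        return "red"
    if percentage > 50:
        return "orange"
    return "green"


print(panel_colour(failed_percentage(failed=60, succeeded=40)))
```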
Here's the retries, dead, scheduled graph with a log(10) scale on the y axis:
This makes it a bit easier to see changes to retries at the same time as the number of scheduled jobs.
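A quick sketch of why the log scale helps, using made-up sample values (not real Morph numbers): on a log(10) axis a series in the tens and a series in the thousands sit only a couple of units apart, so both stay readable on one chart.

```python
import math


def log10_position(value: float, floor: float = 1.0) -> float:
    """Position of a value on a log(10) axis; values below `floor` are clamped to it."""
    return math.log10(max(value, floor))


# Hypothetical sample values, purely for illustration.
samples = {"retries": 30, "dead": 5, "scheduled": 3000}

for name, value in samples.items():
    print(f"{name}: raw={value}, log10 position={log10_position(value):.2f}")
```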
Here's the retries, dead, scheduled graph with a log(10) scale on the y axis:
This is suiting my use well.
I reckon it should span the whole way across to match the load chart above.
It's much more important than the 'processed/failed' chart that it's currently next to, because that chart doesn't really move in a noticeable way on production (there are over 2 million jobs, so you can't see slight changes in the ratio).
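To put rough numbers on that: a back-of-the-envelope sketch, assuming 2,000,000 processed jobs and a made-up baseline of 100,000 failures, then adding the 280 failing jobs from the original report. The ratio moves by roughly 0.013 of a percentage point, which is invisible on the chart.

```python
def failed_ratio_percent(failed: int, processed: int) -> float:
    """Failed jobs as a percentage of all processed jobs."""
    return 100.0 * failed / processed


# Hypothetical baseline: the 100,000 failed figure is invented for illustration.
before = failed_ratio_percent(failed=100_000, processed=2_000_000)

# 280 more jobs are processed and every one of them fails.
after = failed_ratio_percent(failed=100_280, processed=2_000_280)

print(f"before={before:.4f}%  after={after:.4f}%  shift={after - before:.4f} points")
```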
I reckon it should span the whole way across to match the load chart above.
@equivalentideas how about this?
Nice 👍
The charts are now really useful for seeing if something is going wrong, but we still don't have the notifying aspect in place.
@auxesis Whenever you've got time to add it, I think that to close this issue we should set up an alert (like our disc space one) that posts to our #morph Slack channel when the retry queue goes over 50 or the enqueued queue goes over 50.
This is based on looking at our wonderful charts http://graphs.morph.io/dashboard/db/morph?from=now-6M&to=now and seeing what an incident looks like when the queue stops working.
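The rule I have in mind is something like the sketch below. The function names and message format are hypothetical, and it leaves out how the counts are fetched and how the message gets posted to Slack; only the two thresholds of 50 come from the proposal above.

```python
from typing import Optional

RETRY_THRESHOLD = 50     # alert when the retry queue goes over this
ENQUEUED_THRESHOLD = 50  # alert when the enqueued queue goes over this


def should_alert(retries: int, enqueued: int) -> bool:
    """True when either queue is strictly over its threshold."""
    return retries > RETRY_THRESHOLD or enqueued > ENQUEUED_THRESHOLD


def alert_message(retries: int, enqueued: int) -> Optional[str]:
    """Build the text for the channel, or None when no alert is needed."""
    if not should_alert(retries, enqueued):
        return None
    return f"Sidekiq queue alert: retries={retries}, enqueued={enqueued}"


print(alert_message(retries=280, enqueued=3))
```

Using a strict "over 50" comparison means a queue sitting at exactly 50 stays quiet, which matches the wording above.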