Monitor queue size

Open henare opened this issue 8 years ago • 14 comments

We've got a big queue again (280 items) and I didn't know about it. Can we do this in New Relic easily?

henare avatar Apr 19 '16 10:04 henare

@henare which queue are we talking about? I assume a Sidekiq one?

If so, is it addressed by the new graphs?

auxesis avatar Feb 01 '17 04:02 auxesis

Yes, the scraper queue. From my comment it also sounds like I meant alerting too. Maybe we don't need that?

henare avatar Feb 01 '17 04:02 henare

Is anyone going to respond to the alert? If not, don't set one up – it'll just be noise.

auxesis avatar Feb 01 '17 04:02 auxesis

If so, is it addressed by the new graphs?

For future readers, this is a screenshot of the Sidekiq graphs I'm talking about:

image

auxesis avatar Feb 01 '17 04:02 auxesis

Is anyone going to respond to the alert? If not, don't set one up – it'll just be noise.

OP has the rationale:

We've got a big queue again (280 items) and I didn't know about it.

But I guess we now get a Honeybadger notification the first time there are no more slots anyway?

henare avatar Feb 01 '17 04:02 henare

Yep.

Choose one way of being notified, and use that.

If recovery is not time critical, then an email is an OK place for the alert to go and die.

auxesis avatar Feb 01 '17 04:02 auxesis

If so, is it addressed by the new graphs?

@auxesis I don't think so, because it's actually retries that we're most interested in with the current configuration. When Henare says "We've got a big queue again (280 items)", that would be 280 jobs in the retries queue.

I think it would be most useful to have one chart for scheduled jobs so we can see if that's not draining properly. And one that shows retries (and maybe busy in there as well?) so we can see if there's a big backlog.

equivalentideas avatar Feb 01 '17 04:02 equivalentideas

I think it would be most useful to have one chart for scheduled jobs so we can see if that's not draining properly. And one that shows retries (and maybe busy in there as well?) so we can see if there's a big backlog.

Gotcha @equivalentideas. I think the collectd-sidekiq-plugin needs some work so it displays stats from the retry queues properly.

Let's leave this open, but feel free to assign to me @henare / @equivalentideas.

auxesis avatar Feb 01 '17 04:02 auxesis

I think the collectd-sidekiq-plugin needs some work so it displays stats from the retry queues properly.

Per @equivalentideas's feedback, I've added support in collectd-sidekiq-plugin to collect retries, scheduled, dead, and number of processes.

This is deployed onto the Morph graphs:

image

image
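To illustrate the shape of those new metrics, here's a minimal Ruby sketch of the kind of gauges the updated plugin would emit. In the real collectd-sidekiq-plugin the sizes come from Sidekiq's API (`Sidekiq::RetrySet`, `Sidekiq::ScheduledSet`, `Sidekiq::DeadSet` and `Sidekiq::ProcessSet` all respond to `#size`); the sizes are passed in here so the sketch runs without Redis, and the metric names are hypothetical, not copied from the plugin.

```ruby
# Illustrative sketch: build the gauge readings the plugin would send
# to collectd. Sizes are injected so this runs without a live Redis;
# in the plugin they'd come from Sidekiq's RetrySet/ScheduledSet/
# DeadSet/ProcessSet. Metric names below are made up for illustration.
def sidekiq_gauges(retries:, scheduled:, dead:, processes:)
  {
    "sidekiq/gauge-retries"   => retries,
    "sidekiq/gauge-scheduled" => scheduled,
    "sidekiq/gauge-dead"      => dead,
    "sidekiq/gauge-processes" => processes,
  }
end
```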

I've also added a panel that calculates the current percentage of jobs that are failed vs succeeded:

image

The colour thresholding is set to go orange on > 50, and red on > 80.
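The panel logic could be sketched like this in Ruby; the percentage calculation and the orange/red thresholds (> 50, > 80) are from the comment above, while the function names and the failed/succeeded inputs are illustrative, not from the actual dashboard config.

```ruby
# Sketch of the failed-vs-succeeded panel: compute the failure
# percentage and map it onto the colour thresholds described above
# (orange when > 50, red when > 80).
def failed_percentage(failed, succeeded)
  total = failed + succeeded
  return 0.0 if total.zero?
  (failed.to_f / total) * 100
end

def panel_colour(percentage)
  return :red    if percentage > 80
  return :orange if percentage > 50
  :green
end
```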

auxesis avatar Feb 02 '17 11:02 auxesis

Here's the retries, dead, scheduled graph with a log(10) scale on the y axis:

image

This makes it a bit easier to see changes to retries at the same time as the number of scheduled jobs.

auxesis avatar Feb 02 '17 19:02 auxesis

Here's the retries, dead, scheduled graph with a log(10) scale on the y axis:

image (screenshot, 2017-02-06 4:24 pm)

This suits my use well.

I reckon it should span the whole way across to match the load chart above.

It's much more important than the 'processed/failed' chart that it's currently next to, because that chart doesn't really move in a noticeable way on production (there are over 2 million jobs so you can't see slight changes in the ratio).

equivalentideas avatar Feb 06 '17 05:02 equivalentideas

I reckon it should span the whole way across to match the load chart above.

@equivalentideas how about this?

image

auxesis avatar Feb 06 '17 06:02 auxesis

Nice 👍

equivalentideas avatar Feb 07 '17 02:02 equivalentideas

The charts are now really useful for seeing if something is going wrong, but we still don't have the notifying aspect in place.

@auxesis whenever you've got time to add it. I think to close this issue we should set up an alert (like our disc space one) into our #morph Slack channel for when the retry queue or the enqueued queue goes over 50.

This is based on looking at our wonderful charts http://graphs.morph.io/dashboard/db/morph?from=now-6M&to=now and seeing what an incident looks like when the queue stops working.
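The proposed alert condition could be sketched like this. The threshold of 50 for both the retry and enqueued queues is from the comment above; the Slack delivery itself is out of scope, so this just returns the messages that would be posted to #morph. Function and message names are illustrative.

```ruby
# Minimal sketch of the alert check proposed above: flag when either
# the retry queue or the enqueued queue goes over 50. Returns the
# messages that would be sent to the Slack channel.
def queue_alerts(retry_size, enqueued_size, threshold: 50)
  alerts = []
  alerts << "retry queue at #{retry_size} (threshold #{threshold})" if retry_size > threshold
  alerts << "enqueued queue at #{enqueued_size} (threshold #{threshold})" if enqueued_size > threshold
  alerts
end
```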

equivalentideas avatar May 22 '17 00:05 equivalentideas