
Adding more metrics

Open Tomasz-Kluczkowski opened this issue 3 years ago • 2 comments

Hi,

I have just found this great repo right when we are adding a lot of monitoring at work (also for celery).

I would love to have a few more metrics (don't we all :P):

  • queuing time
  • task run time
  • queue length (per worker, per task name, etc.). This may not be easy since each broker requires different treatment, but RabbitMQ/Redis are the most popular, so we could start with those (see the rough sketch for Redis below).
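
For Redis, for instance, the pending messages for a queue live in a Redis list named after the queue, so its length can be read directly. A rough sketch, assuming the default celery queue name and a local Redis (RabbitMQ would need its own mechanism, e.g. the management API):

```python
# Rough sketch: queue length for a Redis broker. Queue name and Redis URL
# are placeholders; RabbitMQ would need a different approach.
import redis

def redis_queue_length(queue_name: str = "celery",
                       url: str = "redis://localhost:6379/0") -> int:
    client = redis.Redis.from_url(url)
    # Celery (via kombu) keeps pending messages in a Redis list
    # with the same name as the queue.
    return client.llen(queue_name)
```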

Our workload is strictly CPU-intensive, and we need to know when we are saturating the workers and need to buy more VMs.

I was thinking of trying to implement this and making a PR; it would be a damn shame to do it just for my work :)

Would you be OK with me making a PR for this? I can't promise anything time-wise as it is super busy right now, but I should get started this weekend (grab the repo, start testing my options, read the code, and sketch out a design approach).

I think the main idea is to reuse what you already have and add more handlers. To measure the queueing time it's simply: task started time - task received time, but please correct me if I am wrong here.

We could also check the latency: task received time - task sent time. Task run time should already be present on the event itself, I believe (I haven't inspected the events in debug mode for a while, though).
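
Roughly what I have in mind, just as a sketch: the uuid/timestamp fields come from Celery's documented task events (task-sent events require task_send_sent_event = True), but the handler wiring below is only my assumption, not the exporter's current code:

```python
# Sketch only: computing queueing time and latency from raw task events.
# How the handlers are registered is illustrative.
sent_at = {}      # task uuid -> timestamp of task-sent
received_at = {}  # task uuid -> timestamp of task-received

def on_task_sent(event):
    sent_at[event["uuid"]] = event["timestamp"]

def on_task_received(event):
    uuid = event["uuid"]
    received_at[uuid] = event["timestamp"]
    if uuid in sent_at:
        latency = received_at[uuid] - sent_at.pop(uuid)
        # export latency here, e.g. observe it on a Histogram

def on_task_started(event):
    uuid = event["uuid"]
    if uuid in received_at:
        queueing_time = event["timestamp"] - received_at.pop(uuid)
        # export queueing_time here
```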

Initially the metrics could be labelled by host/task name like the existing ones you have; later they could maybe be combined for queues/workers if possible.

Please let me know what you think about this idea and whether a contribution is OK :).

Tom

Tomasz-Kluczkowski avatar Mar 19 '21 00:03 Tomasz-Kluczkowski

I have just found this great repo right when we are adding a lot of monitoring at work (also for celery).

Thank you!

I was thinking of trying to implement this and making a PR; it would be a damn shame to do it just for my work :)

I think that's a good idea.

I think the main idea is to reuse what you already have and add more handlers.

I've made the worker as simple as possible by hooking into the Celery event system and exporting whatever events the system publishes. There is little custom logic to calculate additional metrics, and that has worked for the clients I work with so far. We are mostly interested in the task failure / task success rate. We alert if tasks fail to execute above a certain % threshold, and also if a scheduled task has not executed at all within a given time window.

https://docs.celeryproject.org/en/latest/userguide/monitoring.html#task-events
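
Roughly, the pattern is the one from the monitoring guide: consume events with a Receiver and bump a metric in each handler. A trimmed-down sketch, not the exporter's exact code (metric name, port and broker URL are placeholders):

```python
# Minimal sketch of consuming Celery task events and counting them with
# prometheus_client. Metric name, port and broker URL are placeholders.
from celery import Celery
from prometheus_client import Counter, start_http_server

app = Celery(broker="redis://localhost:6379/0")
tasks_total = Counter("celery_tasks_total",
                      "Celery task events by type and task name",
                      ["event", "name"])

def make_handler(state, event_type):
    def handler(event):
        state.event(event)                     # keep Celery's state up to date
        task = state.tasks.get(event["uuid"])  # resolve the task name
        name = task.name if task and task.name else "unknown"
        tasks_total.labels(event=event_type, name=name).inc()
    return handler

def main():
    start_http_server(8888)                    # expose /metrics
    state = app.events.State()
    with app.connection() as connection:
        recv = app.events.Receiver(connection, handlers={
            "task-succeeded": make_handler(state, "task-succeeded"),
            "task-failed": make_handler(state, "task-failed"),
        })
        recv.capture(limit=None, timeout=None)

if __name__ == "__main__":
    main()
```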

To measure the queueing time it's simply: task started time - task received time, but please correct me if I am wrong here.

Off the top of my head this would be a little tricky, because the simplest approach I can think of is to make the exporter stateful and store UUIDs + timestamps for each task in memory in order to calculate the execution time. You would have to be careful that this doesn't cause a memory leak with an increasing number of tasks.
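
If we go down that route, the per-task state would have to be bounded somehow, e.g. a capped dict that evicts the oldest entries. A sketch only; the cap below is an arbitrary number, and a TTL-based cache would work just as well:

```python
# Sketch of bounding the in-memory uuid -> timestamp store so it cannot
# grow without limit. The cap is arbitrary.
from collections import OrderedDict

class BoundedTimestamps:
    def __init__(self, max_entries: int = 50_000):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def put(self, task_uuid: str, timestamp: float) -> None:
        self._data[task_uuid] = timestamp
        self._data.move_to_end(task_uuid)
        # Drop the oldest entries once the cap is exceeded.
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)

    def pop(self, task_uuid: str):
        # None means the task was never seen or has already been evicted.
        return self._data.pop(task_uuid, None)
```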

Initially the metrics could be labelled by host/task name like the existing ones you have; later they could maybe be combined for queues/workers if possible.

Hostname represents the worker name. You can set a custom one using celery worker --hostname=<name> when starting the worker.

later they could maybe be combined for queues

Queue names were a little tricky to test - IIRC the in-memory broker I use for tests was not exposing this, but my memory might be off. If we add this, it should ideally not break for any broker type.

Please let me know what you think about this idea and whether a contribution is OK :).

A good idea is to adhere to the Prometheus exporter best practices here: https://prometheus.io/docs/instrumenting/writing_exporters/

Good luck!

danihodovic avatar Mar 19 '21 06:03 danihodovic

Hi, thx for a speedy reply :)

I have some preliminary results which are looking good. The code needs some refactoring, as with what I'm adding I'm really breaking the single responsibility principle, and track_task_event becomes a monstrosity that does everything.

I will look into making it a bit neater (separating stuff into small methods, etc.) before sharing my changes.

I have not yet checked whether the in-memory broker provides queue names etc., but hopefully it does. If that is the case, we could safely add queue to the set of labels used by the metrics.

I'm attaching some images to show that what I want to achieve works (for the queuing time I just use a Gauge) :).

All I have is a simple Celery app with concurrency 1 and a task which sleeps for a set amount of time. The tasks sent after the first one have to wait for it to finish sleeping, so the queuing time goes up. The data is there and easily obtainable, thank god :).
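
In code, the test setup is roughly this (module name, broker URL and sleep duration below are just placeholders):

```python
# Roughly the test setup described above: one worker with concurrency 1
# and a task that sleeps, so queued tasks accumulate wait time.
import time
from celery import Celery

app = Celery("demo", broker="redis://localhost:6379/0")

@app.task
def sleepy(seconds: float = 10.0) -> None:
    time.sleep(seconds)

# Start the worker:   celery -A demo worker --concurrency=1
# Enqueue some tasks: call sleepy.delay(10) a few times in a row.
```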

Screenshot from 2021-03-20 22-53-04

Tomasz-Kluczkowski avatar Mar 20 '21 23:03 Tomasz-Kluczkowski