celery-prometheus-exporter icon indicating copy to clipboard operation
celery-prometheus-exporter copied to clipboard

celery_tasks_by_name duplicates celery_tasks with additional label (name)

Open javabrett opened this issue 7 years ago • 10 comments

I'm seeing two issues with celery_tasks_by_name.

  1. It goes against the naming recommendation:

Do not put the label names in the metric name, as this introduces redundancy and will cause confusion if the respective labels are aggregated away.

  1. It duplicates celery_tasks, which is a Gauge with label [state], adding label (task) [name]. My (basic) understanding of the Prometheus dimensional model is that this shouldn't be necessary - name could be added to celery_tasks and we can still rollup/aggregate by either state or name severally as well as jointly.

So I can already operate on celery_tasks_by_name, aggregating by state and I get the same values as for celery_tasks, but with _by_name still in the metric-name ... highlighting both issues above.

sum(celery_tasks_by_name) by (state)

Element            | Value
------------------ | -----
{state="FAILURE"}  | 1
{state="RECEIVED"} | 0
{state="STARTED"}  | 0
{state="SUCCESS"}  | 7

... produces the same as:

celery_tasks
Element	                           | Value
---------------------------------- | -----
celery_tasks{...,state="FAILURE"}  | 1
celery_tasks{...,state="RECEIVED"} | 0
celery_tasks{...,state="STARTED"}  | 0
celery_tasks{...,state="SUCCESS"}  | 7

Obviously if these were consolidated that would be a non-backward-compatible change, so might either justify a switch (either new or old default) or a major version-number bump and release-notes.

Interested in feedback on whether the above looks right. Initially I accepted that the state-alone metric might be for efficiency, but I don't know that it is needed for that.

javabrett avatar Aug 16 '18 00:08 javabrett

Issue 1: That makes sense.
Issue 2: If the name label was added to the "celery_tasks" metric, then we can achieve the current celery_tasks by aggregation, as you have correctly stated. The main issue I see here is efficiency (as you have brought up).

celery_tasks_by_name, as it currently stands, creates many time series. For projects with many Celery Tasks, each with 5 states (retry was omitted in your example), aggregating to get that state aggregation may be costly (disclosure, I haven't tested your aggregation query with many celery tasks yet).

However, referencing this doc: https://prometheus.io/docs/prometheus/latest/querying/basics/ in the "Avoiding slow queries and overloads" section, the expensive query (if it does turn out to be expensive) can be mitigated using a recording rule.

I personally like having celery_tasks and celery_tasks_by_name separated. Maybe it's because I'm used to it. It just makes it easier to get a global view of celery task state without having to aggregate.

Plus, when we get the queue name attached, then the celery_tasks metric would be nice to quickly see how many tasks in each queue rather than having to filter them.

miguelHx avatar Aug 17 '18 20:08 miguelHx

Right, I can see how having the very low dimensional task-count by State is convenient to have separately, and at a tiny cost. And on that basis I don't mind keeping it either.

I'm interested though in where we will add the labels/dimensions for queue, exchange, routing_key etc. per #31 . If these are added to celery_tasks, it spoils the above argument of the simple and high-performance task count by State (only). It would also prevent any analysis of intersections with task Name, since that is only available in celery_tasks_by_name. And adding additional labels to that metric adds to aforementioned the naming-confusion.

Also in #33 I've proposed to start collecting runtime, a type of start-end exec duration for a successful task. In a second commit c9879687550ea62992383c81189da5d5b34be3bc I changed the proposed metric name from celery_tasks_runtime_by_name to celery_tasks_runtime. That metric could also end-up with the additional labels being proposed.

I don't know whether to expect any high-dimentionality from any of these - the ones in #31 I suppose are scale-out related? For some systems task name might I guess have 10s-100s of values ... unless task name can be dynamic.

Maybe a new metrics celery_tasks_count could replace a deprecated celery_tasks_by_name, and collect all the new dimensions as well as name?

javabrett avatar Aug 18 '18 07:08 javabrett

I'm interested though in where we will add the labels/dimensions for queue, exchange, routing_key etc. per #31 . If these are added to celery_tasks, it spoils the above argument of the simple and high-performance task count by State (only). It would also prevent any analysis of intersections with task Name, since that is only available in celery_tasks_by_name

Yeah, the analysis of intersections with task name will be useful.

I don't know whether to expect any high-dimentionality from any of these - the ones in #31 I suppose are scale-out related? For some systems task name might I guess have 10s-100s of values ... unless task name can be dynamic.

I don't think the queue labels will have as high a cardinality as the name labels. According to http://docs.celeryproject.org/en/latest/userguide/tasks.html#names, it doesn't look like the task names could be dynamic. Not dynamic in the context of while the program is running, at least.

Even if there are 100s of unique "name" labels yielding 100s of values, it is still useful to be able to do some analysis on specific celery tasks at the cost of (potentially) many value outputs.

Maybe a new metrics celery_tasks_count could replace a deprecated celery_tasks_by_name, and collect all the new dimensions as well as name?

We could do either a celery_tasks_count or celery_tasks_total. I'm also thinking celery_tasks_count_total but that sounds redundant lol. Yes the new dimension would be name to start. After this, we can approach the queue information as dimensions.

Now I'm thinking celery_tasks_total. From the prometheus metric conventions here: https://github.com/prometheus/docs/blob/master/content/docs/practices/naming.md, it states that a metric...

...should have a suffix describing the unit, in plural form. Note that an accumulating count has total as a suffix, in addition to the unit if applicable.

I see that the current celery_tasks metric is of type GAUGE. Should we change to a COUNTER instead?

If we do end up condensing celery_tasks_by_name into celery_tasks_total/count, then I think we should update the README.md with some useful PromQL queries such as the one you have in above.

miguelHx avatar Aug 22 '18 07:08 miguelHx

So what you're proposing is deprecating celery_tasks_by_name and celery_tasks in favor or celery_tasks_count which in turn should have a state and name dimension?

zerok avatar Aug 22 '18 08:08 zerok

I see that the current celery_tasks metric is of type GAUGE. Should we change to a COUNTER instead?

No. https://prometheus.io/docs/concepts/metric_types/#counter

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.

The current celery_tasks metric is the count of tasks in a particular state (some of which are terminal). So the count for certain states can go up then down again. So a Gauge is appropriate. States RECEIVED and STARTED should tend back to zero count. If the celery_tasks metric counted number of tasks ever created, then it should be a Counter, but the state dimension/label changes that.

javabrett avatar Aug 22 '18 11:08 javabrett

@zerok Yes, I believe that is what @javabrett is proposing.

miguelHx avatar Aug 28 '18 23:08 miguelHx

and @javabrett 👍

miguelHx avatar Aug 28 '18 23:08 miguelHx

I should get around to publishing a summary table of current/proposed/deprecated metrics, and incorporating that with the proposed celery_tasks_runtime over in #33.

Here's a first shot at that:

metric type labels status comments
celery_workers gauge [] existing Number of alive workers.
celery_tasks gauge state existing Current tasks by state.
Gauge not counter, counts go up and down as tasks move between states.
Retaining this with single state label as it is a cheap, effective metric.
celery_tasks_by_name gauge state,name existing/deprecated Task count by task name and state.
This metric is deprecated because it has labels in the metric name.
celery_tasks_total gauge state,name,queue,exchange,routing_key proposed #31 Need to capture task-sent event for this. Replaces celery_tasks_by_name.
celery_tasks_runtime histogram name proposed #33 Currently only name label, but propose to add queue, exchange and routing_key when available from #31.

javabrett avatar Aug 28 '18 23:08 javabrett

@zerok added a table above.

I propose retaining celery_tasks indefinitely, since it is cheap and simple and tasks-by-state is useful to have quickly without an aggregation-query. It will return the same as the proposed celery_tasks_total when grouped-by (state).

I propose deprecating celery_tasks_by_name unmodified, but retaining it for whatever is the required deprecation-period. New labels for the same count metric will be added instead to celery_tasks_total.

Actually, need to check if the _total suffix is OK, since this is a gauge not a counter metric.

javabrett avatar Aug 29 '18 03:08 javabrett

I'd be +1 for for _total eventhough it is a gauge as this would be the same behaviour as with a histogram, wouldn't it? +1 on all your proposals 😄

zerok avatar Sep 23 '18 20:09 zerok