airflow-exporter
Dag and task metrics should be initialized to zero at startup
Airflow metrics are not reset after a restart, but they are also not initialized at startup. This leads to unexpected PromQL results when a query covers a period in which a series did not yet exist.
For example, the series for a task with state 'failed' first appears with the value '1' at the task's first failure; before that failure, no data exists for the task with state 'failed'. A PromQL query that uses the 'increase' function to check whether the task executed at least once over a time period, based on an increase in either the 'success' or the 'failed' count, responds as if neither state changed over that period. This happens because 'increase' only compares samples that exist inside the window, so the step from no data to '1' is never counted.
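The effect can be reproduced with a small model. The sketch below is not Prometheus code: `simple_increase` is a hypothetical helper that takes the last sample minus the first sample inside the window, and it deliberately ignores the extrapolation and counter-reset handling that the real 'increase' function performs. The sample data is made up for illustration.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_increase(samples: List[Sample], start: float, end: float) -> Optional[float]:
    """Simplified stand-in for increase(): last minus first sample in the window.

    Assumption: no extrapolation and no counter-reset handling.
    Returns None when fewer than two samples fall inside the window.
    """
    window = [value for ts, value in samples if start <= ts <= end]
    if len(window) < 2:
        return None
    return window[-1] - window[0]


# Series that only comes into existence at the first failure (t=300s), at value 1.
uninitialized: List[Sample] = [(300.0, 1.0), (360.0, 1.0), (420.0, 1.0)]

# Same task, but the series is exposed as 0 from startup onwards.
initialized: List[Sample] = [(float(t), 0.0) for t in range(0, 300, 60)] + uninitialized

print(simple_increase(uninitialized, 0, 600))  # 0.0: the failure is invisible
print(simple_increase(initialized, 0, 600))    # 1.0: the failure is counted
```

In this model the failure only becomes visible once a zero sample precedes it, which is the behaviour zero initialization is meant to provide.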
The Prometheus documentation discusses this issue:
- https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics
A potential fix for this issue is to initialize the metrics for all DAGs and their tasks to zero at startup.
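A minimal sketch of what zero initialization could look like, independent of how the exporter actually collects its data. The names `zero_filled_counts`, `known_tasks`, `observed`, and the two-state `TASK_STATES` tuple are assumptions made for this illustration and are not taken from the exporter's code.

```python
from itertools import product
from typing import Dict, Iterable, Mapping, Tuple

# Illustrative subset of task states; the real list is an assumption here.
TASK_STATES = ("success", "failed")

Key = Tuple[str, str, str]  # (dag_id, task_id, status)


def zero_filled_counts(
    known_tasks: Iterable[Tuple[str, str]],
    observed: Mapping[Key, int],
    states: Tuple[str, ...] = TASK_STATES,
) -> Dict[Key, int]:
    """Return a count for every (dag, task, state) combination.

    Combinations that were never observed are reported as 0 instead of
    being left out, so each series exists from the first scrape.
    """
    counts: Dict[Key, int] = {
        (dag_id, task_id, state): 0
        for (dag_id, task_id), state in product(known_tasks, states)
    }
    counts.update(observed)  # observed values overwrite the zero defaults
    return counts


# Hypothetical data: two tasks in one DAG, only one success seen so far.
tasks = [("etl", "extract"), ("etl", "load")]
seen = {("etl", "extract", "success"): 1}

for key, value in sorted(zero_filled_counts(tasks, seen).items()):
    print(key, value)
```

The exporter would then emit one sample per entry of the returned mapping. With the `prometheus_client` library, a comparable effect for labelled metrics is obtained by calling `.labels(...)` for each known combination up front, which creates the child series at its initial value.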
A workaround is the following query:
sum(increase(airflow_task_status{status="failed"}[10m])) without (pod,instance) > 0 or max without(pod, instance) (airflow_task_status{status="failed"} != 0 unless airflow_task_status{status="failed"} offset 10m)
reference: https://github.com/prometheus/prometheus/issues/1673
A caveat with the workaround is that the exporter reports the total count of past failures. When the exporter first starts (or after a sufficiently long interruption in metrics), every task that failed in the past therefore shows up as a new failure. Zero initialization would be the better solution.