airflow
airflow copied to clipboard
Missing Metrics After Migrating from StatsD to OpenTelemetry
Apache Airflow version
2.9.3
If "Other Airflow 2 version" selected, which one?
No response
What happened?
I’ve recently migrated from StatsD to OpenTelemetry and am now sending metrics to Splunk Observability.
However, I’ve noticed that some metrics are now missing after this change, such as operator_successes, operator_failures, and some custom metrics that I used to see when using StatsD with the Python stats package in my DAGs.
The common factor among the missing metrics is that their descriptions in the Airflow documentation mention “Metric with xyz tagging.” I’m not sure if this is relevant.
What you think should happen instead?
All metrics, including operator_successes, operator_failures, and custom metrics, should be available as they were with StatsD.
How to reproduce
- Migrate from StatsD to OpenTelemetry.
- Send metrics to Splunk Observability.
- Observe the missing metrics such as
operator_successes,operator_failures, andcustom metrics.
Operating System
Azure AKS linux
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
cc: @howardyoo
Got the same thing on 2.10.2, but without migration from statsd. Tested locally - metrics in statsd are present, in otel many are missing, especially counters (ti_failures, ti_successed, zombies_killed etc.) The pattern with presence of tag breakdown from @seifrajhi seems to be relevant, but there are other losses, such as operator_failures_<operator_name>, for example.
Let me know if you need more information about the configuration, I'd be happy to help
Got the same thing on 2.10.2, but without migration from statsd. Tested locally - metrics in statsd are present, in otel many are missing, especially counters (
ti_failures,ti_successed,zombies_killedetc.) The pattern with presence of tag breakdown from @seifrajhi seems to be relevant, but there are other losses, such asoperator_failures_<operator_name>, for example.Let me know if you need more information about the configuration, I'd be happy to help
Just a question, are these metrics consistently missing, or randomly missing? Also, what were the names of the metrics? any examples? This could either be due to the metrics name being too longe and getting truncated, or if Airflow did not have enough time to emit the metrics before something terminated.
Just a question, are these metrics consistently missing, or randomly missing? Also, what were the names of the metrics? any examples? This could either be due to the metrics name being too longe and getting truncated, or if Airflow did not have enough time to emit the metrics before something terminated.
I'll list a few with behaviors
ti_successesis missing completely;ti.startis missing completely;ti.finishis missing completely;operator_successes,_failuresare missing in any form;executor_running_tasksdoes not catch fast executing tasks;job_startis present as it is, not in<job_name>_startformat and is always equal to 1, and_endis not present at all.
I tested locally on dag with one task, which is executed once a minute, stably successful. Just in case I attach screenshots of all airflow metrics I see in Prometheus
@ferruzzi , could you take a look at this? This is related to the issue that some of the metrics that stats is generating is not appearing when using OTEL.
@dannyl1u is looking into it
Can confirm I've been able to reproduce this bug in version 3.0.0
@ferruzzi and I had a good look into this. We suspect the bug is caused by a new resource in SafeOtelLogger being created each time Stats.incr() etc. is called
https://github.com/apache/airflow/blob/5a0272c272e412e133c2031d237d05cf12a783ef/airflow/metrics/otel_logger.py#L402
Here Resource.create(attributes={HOST_NAME: get_hostname(), SERVICE_NAME: service_name}) is called, but we should be checking if a resource has already been created and to use the already created resource, instead of creating a new one each time (which "overwrites" the previously created resource)
@howardyoo would like to know your opinion on this
((That last comment only applies to custom metrics in the DAG not getting emitted and would not explain "core" metrics being dropped))
Hmm.. the Resource.create(..) should not be recreated every time there is an increase of counter, changes in gauges, or histograms. Resource attributes are meant to have the same lifespan of the tracer provider itself.
Maybe a good way is to refactor the code such that resource attributes to be declared as global singleton and every get otel logger function is not recreating that all the time, since the resource attributes are const and should not alter.
@howardyoo @ferruzzi
https://github.com/apache/airflow/pull/44268/files
I made get_otel_logger return a singleton instance (as above) of SafeOtelLogger, tried rerunning it and same bug still occurs. Is this what you meant?
In that case, if the resource is not getting recreated, perhaps the cause of this bug is still not known..?Sent from my iPhoneOn Nov 21, 2024, at 3:48 PM, Danny Liu @.***> wrote: @howardyoo @ferruzzi https://github.com/apache/airflow/pull/44268/files I made get_otel_logger return a singleton instance (as above) of SafeOtelLogger, tried rerunning it and same bug still occurs. Is this what you meant?
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>
Any progress with this issue?
Any progress with this issue?
You're welcome to ask for assignment and try to resolve it :) Airflow is based on contributions from the open community, so if currently no one is actively working on it - there will be no progress.
Hi all, I wanted to switch our Airflow to use OTEL and run into the same issue. I debugged into the issue and found:
-
metrics like "ti.start" and "ti.finish" are exported in the worker context with label dag_id and task_id. The metrics are only available during the time where the task is running. Looks metrics are gone after task finished and then the metric is removed after ~ 5min from the OtelCollector. Maybe because the metrics like "ti.start" are exported in the worker context an the OtelLogger is gone if the worker task is finished?
-
I´m not sure if singleton instance can solve the issue as Airflow is able to handle multiple workers, schedulers and .... For me it looks like the metrics which have the same labels are overwriting each other if they are exported by two different Pods (E.g. 2 workers). So not possible to increase a counter like "ti.start" with label "dag_id" and "task_id" if 2 dag_runs are running in parallel on different workers. Expecting: airflow_ti_start{dag_id="dag1", task_id="task1"} 2 but getting: airflow_ti_start{dag_id="dag1", task_id="task1"} 1
Two main issue to solve:
- Any idea how to make the OtelLogger static for one worker pod as the tasks are executed e.g. with the StandardTaskRunner, but then we still have to solve 2)?
- How to get trust able metrics in multi Pod deployments.
I write here to start the discussion again and get also some feedback from you all to improve the thinks. Not sure about the best way to fix this issue. We need to solve both points to get usable metrics via Otel.
Started already to improve Airflow Otel implementation a different point https://github.com/apache/airflow/pull/46510
Just adding more details to the issue:
- metrics like "ti.start" and "ti.finish" are exported in the worker context with label dag_id and task_id. The metrics are only available during the time where the task is running. Looks metrics are gone after task finished and then the metric is removed after ~ 5min from the OtelCollector. Maybe because the metrics like "ti.start" are exported in the worker context an the OtelLogger is gone if the worker task is finished?
I have tested this out, seems like this is the issue for some of the missing metrics.
I have a simple hello world task running with a PythonOperator. When running the task normally, no metric on a task level was appearing in the otel-collector logs and in my prometheus.
As a hello world task runs almost instantly there is no time for the otel-collector to grab this metrics and expose it. I added manually a sleep of 1 minute and then the metrics "ti.start" started to appear in both otel-collector logs and prometheus. The metrics of instance "ti.finish" were also correctly initialized with zero. The problem is: we will never be able to get the final task state metric updated because once the task is finalized the metrics are gone without giving time to otel-collector to grab and export those.
Not sure how the fix for this would look like.
I believe that's what was currently happening. In order to perhaps fix this issue, we might want to explore to jettison (or flush) out the metrics when the operator finishes, so that when the operation finishes, those metrics are not lost.
Another comment is that perhaps rather than monitoring when the task has started and finished using metrics, a better approach could be to monitor them using traces, since Airflow's trace implementation does not have this issue, and is much more natural way of monitoring DAG runs (this may not be relevant to the thread, but I've always believed traces are better way to observe something running)
On Tue, Feb 25, 2025 at 3:40 AM Paulo Andre de Oliveira Carneiro < @.***> wrote:
Just adding more details to the issue:
- metrics like "ti.start" and "ti.finish" are exported in the worker context with label dag_id and task_id. The metrics are only available during the time where the task is running. Looks metrics are gone after task finished and then the metric is removed after ~ 5min from the OtelCollector. Maybe because the metrics like "ti.start" are exported in the worker context an the OtelLogger is gone if the worker task is finished?
I have tested this out, seems like this is the issue for some of the missing metrics.
I have a simple hello world task running with a PythonOperator. When running the task normally, no metric on a task level was appearing in the otel-collector logs and in my prometheus.
As a hello world task runs almost instantly there is no time for the otel-collector to grab this metrics and expose it. I added manually a sleep of 1 minute and then the metrics "ti.start" started to appear in both otel-collector logs and prometheus. The metrics of instance "ti.finish" were also correctly initialized with zero https://github.com/apache/airflow/blob/42406162cac1d5f899dd22c91d9caa06b19956e5/airflow/models/taskinstance.py#L246. The problem is: we will never be able to get the final task state metric updated because once the task is finalized the metrics are gone without giving time to otel-collector to grab and export those.
Not sure how the fix for this would look like.
— Reply to this email directly, view it on GitHub https://github.com/apache/airflow/issues/41822#issuecomment-2681345475, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHZNLLWHRCQCNHKPWACJJCT2RQ3CTAVCNFSM6AAAAABNH7AQCOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMOBRGM2DKNBXGU . You are receiving this because you were mentioned.Message ID: @.***> [image: PauloCarneiro99]PauloCarneiro99 left a comment (apache/airflow#41822) https://github.com/apache/airflow/issues/41822#issuecomment-2681345475
Just adding more details to the issue:
- metrics like "ti.start" and "ti.finish" are exported in the worker context with label dag_id and task_id. The metrics are only available during the time where the task is running. Looks metrics are gone after task finished and then the metric is removed after ~ 5min from the OtelCollector. Maybe because the metrics like "ti.start" are exported in the worker context an the OtelLogger is gone if the worker task is finished?
I have tested this out, seems like this is the issue for some of the missing metrics.
I have a simple hello world task running with a PythonOperator. When running the task normally, no metric on a task level was appearing in the otel-collector logs and in my prometheus.
As a hello world task runs almost instantly there is no time for the otel-collector to grab this metrics and expose it. I added manually a sleep of 1 minute and then the metrics "ti.start" started to appear in both otel-collector logs and prometheus. The metrics of instance "ti.finish" were also correctly initialized with zero https://github.com/apache/airflow/blob/42406162cac1d5f899dd22c91d9caa06b19956e5/airflow/models/taskinstance.py#L246. The problem is: we will never be able to get the final task state metric updated because once the task is finalized the metrics are gone without giving time to otel-collector to grab and export those.
Not sure how the fix for this would look like.
— Reply to this email directly, view it on GitHub https://github.com/apache/airflow/issues/41822#issuecomment-2681345475, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHZNLLWHRCQCNHKPWACJJCT2RQ3CTAVCNFSM6AAAAABNH7AQCOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMOBRGM2DKNBXGU . You are receiving this because you were mentioned.Message ID: @.***>
rather than monitoring when the task has started and finished using metrics, a better approach could be to monitor them using traces
@howardyoo mhm, switching all metrics to traces would be a huge re-write in the core and be a breaking change towards StatsD. I think this is not a easy fix and would change the strategy completely. Reading the comments I rather fear that we have a general conceptual problem is metrics are "just not working". And for Airflow we need to assume we have a distributed environment. Relying on singletons is... how are other applications implementing this? Airflow can not be the first using Otel?
Airflow recently release support for Otel traces, which emits trace in otel, if you haven’t tried. It does not replace the metrics but adds additional details on how dags are running, so might be an option.Sent from my iPhoneOn Feb 27, 2025, at 3:36 PM, Jens Scheffler @.***> wrote:
rather than monitoring when the task has started and finished using metrics, a better approach could be to monitor them using traces
@howardyoo mhm, switching all metrics to traces would be a huge re-write in the core and be a breaking change towards StatsD. I think this is not a easy fix and would change the strategy completely. Reading the comments I rather fear that we have a general conceptual problem is metrics are "just not working". And for Airflow we need to assume we have a distributed environment. Relying on singletons is... how are other applications implementing this? Airflow can not be the first using Otel?—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>
jscheffl left a comment (apache/airflow#41822)
rather than monitoring when the task has started and finished using metrics, a better approach could be to monitor them using traces
@howardyoo mhm, switching all metrics to traces would be a huge re-write in the core and be a breaking change towards StatsD. I think this is not a easy fix and would change the strategy completely. Reading the comments I rather fear that we have a general conceptual problem is metrics are "just not working". And for Airflow we need to assume we have a distributed environment. Relying on singletons is... how are other applications implementing this? Airflow can not be the first using Otel?
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>
I made a personal workaround so that people who already migrated to Otel could use this until it is officially fixed.
Cause of issue (as I understand)
Flushing from Airflow processes to OpenTelemetry collector relies on periodic flush (default 60s). There is no guaranteed force flush at the end of the process.
Scheduler or Celery worker processes are longlived, so they have no problem. However, task instance processes (subprocess of the worker process) are ephemeral and usually spans shorter than the periodic flush, so they might never flush once.
How to verify this is the cause?
I tested with this dummy DAG. if you activate sleep(), you can see the metrics in otelcol; if you remove it, you mostly cannot.
from airflow.stats import Stats
def dummy():
Stats.gauge("dummy_gauge", 1, {"tag1": "value1", "tag2": "value2"})
Stats.incr("dummy_incr", 1, {"tag3": "value3", "tag4": "value4"})
# sleep(70)
with DAG(...):
_ = PythonOperator(
python_callable=dummy,
task_id="dummy",
)
Possible Workaround
Add a force-flush callback on every task instance.
# DAG(default_args=default_args, ...)
default_args = {
"on_success_callback": flush_metrics,
"on_failure_callback": flush_metrics,
"on_retry_callback": flush_metrics,
"on_skipped_callback": flush_metrics,
}
def flush_metrics(context: dict[str, Any]) -> None:
def flush(provider: Any) -> None:
if provider and hasattr(provider, "force_flush"):
provider.force_flush()
flush(metrics.get_meter_provider())
flush(getattr(metrics.get_meter_provider(), "_real_meter_provider", None))
Same function should be also available via plugins.event_listener. (didn't try it by myself) https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/example_dags/plugins/event_listener/index.html#module-contents
@xBis7 -> maybe you could take a look at that ? I think that might be nice continuation of your improvements on OTEL ?
@potiuk Thanks for pinging me. I have a few patches related to metrics that I'm planning to contribute. This is interesting, I'll add it to my backlog.
Hello everyone, I've started looking into this.
I'm doing some testing with both statsd and OTel and so far I've found a lot of metrics that are missing from both. But I haven't found any that are present in statsd and missing from OTel.
It looks like it has to do with isolated and short-lived processes like the one running a task. Metrics from these don't get exported.
Update: I found metrics missing just from OTel and I can confirm what previous comments have already stated.
Yeah. thought so.
Hello, everyone! We have noticed the same issue as reported here since June/25. I passed by only to add to the discussion, not that we could do something (won't be upgrading Airflow versions anytime soon).
We use Airflow 2.5.1 on k8s with statsd enabled for Datadog, but just some are registered. the "Airflow did not have enough time to emit the metrics before something terminated" hypothesis is strong and makes sense; is it an actual issue?
In our case, we have legacy Airflow DAGs that are roughly composed by boto3 EMR interactions for 1. cluster creations and 2. steps additions. Our working metrics' list differs from @Ibraitas' one up there. For the ones that arrive there (e.g., ti.finish), if the task fails fast enough, we don't get the measurement with state=failed, so our monitor end up not noticing the failure.
As we intend to migrate those DAGs in the next few months to another completely different thing, using long running tasks and sensors attached, it will mitigate the fail fast issue. I can come here to tell whether it fixed our issue or not. Still, I'd like to leave the question if the "fail fast" problem is known and which Airflow version fixes it.
Thank you!
@paulochf You probably don't see all the metrics with statsd because 2.5.1 is a very old version. I think the earliest supported version at the moment that is also regularly tested for backwards compatibility, is 2.10.X.
I've been testing in 3.1.X and I don't see any problems with statsd. The issue is with OTel because the metrics are exported in batches and some subprocesses end so fast that there isn't enough time to flush them. statsd on the other hand, exports the metrics the moment that they are captured.
I would suggest for you to upgrade soon. The differences in the architecture are so many that it's like we are discussing different projects.