airflow icon indicating copy to clipboard operation
airflow copied to clipboard

Missing Metrics After Migrating from StatsD to OpenTelemetry

Open seifrajhi opened this issue 1 year ago • 2 comments

Apache Airflow version

2.9.3

If "Other Airflow 2 version" selected, which one?

No response

What happened?

I’ve recently migrated from StatsD to OpenTelemetry and am now sending metrics to Splunk Observability.

However, I’ve noticed that some metrics are now missing after this change, such as operator_successes, operator_failures, and some custom metrics that I used to see when using StatsD with the Python stats package in my DAGs.

The common factor among the missing metrics is that their descriptions in the Airflow documentation mention “Metric with xyz tagging.” I’m not sure if this is relevant.

What you think should happen instead?

All metrics, including operator_successes, operator_failures, and custom metrics, should be available as they were with StatsD.

How to reproduce

  • Migrate from StatsD to OpenTelemetry.
  • Send metrics to Splunk Observability.
  • Observe the missing metrics such as operator_successes, operator_failures, and custom metrics.

Operating System

Azure AKS linux

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

seifrajhi avatar Aug 28 '24 10:08 seifrajhi

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

boring-cyborg[bot] avatar Aug 28 '24 10:08 boring-cyborg[bot]

cc: @howardyoo

gopidesupavan avatar Oct 03 '24 20:10 gopidesupavan

Got the same thing on 2.10.2, but without migration from statsd. Tested locally - metrics in statsd are present, in otel many are missing, especially counters (ti_failures, ti_successed, zombies_killed etc.) The pattern with presence of tag breakdown from @seifrajhi seems to be relevant, but there are other losses, such as operator_failures_<operator_name>, for example.

Let me know if you need more information about the configuration, I'd be happy to help

Ibraitas avatar Nov 11 '24 13:11 Ibraitas

Got the same thing on 2.10.2, but without migration from statsd. Tested locally - metrics in statsd are present, in otel many are missing, especially counters (ti_failures, ti_successed, zombies_killed etc.) The pattern with presence of tag breakdown from @seifrajhi seems to be relevant, but there are other losses, such as operator_failures_<operator_name>, for example.

Let me know if you need more information about the configuration, I'd be happy to help

Just a question, are these metrics consistently missing, or randomly missing? Also, what were the names of the metrics? any examples? This could either be due to the metrics name being too longe and getting truncated, or if Airflow did not have enough time to emit the metrics before something terminated.

howardyoo avatar Nov 12 '24 02:11 howardyoo

Just a question, are these metrics consistently missing, or randomly missing? Also, what were the names of the metrics? any examples? This could either be due to the metrics name being too longe and getting truncated, or if Airflow did not have enough time to emit the metrics before something terminated.

I'll list a few with behaviors

  • ti_successes is missing completely;
  • ti.start is missing completely;
  • ti.finish is missing completely;
  • operator _successes, _failures are missing in any form;
  • executor_running_tasks does not catch fast executing tasks;
  • job_start is present as it is, not in <job_name>_start format and is always equal to 1, and _end is not present at all.

I tested locally on dag with one task, which is executed once a minute, stably successful. Just in case I attach screenshots of all airflow metrics I see in Prometheus

Screenshot 2024-11-12 at 11 38 23 Screenshot 2024-11-12 at 11 39 37 Screenshot 2024-11-12 at 11 40 13 Screenshot 2024-11-12 at 11 40 53 Screenshot 2024-11-12 at 11 41 11

Ibraitas avatar Nov 12 '24 08:11 Ibraitas

@ferruzzi , could you take a look at this? This is related to the issue that some of the metrics that stats is generating is not appearing when using OTEL.

howardyoo avatar Nov 13 '24 04:11 howardyoo

@dannyl1u is looking into it

ferruzzi avatar Nov 14 '24 18:11 ferruzzi

Can confirm I've been able to reproduce this bug in version 3.0.0

dannyl1u avatar Nov 14 '24 22:11 dannyl1u

@ferruzzi and I had a good look into this. We suspect the bug is caused by a new resource in SafeOtelLogger being created each time Stats.incr() etc. is called

https://github.com/apache/airflow/blob/5a0272c272e412e133c2031d237d05cf12a783ef/airflow/metrics/otel_logger.py#L402

Here Resource.create(attributes={HOST_NAME: get_hostname(), SERVICE_NAME: service_name}) is called, but we should be checking if a resource has already been created and to use the already created resource, instead of creating a new one each time (which "overwrites" the previously created resource)

@howardyoo would like to know your opinion on this

dannyl1u avatar Nov 21 '24 20:11 dannyl1u

((That last comment only applies to custom metrics in the DAG not getting emitted and would not explain "core" metrics being dropped))

ferruzzi avatar Nov 21 '24 20:11 ferruzzi

Hmm.. the Resource.create(..) should not be recreated every time there is an increase of counter, changes in gauges, or histograms. Resource attributes are meant to have the same lifespan of the tracer provider itself.

howardyoo avatar Nov 21 '24 21:11 howardyoo

Maybe a good way is to refactor the code such that resource attributes to be declared as global singleton and every get otel logger function is not recreating that all the time, since the resource attributes are const and should not alter.

howardyoo avatar Nov 21 '24 21:11 howardyoo

@howardyoo @ferruzzi

https://github.com/apache/airflow/pull/44268/files

I made get_otel_logger return a singleton instance (as above) of SafeOtelLogger, tried rerunning it and same bug still occurs. Is this what you meant?

dannyl1u avatar Nov 21 '24 21:11 dannyl1u

In that case, if the resource is not getting recreated, perhaps the cause of this bug is still not known..?Sent from my iPhoneOn Nov 21, 2024, at 3:48 PM, Danny Liu @.***> wrote: @howardyoo @ferruzzi https://github.com/apache/airflow/pull/44268/files I made get_otel_logger return a singleton instance (as above) of SafeOtelLogger, tried rerunning it and same bug still occurs. Is this what you meant?

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

howardyoo avatar Nov 21 '24 23:11 howardyoo

Any progress with this issue?

Yoni-Weisberg avatar Jan 01 '25 13:01 Yoni-Weisberg

Any progress with this issue?

You're welcome to ask for assignment and try to resolve it :) Airflow is based on contributions from the open community, so if currently no one is actively working on it - there will be no progress.

shahar1 avatar Jan 10 '25 09:01 shahar1

Hi all, I wanted to switch our Airflow to use OTEL and run into the same issue. I debugged into the issue and found:

  1. metrics like "ti.start" and "ti.finish" are exported in the worker context with label dag_id and task_id. The metrics are only available during the time where the task is running. Looks metrics are gone after task finished and then the metric is removed after ~ 5min from the OtelCollector. Maybe because the metrics like "ti.start" are exported in the worker context an the OtelLogger is gone if the worker task is finished?

  2. I´m not sure if singleton instance can solve the issue as Airflow is able to handle multiple workers, schedulers and .... For me it looks like the metrics which have the same labels are overwriting each other if they are exported by two different Pods (E.g. 2 workers). So not possible to increase a counter like "ti.start" with label "dag_id" and "task_id" if 2 dag_runs are running in parallel on different workers. Expecting: airflow_ti_start{dag_id="dag1", task_id="task1"} 2 but getting: airflow_ti_start{dag_id="dag1", task_id="task1"} 1

Two main issue to solve:

  1. Any idea how to make the OtelLogger static for one worker pod as the tasks are executed e.g. with the StandardTaskRunner, but then we still have to solve 2)?
  2. How to get trust able metrics in multi Pod deployments.

I write here to start the discussion again and get also some feedback from you all to improve the thinks. Not sure about the best way to fix this issue. We need to solve both points to get usable metrics via Otel.

Started already to improve Airflow Otel implementation a different point https://github.com/apache/airflow/pull/46510

AutomationDev85 avatar Feb 07 '25 13:02 AutomationDev85

Just adding more details to the issue:

  1. metrics like "ti.start" and "ti.finish" are exported in the worker context with label dag_id and task_id. The metrics are only available during the time where the task is running. Looks metrics are gone after task finished and then the metric is removed after ~ 5min from the OtelCollector. Maybe because the metrics like "ti.start" are exported in the worker context an the OtelLogger is gone if the worker task is finished?

I have tested this out, seems like this is the issue for some of the missing metrics.

I have a simple hello world task running with a PythonOperator. When running the task normally, no metric on a task level was appearing in the otel-collector logs and in my prometheus.

As a hello world task runs almost instantly there is no time for the otel-collector to grab this metrics and expose it. I added manually a sleep of 1 minute and then the metrics "ti.start" started to appear in both otel-collector logs and prometheus. The metrics of instance "ti.finish" were also correctly initialized with zero. The problem is: we will never be able to get the final task state metric updated because once the task is finalized the metrics are gone without giving time to otel-collector to grab and export those.

Not sure how the fix for this would look like.

PauloCarneiro99 avatar Feb 25 '25 09:02 PauloCarneiro99

I believe that's what was currently happening. In order to perhaps fix this issue, we might want to explore to jettison (or flush) out the metrics when the operator finishes, so that when the operation finishes, those metrics are not lost.

Another comment is that perhaps rather than monitoring when the task has started and finished using metrics, a better approach could be to monitor them using traces, since Airflow's trace implementation does not have this issue, and is much more natural way of monitoring DAG runs (this may not be relevant to the thread, but I've always believed traces are better way to observe something running)

On Tue, Feb 25, 2025 at 3:40 AM Paulo Andre de Oliveira Carneiro < @.***> wrote:

Just adding more details to the issue:

  1. metrics like "ti.start" and "ti.finish" are exported in the worker context with label dag_id and task_id. The metrics are only available during the time where the task is running. Looks metrics are gone after task finished and then the metric is removed after ~ 5min from the OtelCollector. Maybe because the metrics like "ti.start" are exported in the worker context an the OtelLogger is gone if the worker task is finished?

I have tested this out, seems like this is the issue for some of the missing metrics.

I have a simple hello world task running with a PythonOperator. When running the task normally, no metric on a task level was appearing in the otel-collector logs and in my prometheus.

As a hello world task runs almost instantly there is no time for the otel-collector to grab this metrics and expose it. I added manually a sleep of 1 minute and then the metrics "ti.start" started to appear in both otel-collector logs and prometheus. The metrics of instance "ti.finish" were also correctly initialized with zero https://github.com/apache/airflow/blob/42406162cac1d5f899dd22c91d9caa06b19956e5/airflow/models/taskinstance.py#L246. The problem is: we will never be able to get the final task state metric updated because once the task is finalized the metrics are gone without giving time to otel-collector to grab and export those.

Not sure how the fix for this would look like.

— Reply to this email directly, view it on GitHub https://github.com/apache/airflow/issues/41822#issuecomment-2681345475, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHZNLLWHRCQCNHKPWACJJCT2RQ3CTAVCNFSM6AAAAABNH7AQCOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMOBRGM2DKNBXGU . You are receiving this because you were mentioned.Message ID: @.***> [image: PauloCarneiro99]PauloCarneiro99 left a comment (apache/airflow#41822) https://github.com/apache/airflow/issues/41822#issuecomment-2681345475

Just adding more details to the issue:

  1. metrics like "ti.start" and "ti.finish" are exported in the worker context with label dag_id and task_id. The metrics are only available during the time where the task is running. Looks metrics are gone after task finished and then the metric is removed after ~ 5min from the OtelCollector. Maybe because the metrics like "ti.start" are exported in the worker context an the OtelLogger is gone if the worker task is finished?

I have tested this out, seems like this is the issue for some of the missing metrics.

I have a simple hello world task running with a PythonOperator. When running the task normally, no metric on a task level was appearing in the otel-collector logs and in my prometheus.

As a hello world task runs almost instantly there is no time for the otel-collector to grab this metrics and expose it. I added manually a sleep of 1 minute and then the metrics "ti.start" started to appear in both otel-collector logs and prometheus. The metrics of instance "ti.finish" were also correctly initialized with zero https://github.com/apache/airflow/blob/42406162cac1d5f899dd22c91d9caa06b19956e5/airflow/models/taskinstance.py#L246. The problem is: we will never be able to get the final task state metric updated because once the task is finalized the metrics are gone without giving time to otel-collector to grab and export those.

Not sure how the fix for this would look like.

— Reply to this email directly, view it on GitHub https://github.com/apache/airflow/issues/41822#issuecomment-2681345475, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHZNLLWHRCQCNHKPWACJJCT2RQ3CTAVCNFSM6AAAAABNH7AQCOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMOBRGM2DKNBXGU . You are receiving this because you were mentioned.Message ID: @.***>

howardyoo avatar Feb 25 '25 17:02 howardyoo

rather than monitoring when the task has started and finished using metrics, a better approach could be to monitor them using traces

@howardyoo mhm, switching all metrics to traces would be a huge re-write in the core and be a breaking change towards StatsD. I think this is not a easy fix and would change the strategy completely. Reading the comments I rather fear that we have a general conceptual problem is metrics are "just not working". And for Airflow we need to assume we have a distributed environment. Relying on singletons is... how are other applications implementing this? Airflow can not be the first using Otel?

jscheffl avatar Feb 27 '25 21:02 jscheffl

Airflow recently release support for Otel traces, which emits trace in otel, if you haven’t tried. It does not replace the metrics but adds additional details on how dags are running, so might be an option.Sent from my iPhoneOn Feb 27, 2025, at 3:36 PM, Jens Scheffler @.***> wrote:

rather than monitoring when the task has started and finished using metrics, a better approach could be to monitor them using traces

@howardyoo mhm, switching all metrics to traces would be a huge re-write in the core and be a breaking change towards StatsD. I think this is not a easy fix and would change the strategy completely. Reading the comments I rather fear that we have a general conceptual problem is metrics are "just not working". And for Airflow we need to assume we have a distributed environment. Relying on singletons is... how are other applications implementing this? Airflow can not be the first using Otel?—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

jscheffl left a comment (apache/airflow#41822)

rather than monitoring when the task has started and finished using metrics, a better approach could be to monitor them using traces

@howardyoo mhm, switching all metrics to traces would be a huge re-write in the core and be a breaking change towards StatsD. I think this is not a easy fix and would change the strategy completely. Reading the comments I rather fear that we have a general conceptual problem is metrics are "just not working". And for Airflow we need to assume we have a distributed environment. Relying on singletons is... how are other applications implementing this? Airflow can not be the first using Otel?

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

howardyoo avatar Feb 28 '25 00:02 howardyoo

I made a personal workaround so that people who already migrated to Otel could use this until it is officially fixed.

Cause of issue (as I understand)

Flushing from Airflow processes to OpenTelemetry collector relies on periodic flush (default 60s). There is no guaranteed force flush at the end of the process.

Scheduler or Celery worker processes are longlived, so they have no problem. However, task instance processes (subprocess of the worker process) are ephemeral and usually spans shorter than the periodic flush, so they might never flush once.

How to verify this is the cause?

I tested with this dummy DAG. if you activate sleep(), you can see the metrics in otelcol; if you remove it, you mostly cannot.

from airflow.stats import Stats

def dummy():
    Stats.gauge("dummy_gauge", 1, {"tag1": "value1", "tag2": "value2"})
    Stats.incr("dummy_incr", 1, {"tag3": "value3", "tag4": "value4"})
    # sleep(70)

with DAG(...):
    _ = PythonOperator(
        python_callable=dummy,
        task_id="dummy",
    )

Possible Workaround

Add a force-flush callback on every task instance.

# DAG(default_args=default_args, ...)
default_args = {
        "on_success_callback": flush_metrics,
        "on_failure_callback": flush_metrics,
        "on_retry_callback": flush_metrics,
        "on_skipped_callback": flush_metrics,
}

def flush_metrics(context: dict[str, Any]) -> None:
    def flush(provider: Any) -> None:
        if provider and hasattr(provider, "force_flush"):
            provider.force_flush()

    flush(metrics.get_meter_provider())
    flush(getattr(metrics.get_meter_provider(), "_real_meter_provider", None))

Same function should be also available via plugins.event_listener. (didn't try it by myself) https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/example_dags/plugins/event_listener/index.html#module-contents

zettelin avatar Jul 17 '25 00:07 zettelin

@xBis7 -> maybe you could take a look at that ? I think that might be nice continuation of your improvements on OTEL ?

potiuk avatar Jul 18 '25 16:07 potiuk

@potiuk Thanks for pinging me. I have a few patches related to metrics that I'm planning to contribute. This is interesting, I'll add it to my backlog.

xBis7 avatar Jul 19 '25 08:07 xBis7

Hello everyone, I've started looking into this.

I'm doing some testing with both statsd and OTel and so far I've found a lot of metrics that are missing from both. But I haven't found any that are present in statsd and missing from OTel.

It looks like it has to do with isolated and short-lived processes like the one running a task. Metrics from these don't get exported.

Update: I found metrics missing just from OTel and I can confirm what previous comments have already stated.

xBis7 avatar Nov 11 '25 18:11 xBis7

Yeah. thought so.

potiuk avatar Nov 16 '25 22:11 potiuk

Hello, everyone! We have noticed the same issue as reported here since June/25. I passed by only to add to the discussion, not that we could do something (won't be upgrading Airflow versions anytime soon).

We use Airflow 2.5.1 on k8s with statsd enabled for Datadog, but just some are registered. the "Airflow did not have enough time to emit the metrics before something terminated" hypothesis is strong and makes sense; is it an actual issue?

In our case, we have legacy Airflow DAGs that are roughly composed by boto3 EMR interactions for 1. cluster creations and 2. steps additions. Our working metrics' list differs from @Ibraitas' one up there. For the ones that arrive there (e.g., ti.finish), if the task fails fast enough, we don't get the measurement with state=failed, so our monitor end up not noticing the failure.

As we intend to migrate those DAGs in the next few months to another completely different thing, using long running tasks and sensors attached, it will mitigate the fail fast issue. I can come here to tell whether it fixed our issue or not. Still, I'd like to leave the question if the "fail fast" problem is known and which Airflow version fixes it.

Thank you!

paulochf avatar Nov 17 '25 19:11 paulochf

@paulochf You probably don't see all the metrics with statsd because 2.5.1 is a very old version. I think the earliest supported version at the moment that is also regularly tested for backwards compatibility, is 2.10.X.

I've been testing in 3.1.X and I don't see any problems with statsd. The issue is with OTel because the metrics are exported in batches and some subprocesses end so fast that there isn't enough time to flush them. statsd on the other hand, exports the metrics the moment that they are captured.

I would suggest for you to upgrade soon. The differences in the architecture are so many that it's like we are discussing different projects.

xBis7 avatar Nov 18 '25 16:11 xBis7