airflow icon indicating copy to clipboard operation
airflow copied to clipboard

Remove StatsD and replace it with Open Telemetry as first-class citizen

Open kaxil opened this issue 1 year ago • 11 comments

Currently, Apache Airflow uses StatsD for metrics collection and monitoring. To modernize our observability stack and align with industry standards, we should remove StatsD and adopt OpenTelemetry as the primary metrics collection and monitoring tool for Airflow 3. This is made easier with AIP-49 Open Telemetry Support.

Backward Compatibility / Migration

Most of the enterprise Airflow deployments rely on StatsD for monitoring & alerting so we should make sure that there is a smooth migration path. Ideally, all the StatsD metrics mentioned in https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html#metric-descriptions should have a like-by-like replacement.

High-level Objectives:

  1. Remove StatsD Integration
  2. Use OpenTelemetry as the default metrics collection tool. This includes setting up configurations, dependencies, and code integration.
  3. Documentation Update: Update the official documentation to reflect the change from StatsD to OpenTelemetry, including setup, configuration, and usage instructions. Include migration guide for users transitioning from StatsD to OpenTelemetry if there are breaking changes.
  4. Backward Compatibility: Ensure that any existing workflows dependent on StatsD are either seamlessly transitioned to OpenTelemetry or provide clear migration guidelines.

kaxil avatar Jul 15 '24 18:07 kaxil

Hi @kaxil I'm interested in this issue, may I have a try for this, thanks!

josix avatar Jul 16 '24 16:07 josix

Awesome, assigned it to you

kaxil avatar Jul 17 '24 19:07 kaxil

Hi @josix Let me know if you need any help.

dirrao avatar Jul 18 '24 12:07 dirrao

also provide a simple grafana dashboard in the example stack ( cause the main airflow grafana dashboard open source is based on statsd ) would help a lot

raphaelauv avatar Jul 18 '24 13:07 raphaelauv

@howardyoo is also interested in working on this. He led AIP-49, so @josix if you need help please let Howard no.

@howardyoo is equally interested in leading this effort too

kaxil avatar Jul 25 '24 13:07 kaxil

Yeah, I just checked out the AIP and the sharing in the Airflow Submit these days, and currently worked on studying the codebase around StatsD and OTel. I believe it would be a better choice for @howardyoo to lead this topic 🙂 Please feel free to unassign me and assign sub-items to me if possible, thanks for the coordination.

josix avatar Jul 25 '24 14:07 josix

I am interested too in this. Please assign me any tasks needed to be done. I am working on similar lines on this in this PR https://github.com/apache/airflow/pull/39908

dirrao avatar Jul 25 '24 15:07 dirrao

@howardyoo Could you add a comment, please? GitHub doesn't allow me to assign you an issue until you have commented on it since you aren't part of the Apache org

kaxil avatar Jul 25 '24 17:07 kaxil

@howardyoo Could you add a comment, please? GitHub doesn't allow me to assign you an issue until you have commented on it since you aren't part of the Apache org

Ok, done!

howardyoo avatar Jul 25 '24 22:07 howardyoo

Awesome, assigned it to you.

kaxil avatar Jul 26 '24 08:07 kaxil

Oh, I guess the above mentioned PR draft got released, which is nice - but looks like it may be on hold for now.

howardyoo avatar Aug 28 '24 17:08 howardyoo

Will StatsD support be dropped with AF 3 or will OTEL become primary for AF 3 with StatsD as an alternative/backup for N versions to ensure there is a smooth and safe migration path to OTEL?

chomipi88 avatar Sep 04 '24 15:09 chomipi88

Will StatsD support be dropped with AF 3 or will OTEL become primary for AF 3 with StatsD as an alternative/backup for N versions to ensure there is a smooth and safe migration path to OTEL?

My opinion is not to drop statsD support immediately, but have OTEL switch over to statsD as primary, and statsD can still be alternative/backup.

howardyoo avatar Sep 04 '24 15:09 howardyoo

Will StatsD support be dropped with AF 3 or will OTEL become primary for AF 3 with StatsD as an alternative/backup for N versions to ensure there is a smooth and safe migration path to OTEL?

My opinion is not to drop statsD support immediately, but have OTEL switch over to statsD as primary, and statsD can still be alternative/backup.

Ok but what are your arguments in favor of this policy?

raphaelauv avatar Sep 04 '24 15:09 raphaelauv

Will StatsD support be dropped with AF 3 or will OTEL become primary for AF 3 with StatsD as an alternative/backup for N versions to ensure there is a smooth and safe migration path to OTEL?

My opinion is not to drop statsD support immediately, but have OTEL switch over to statsD as primary, and statsD can still be alternative/backup.

Ok but what are your arguments in favor of this policy?

My favor to this policy, is that

  • Opentelemetry has become quite stable and mature, and is vendor neural way of instrumenting softwares
  • StatsD only covers metrics, while Opentelemetry covers metrics, traces, and logs
  • StatsD has limitations in terms of number of attributes, or metric names, while Opentelemetry is less limited
  • Opentelemetry also provides opentelemetry collector which can further be used to control how the telemetry could be transferred.

howardyoo avatar Sep 04 '24 23:09 howardyoo

Yes otel is better ,but why keep statsd ?

raphaelauv avatar Sep 05 '24 05:09 raphaelauv

I was originally promoting the idea of attempting to use otel metrics to provide backwards compatibility with StatsD (as an option). I think at the last call @ferruzzi mentioned that I was the proponent of replacing StatsD with OpenTelemetry, but in fact it was a bit conditional - let's use single "producing" mechanism (OTEL) but let's make it produce statsd-compatible metrics so that existing dashboards continue to work.

I think there are certain limitations (metrics name length for one) that make it either difficult or impossible or at the very least imperfect. So I am in favour of getting OTEL as "default" interface, but leaving (at least for a long while) statsd as an optional legacy mechanism - available to those who already monitor their airflow with statsd.

The thing is that it's not very difficult to keep both. Metrics change rarely, but most importantly - keeping statsd metrics in the code (since those are only metrics and we keep all the data to produce for OTEL anyway) is not a big overhead.

Yes, it introduces some duplication and possibly the OTEL vs. STATSD metrics will - over time - diverge even more, but that's quite fine. If we focus on improving and making OTEL metrics more useful while treat Statsd as legacy, read-only that we do not actively maintain, this will make OTEL even more attractive over time and we could even recommend people - if you want to get better / improved metrics go OTEL if you are stil using statsd. But we should give people more time to do so and let them do it independently from the decision on migration to Airflow 3.

I see statsd as a way how we can improve adoption of Airflow 3. If we expect future Airflow 3 users to migrate quickly, adding them "yet-another-thing-to-learn-and-configure" gives compounding effect on the difficulty of migrateion and might be a factor where they decide to defer migration decision. And since it does not cost us a lot of maintenance - this is a "cheap" overhead.

potiuk avatar Sep 05 '24 08:09 potiuk

I was originally promoting the idea of attempting to use otel metrics to provide backwards compatibility with StatsD (as an option). I think at the last call @ferruzzi mentioned that I was the proponent of replacing StatsD with OpenTelemetry, but in fact it was a bit conditional - let's use single "producing" mechanism (OTEL) but let's make it produce statsd-compatible metrics so that existing dashboards continue to work.

I think there are certain limitations (metrics name length for one) that make it either difficult or impossible or at the very least imperfect. So I am in favour of getting OTEL as "default" interface, but leaving (at least for a long while) statsd as an optional legacy mechanism - available to those who already monitor their airflow with statsd.

The thing is that it's not very difficult to keep both. Metrics change rarely, but most importantly - keeping statsd metrics in the code (since those are only metrics and we keep all the data to produce for OTEL anyway) is not a big overhead.

Yes, it introduces some duplication and possibly the OTEL vs. STATSD metrics will - over time - diverge even more, but that's quite fine. If we focus on improving and making OTEL metrics more useful while treat Statsd as legacy, read-only that we do not actively maintain, this will make OTEL even more attractive over time and we could even recommend people - if you want to get better / improved metrics go OTEL if you are stil using statsd. But we should give people more time to do so and let them do it independently from the decision on migration to Airflow 3.

I see statsd as a way how we can improve adoption of Airflow 3. If we expect future Airflow 3 users to migrate quickly, adding them "yet-another-thing-to-learn-and-configure" gives compounding effect on the difficulty of migrateion and might be a factor where they decide to defer migration decision. And since it does not cost us a lot of maintenance - this is a "cheap" overhead.

@potiuk beat me into responding! but yes, the overhead is little, statsd had been the default choice for years, so maintaining backward compatibility is still important, yet now users who are using statsd could experiment and adopt otel as their next telemetry while they can still use what they're used to would be my opinion.

howardyoo avatar Sep 05 '24 15:09 howardyoo

I think at the last call @ferruzzi mentioned that I was the proponent of replacing StatsD with OpenTelemetry, but in fact it was a bit conditional

My apologies for misrepresenting your stand on that. I shouldn't have spoken for you.

I have mentioned my solution in the past but perhaps I need to codify it, or open an Issue and let someone else tackle it. Currently there are three metrics backends (called loggers) StasD, OTel, and Datadog (which as far as I can tell is a variant on StatsD which supports tagging). The biggest compatibility issue is that OTel has a much shorter max name length (63 characters), but supports tagging. StatsD has a much longer (300 character?) name length limit but does not support tagging.

Solution:

Add a "generate name" abstract method to base metrics logger, something like:

def get_name(name: str, tags:dict) -> str:
    ...

Each logger will then need to implement this and call it before actually emitting the metric. For otel, it would just return name. for StatsD it would return the name with the tags dict concatenated onto it, something more or less like this should work but there's likely a better way

def get_name(name: str, tags:dict) -> str:
    suffix = str(tags)[1:-1] # drop the open and close curly braces
        .replace(": ", "_") # replace the key/value delimiter with an underscore
        .replace(", ", "_") # replace the comma between keys/value pairs with an underscore
        .replace("'", "")   # remove the quotes around the key name 
    
    return f"{name}_{suffix}"

Then we can stop double-emitting, remove the "we have to truncate this name" name length warning, and let each logger handle naming errors however it sees fit.

[EDIT] shorter but nowhere near as clear statsd implementation

def get_name(name: str, tags:dict) -> str:
    return f'{name}_{("_").join([f"{k}_{v}" for (k, v) in tags.items()])}'

ferruzzi avatar Sep 05 '24 17:09 ferruzzi

My opinion is not to drop statsD support immediately, but have OTEL switch over to statsD as primary, and statsD can still be alternative/backup.

Yup, I agree with this

kaxil avatar Sep 05 '24 22:09 kaxil

I think at the last call @ferruzzi mentioned that I was the proponent of replacing StatsD with OpenTelemetry, but in fact it was a bit conditional

My apologies for misrepresenting your stand on that. I shouldn't have spoken for you.

I have mentioned my solution in the past but perhaps I need to codify it, or open an Issue and let someone else tackle it. Currently there are three metrics backends (called loggers) StasD, OTel, and Datadog (which as far as I can tell is a variant on StatsD which supports tagging). The biggest compatibility issue is that OTel has a much shorter max name length (63 characters), but supports tagging. StatsD has a much longer (300 character?) name length limit but does not support tagging.

Solution:

Add a "generate name" abstract method to base metrics logger, something like:

def get_name(name: str, tags:dict) -> str:
    ...

Each logger will then need to implement this and call it before actually emitting the metric. For otel, it would just return name. for StatsD it would return the name with the tags dict concatenated onto it, something more or less like this should work but there's likely a better way

def get_name(name: str, tags:dict) -> str:
    suffix = str(tags)[1:-1] # drop the open and close curly braces
        .replace(": ", "_") # replace the key/value delimiter with an underscore
        .replace(", ", "_") # replace the comma between keys/value pairs with an underscore
        .replace("'", "")   # remove the quotes around the key name 
    
    return f"{name}_{suffix}"

Then we can stop double-emitting, remove the "we have to truncate this name" name length warning, and let each logger handle naming errors however it sees fit.

[EDIT] shorter but nowhere near as clear statsd implementation

def get_name(name: str, tags:dict) -> str:
    return f'{name}_{("_").join([f"{k}_{v}" for (k, v) in tags.items()])}'

Hi @ferruzzi ,i don't know if you see this , but opentelemetry specification goes from 63 characters to 255 https://github.com/open-telemetry/opentelemetry-specification/pull/3648 . With OpenTelemetry configuration i got a lot of truncate metrics . Specially those with the dag name inside . it's very difficult to predict the metrics name and so make observability dashboard automation

exceeds OpenTelemetry's name length limit of 63 characters and will be truncated to

dthauvin avatar Nov 18 '24 10:11 dthauvin

Thanks for pointing it out. We have a fix for it coming in https://github.com/apache/airflow/pull/43340

ferruzzi avatar Nov 18 '24 17:11 ferruzzi

I would really love to have otel as default to replace StatsD but as also discussed in #41822 as well as in https://lists.apache.org/thread/3vnktmzh02mxwbf06p3cdhnr18x6kcl5 we must resolve the basic problems of otel first. It is a clear must to be able to run otel in a distributed environment with "correct" metrics (counter/gauges). Even if otel has grater features the basics must work.

jscheffl avatar Mar 20 '25 22:03 jscheffl

I agree. TLDR context here: when I implemented it, I did not do it in a way that accounts for multiple workers. Essentially each worker (and scheduler, etc) is keeping track of their own metrics and then it's a race condition on which get emitted. We need a central "collection agency" they can all report to which collates all the metrics.

ferruzzi avatar Mar 20 '25 22:03 ferruzzi