airflow
airflow copied to clipboard
Remove StatsD and replace it with Open Telemetry as first-class citizen
Currently, Apache Airflow uses StatsD for metrics collection and monitoring. To modernize our observability stack and align with industry standards, we should remove StatsD and adopt OpenTelemetry as the primary metrics collection and monitoring tool for Airflow 3. This is made easier with AIP-49 Open Telemetry Support.
Backward Compatibility / Migration
Most of the enterprise Airflow deployments rely on StatsD for monitoring & alerting so we should make sure that there is a smooth migration path. Ideally, all the StatsD metrics mentioned in https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html#metric-descriptions should have a like-by-like replacement.
High-level Objectives:
- Remove StatsD Integration
- Use OpenTelemetry as the default metrics collection tool. This includes setting up configurations, dependencies, and code integration.
- Documentation Update: Update the official documentation to reflect the change from StatsD to OpenTelemetry, including setup, configuration, and usage instructions. Include migration guide for users transitioning from StatsD to OpenTelemetry if there are breaking changes.
- Backward Compatibility: Ensure that any existing workflows dependent on StatsD are either seamlessly transitioned to OpenTelemetry or provide clear migration guidelines.
Hi @kaxil I'm interested in this issue, may I have a try for this, thanks!
Awesome, assigned it to you
Hi @josix Let me know if you need any help.
also provide a simple grafana dashboard in the example stack ( cause the main airflow grafana dashboard open source is based on statsd ) would help a lot
@howardyoo is also interested in working on this. He led AIP-49, so @josix if you need help please let Howard no.
@howardyoo is equally interested in leading this effort too
Yeah, I just checked out the AIP and the sharing in the Airflow Submit these days, and currently worked on studying the codebase around StatsD and OTel. I believe it would be a better choice for @howardyoo to lead this topic 🙂 Please feel free to unassign me and assign sub-items to me if possible, thanks for the coordination.
I am interested too in this. Please assign me any tasks needed to be done. I am working on similar lines on this in this PR https://github.com/apache/airflow/pull/39908
@howardyoo Could you add a comment, please? GitHub doesn't allow me to assign you an issue until you have commented on it since you aren't part of the Apache org
@howardyoo Could you add a comment, please? GitHub doesn't allow me to assign you an issue until you have commented on it since you aren't part of the Apache org
Ok, done!
Awesome, assigned it to you.
Oh, I guess the above mentioned PR draft got released, which is nice - but looks like it may be on hold for now.
Will StatsD support be dropped with AF 3 or will OTEL become primary for AF 3 with StatsD as an alternative/backup for N versions to ensure there is a smooth and safe migration path to OTEL?
Will StatsD support be dropped with AF 3 or will OTEL become primary for AF 3 with StatsD as an alternative/backup for N versions to ensure there is a smooth and safe migration path to OTEL?
My opinion is not to drop statsD support immediately, but have OTEL switch over to statsD as primary, and statsD can still be alternative/backup.
Will StatsD support be dropped with AF 3 or will OTEL become primary for AF 3 with StatsD as an alternative/backup for N versions to ensure there is a smooth and safe migration path to OTEL?
My opinion is not to drop statsD support immediately, but have OTEL switch over to statsD as primary, and statsD can still be alternative/backup.
Ok but what are your arguments in favor of this policy?
Will StatsD support be dropped with AF 3 or will OTEL become primary for AF 3 with StatsD as an alternative/backup for N versions to ensure there is a smooth and safe migration path to OTEL?
My opinion is not to drop statsD support immediately, but have OTEL switch over to statsD as primary, and statsD can still be alternative/backup.
Ok but what are your arguments in favor of this policy?
My favor to this policy, is that
- Opentelemetry has become quite stable and mature, and is vendor neural way of instrumenting softwares
- StatsD only covers metrics, while Opentelemetry covers metrics, traces, and logs
- StatsD has limitations in terms of number of attributes, or metric names, while Opentelemetry is less limited
- Opentelemetry also provides opentelemetry collector which can further be used to control how the telemetry could be transferred.
Yes otel is better ,but why keep statsd ?
I was originally promoting the idea of attempting to use otel metrics to provide backwards compatibility with StatsD (as an option). I think at the last call @ferruzzi mentioned that I was the proponent of replacing StatsD with OpenTelemetry, but in fact it was a bit conditional - let's use single "producing" mechanism (OTEL) but let's make it produce statsd-compatible metrics so that existing dashboards continue to work.
I think there are certain limitations (metrics name length for one) that make it either difficult or impossible or at the very least imperfect. So I am in favour of getting OTEL as "default" interface, but leaving (at least for a long while) statsd as an optional legacy mechanism - available to those who already monitor their airflow with statsd.
The thing is that it's not very difficult to keep both. Metrics change rarely, but most importantly - keeping statsd metrics in the code (since those are only metrics and we keep all the data to produce for OTEL anyway) is not a big overhead.
Yes, it introduces some duplication and possibly the OTEL vs. STATSD metrics will - over time - diverge even more, but that's quite fine. If we focus on improving and making OTEL metrics more useful while treat Statsd as legacy, read-only that we do not actively maintain, this will make OTEL even more attractive over time and we could even recommend people - if you want to get better / improved metrics go OTEL if you are stil using statsd. But we should give people more time to do so and let them do it independently from the decision on migration to Airflow 3.
I see statsd as a way how we can improve adoption of Airflow 3. If we expect future Airflow 3 users to migrate quickly, adding them "yet-another-thing-to-learn-and-configure" gives compounding effect on the difficulty of migrateion and might be a factor where they decide to defer migration decision. And since it does not cost us a lot of maintenance - this is a "cheap" overhead.
I was originally promoting the idea of attempting to use
otelmetrics to provide backwards compatibility with StatsD (as an option). I think at the last call @ferruzzi mentioned that I was the proponent of replacing StatsD with OpenTelemetry, but in fact it was a bit conditional - let's use single "producing" mechanism (OTEL) but let's make it produce statsd-compatible metrics so that existing dashboards continue to work.I think there are certain limitations (metrics name length for one) that make it either difficult or impossible or at the very least imperfect. So I am in favour of getting OTEL as "default" interface, but leaving (at least for a long while) statsd as an optional legacy mechanism - available to those who already monitor their airflow with statsd.
The thing is that it's not very difficult to keep both. Metrics change rarely, but most importantly - keeping statsd metrics in the code (since those are only metrics and we keep all the data to produce for OTEL anyway) is not a big overhead.
Yes, it introduces some duplication and possibly the OTEL vs. STATSD metrics will - over time - diverge even more, but that's quite fine. If we focus on improving and making OTEL metrics more useful while treat Statsd as legacy, read-only that we do not actively maintain, this will make OTEL even more attractive over time and we could even recommend people - if you want to get better / improved metrics go OTEL if you are stil using statsd. But we should give people more time to do so and let them do it independently from the decision on migration to Airflow 3.
I see statsd as a way how we can improve adoption of Airflow 3. If we expect future Airflow 3 users to migrate quickly, adding them "yet-another-thing-to-learn-and-configure" gives compounding effect on the difficulty of migrateion and might be a factor where they decide to defer migration decision. And since it does not cost us a lot of maintenance - this is a "cheap" overhead.
@potiuk beat me into responding! but yes, the overhead is little, statsd had been the default choice for years, so maintaining backward compatibility is still important, yet now users who are using statsd could experiment and adopt otel as their next telemetry while they can still use what they're used to would be my opinion.
I think at the last call @ferruzzi mentioned that I was the proponent of replacing StatsD with OpenTelemetry, but in fact it was a bit conditional
My apologies for misrepresenting your stand on that. I shouldn't have spoken for you.
I have mentioned my solution in the past but perhaps I need to codify it, or open an Issue and let someone else tackle it. Currently there are three metrics backends (called loggers) StasD, OTel, and Datadog (which as far as I can tell is a variant on StatsD which supports tagging). The biggest compatibility issue is that OTel has a much shorter max name length (63 characters), but supports tagging. StatsD has a much longer (300 character?) name length limit but does not support tagging.
Solution:
Add a "generate name" abstract method to base metrics logger, something like:
def get_name(name: str, tags:dict) -> str:
...
Each logger will then need to implement this and call it before actually emitting the metric. For otel, it would just return name. for StatsD it would return the name with the tags dict concatenated onto it, something more or less like this should work but there's likely a better way
def get_name(name: str, tags:dict) -> str:
suffix = str(tags)[1:-1] # drop the open and close curly braces
.replace(": ", "_") # replace the key/value delimiter with an underscore
.replace(", ", "_") # replace the comma between keys/value pairs with an underscore
.replace("'", "") # remove the quotes around the key name
return f"{name}_{suffix}"
Then we can stop double-emitting, remove the "we have to truncate this name" name length warning, and let each logger handle naming errors however it sees fit.
[EDIT] shorter but nowhere near as clear statsd implementation
def get_name(name: str, tags:dict) -> str:
return f'{name}_{("_").join([f"{k}_{v}" for (k, v) in tags.items()])}'
My opinion is not to drop statsD support immediately, but have OTEL switch over to statsD as primary, and statsD can still be alternative/backup.
Yup, I agree with this
I think at the last call @ferruzzi mentioned that I was the proponent of replacing StatsD with OpenTelemetry, but in fact it was a bit conditional
My apologies for misrepresenting your stand on that. I shouldn't have spoken for you.
I have mentioned my solution in the past but perhaps I need to codify it, or open an Issue and let someone else tackle it. Currently there are three metrics backends (called loggers) StasD, OTel, and Datadog (which as far as I can tell is a variant on StatsD which supports tagging). The biggest compatibility issue is that OTel has a much shorter max name length (63 characters), but supports tagging. StatsD has a much longer (300 character?) name length limit but does not support tagging.
Solution:
Add a "generate name" abstract method to base metrics logger, something like:
def get_name(name: str, tags:dict) -> str: ...Each logger will then need to implement this and call it before actually emitting the metric. For otel, it would just return name. for StatsD it would return the name with the tags dict concatenated onto it, something more or less like this should work but there's likely a better way
def get_name(name: str, tags:dict) -> str: suffix = str(tags)[1:-1] # drop the open and close curly braces .replace(": ", "_") # replace the key/value delimiter with an underscore .replace(", ", "_") # replace the comma between keys/value pairs with an underscore .replace("'", "") # remove the quotes around the key name return f"{name}_{suffix}"Then we can stop double-emitting, remove the "we have to truncate this name" name length warning, and let each logger handle naming errors however it sees fit.
[EDIT] shorter but nowhere near as clear statsd implementation
def get_name(name: str, tags:dict) -> str: return f'{name}_{("_").join([f"{k}_{v}" for (k, v) in tags.items()])}'
Hi @ferruzzi ,i don't know if you see this , but opentelemetry specification goes from 63 characters to 255 https://github.com/open-telemetry/opentelemetry-specification/pull/3648 . With OpenTelemetry configuration i got a lot of truncate metrics . Specially those with the dag name inside . it's very difficult to predict the metrics name and so make observability dashboard automation
exceeds OpenTelemetry's name length limit of 63 characters and will be truncated to
Thanks for pointing it out. We have a fix for it coming in https://github.com/apache/airflow/pull/43340
I would really love to have otel as default to replace StatsD but as also discussed in #41822 as well as in https://lists.apache.org/thread/3vnktmzh02mxwbf06p3cdhnr18x6kcl5 we must resolve the basic problems of otel first. It is a clear must to be able to run otel in a distributed environment with "correct" metrics (counter/gauges). Even if otel has grater features the basics must work.
I agree. TLDR context here: when I implemented it, I did not do it in a way that accounts for multiple workers. Essentially each worker (and scheduler, etc) is keeping track of their own metrics and then it's a race condition on which get emitted. We need a central "collection agency" they can all report to which collates all the metrics.