Align timers and timing metrics (ms) across all metrics loggers
Align timers and timing metrics (ms) across all metrics loggers closed: #39818
That's likely very wrong as it is heavily backwards compatible. It will suddenly change almost all dashboards and alerts people built using those metrics.
Metrics are definitely "public" API of Airflow and we cannot change them without breaking compatibility.
100% -- this will break all the operational dashboards for companies, users and service providers
That's likely very wrong as it is heavily backwards compatible. It will suddenly change almost all dashboards and alerts people built using those metrics. Metrics are definitely "public" API of Airflow and we cannot change them without breaking compatibility.
100% -- this will break all the operational dashboards for companies, users and service providers
Yes. This is a breaking change for customers (Datadog and OTEL). As, Statsd log the timer metrics in milli seconds. So, the metrics documentation is inconsistent unless we make the all the metrics loggers emit the same metric unit (milli seconds).
Yes. But making things consistent with docs is one thing, while breaking existing deployments that baase on "actual" behaviour is another. In this case changing the metrics is a problem and if we want to make documentattion and implementation consistent, we should only update the documentation (even if it means that some metrics will have different and inconsistent behaviour).
In this case - preserving current behaviour trumps consistency.
I think we could have a new parameter that will say "consistent" metrics (disabled by default) where this might get fixed and we should swap the default value with Airflow 3 (to be consistent by default). Simply metrics is one of those things that is undervalued by most people who develop Airlfow but have the potential of wreakiing huge havoc if we change their behaviour, it's heavily public API when you think of breaking effects it might have and probability it will cause problems if you change metric values effectively by 3 orders of magnitude.
Yeah, I'm with @potiuk and @kaxil here. We already ran around in circles on this topic in the previous seconds/milliseconds metrics PR just a couple weeks ago. It's a mess, but breaking everyone's dashboards isn't a viable solution IMHO.
To add to that, even if we do standardize them, why break two (Datadog and Otel) to match one (StatsD)? If we were going to do this, shouldn't we change the one that doesn't match the other two?
This feels like an Airflow3 "big breaking thing" to me.
Hi @potiuk , @kaxil , @ferruzzi, Thanks for comments. I have introduced a flag to enable the consistency across all the logger timer metrics. By default the consistency flag is disable. I have added the warning for the same to the customer. Can you re-review it once?
I think if we want to implement this consistency, it should be Airlfow 3 feature to have it on by default and possibly we should raise a future deprecation warning informing that it will change in Airflow 3 and that user can switch to consistent metrics by flipping that config. @ferruzzi - WDYT ?
Also if we do that - I think that should be added to our Airflow 3 discussions as a topic to discuss - https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3+Dev+call%3A+Meeting+Notes - > maybe you can add it it as command and make sure this is included in our Airflow 3 call at some point in time @dirrao ?
WHY: This is generally a serious deployment change for a number of users - but since it's a deployment and not DAG authoring change, it's likely a good candidate for one of the Airlfow 3 incompatibilities. But we now need to seriously plan what is in and out of Airflow 3 incompatibilities, and if our plan is to make a backwards-incompatible change in Airflow 3, we should warn our users before (via deprecations) and make sure this is captured in our Airflow 3 discussions.
I think it would be good to have someone who will be "championing" that - i.e. making sure it's remembered, properly docuemented, good warnings raised and people are aware of this as incompatibility - each of the "workstreams" there will have a lead/champion so having someone who will be advocating and leading that through Airlfow 3 transition (with all the consequences/documentation/making sure that it's properly communicated to our users and that there is a transition plam, including some guidelines of what our users should do) is a good idea.
I think if we want to implement this consistency, it should be Airlfow 3 feature to have it on by default and possibly we should raise a future deprecation warning informing that it will change in Airflow 3 and that user can switch to consistent metrics by flipping that config. @ferruzzi - WDYT ?
Good idea. I have updated the warning. So, users can take a call on it when they switching to Airflow 3.
warning: "Timer and timing metrics publish in seconds were deprecated. It is enabled by default from Airflow 3 onwards. Enable metrics consistency to publish all the timer and timing metrics in milliseconds."
Also if we do that - I think that should be added to our Airflow 3 discussions as a topic to discuss - https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3+Dev+call%3A+Meeting+Notes - > maybe you can add it it as command and make sure this is included in our Airflow 3 call at some point in time @dirrao ?
I don't have the access to wiki. I will remind you on this.
I don't have the access to wiki. I will remind you on this.
If you want access, send me your email in Slack
Hi @potiuk, @ferruzzi I have updated the code as per the previous comments. If you can cross check and confirm the same, then I will update the test cases for the same.
looks gpod (sorry for delay)
@potiuk I am seeing this deprecation warning everywhere. As it is being triggered during initialization. is there any place where we can add it ignore it for the all the test cases? is there any better way to handle this?
________________________________________________________________________________ ERROR at setup of TestJdbcHook.test_suppress_and_warn_when_raised_exception_is_not_suppressed ________________________________________________________________________________
request = <SubRequest 'initialize_airflow_tests' for <Function test_jdbc_conn_connection>>
@pytest.fixture(autouse=True, scope="session")
def initialize_airflow_tests(request):
"""Helper that setups Airflow testing environment."""
print(" AIRFLOW ".center(60, "="))
# Setup test environment for breeze
home = os.path.expanduser("~")
airflow_home = os.environ.get("AIRFLOW_HOME") or os.path.join(home, "airflow")
print(f"Home of the user: {home}\nAirflow home {airflow_home}")
# Initialize Airflow db if required
lock_file = os.path.join(airflow_home, ".airflow_db_initialised")
if not skip_db_tests:
if request.config.option.db_init:
print("Initializing the DB - forced with --with-db-init switch.")
initial_db_init()
elif not os.path.exists(lock_file):
print(
"Initializing the DB - first time after entering the container.\n"
"You can force re-initialization the database by adding --with-db-init switch to run-tests."
)
> initial_db_init()
tests/conftest.py:383:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/conftest.py:346: in initial_db_init
from tests.test_utils.compat import AIRFLOW_V_2_8_PLUS, AIRFLOW_V_2_10_PLUS
tests/test_utils/compat.py:28: in <module>
from airflow.models import Connection, Operator
airflow/models/__init__.py:79: in __getattr__
val = import_string(f"{path}.{name}")
airflow/utils/module_loading.py:39: in import_string
module = import_module(module_path)
/usr/local/lib/python3.8/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
airflow/models/operator.py:22: in <module>
from airflow.models.baseoperator import BaseOperator
airflow/models/baseoperator.py:85: in <module>
from airflow.models.taskinstance import TaskInstance, clear_task_instances
airflow/models/taskinstance.py:103: in <module>
from airflow.models.xcom import LazyXComSelectSequence, XCom
airflow/models/xcom.py:52: in <module>
from airflow.utils.json import XComDecoder, XComEncoder
airflow/utils/json.py:27: in <module>
from airflow.serialization.serde import CLASSNAME, DATA, SCHEMA_ID, deserialize, serialize
airflow/serialization/serde.py:406: in <module>
_register()
airflow/serialization/serde.py:391: in _register
_stringifiers[c] = name
airflow/metrics/protocols.py:112: in __exit__
self.stop()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <airflow.metrics.protocols.Timer object at 0xffff73988370>, send = True
def stop(self, send: bool = True) -> None:
"""Stop the timer, and optionally send it to stats backend."""
if self._start_time is not None:
if metrics_consistency_on:
self.duration = 1000.0 * (time.perf_counter() - self._start_time) # Convert to milliseconds.
else:
> warnings.warn(
"Timer and timing metrics publish in seconds were deprecated. It is enabled by default from Airflow 3 onwards. Enable metrics consistency to publish all the timer and timing metrics in milliseconds.",
AirflowProviderDeprecationWarning,
stacklevel=2,
)
E airflow.exceptions.AirflowProviderDeprecationWarning: Timer and timing metrics publish in seconds were deprecated. It is enabled by default from Airflow 3 onwards. Enable metrics consistency to publish all the timer and timing metrics in milliseconds.
airflow/metrics/protocols.py:127: AirflowProviderDeprecationWarning
Hmm.. seing it - I am not sure if warning is a good idea. The thing with warnings is that you should have an "easy" way to get rid of the warning - and changing your dashboard/metrics is likely not an easy task - and I think one of the worst cases with warnings is to not allow people to silence them easily (by fixing them) . I am just thiking whether we should allow a third state of the switch ("legacy" ?) to allow the users to deliberately set it and silence the warning ?
BTW. The way how to treat warnings in tests is described in the docs: https://github.com/apache/airflow/blob/main/contributing-docs/testing/unit_tests.rst#handling-warnings
I have updated the unit tests for the changes. I tried adding the ignore warning in pyproject.toml file. However it's not working.
'ignore:Timer and timing metrics publish in seconds were deprecated. It is enabled by default from Airflow 3 onwards. Enable metrics consistency to publish all the timer and timing metrics in milliseconds.'
I'd say with Airflow 3 work started - we should hold on with "fixing" it in Airflow 2 and simply follow whatever we decide on for Airflow 3.
That needs someone to take a lead on driviing to consensus and agreeing our metrics/telemetry strategy for Airflow 3:
- Are we going to stay with Statsd or OTEL only (I.e. is there something missing from OTEL that will make us stay with statsd)
- If we are going to stay with statsd - should we align all the timers and "break" compatibility
I think what @shubham22 mentioned at last dev call they are going to verify if that is something that they heavily would rely on and it will be discussed further at the dev calls.
I'm fine with millis. My main problem is standardizing on the statsd and breaking everyone else. In fact, after the last Airflow 3 dev call where they were discussing dropping statsd entirely in favor of OTel, I'll double down on that objection. We should not be basing this decision on how statsd handles it, especially if we may not even be using statsd in the codebase in a few months.
@potiuk / @ferruzzi / @kaxil Can someone let us know the latest update on telemetry track? Can we resume this PR?
See discussions we have about OTEL at devlist - both separate discussion by Dennis and summary of the devcal discussion where we discussed it.
and see this one as well. https://github.com/apache/airflow/issues/40800
OTel will be the default starting in 3.0, StatsD will be included but secondary/backup/legacy. Given that context, Here's what I think:
This can be added to the 3.0 changes and:
- in 3.0 all timers will emit in millis
- the "metrics consistency" flag should default to "on" and let users flip it back to legacy if they need time to migrate, and
- add a deprecation path to that consistency flag so we can eventually remove the option from the config (in 3.1?) and clean up the code.
@ferruzzi I have enabled the consistency flag on by default. Please take a look at the PR once.
remove the option from the config (in 3.1?)
I think in 4.0 - but also I think we will get to much faster "MAJOR" release cycle. We have not decided it yet, but we might want to do just that and say release new Airflow MAJOR release every year ?
Hi @kaxil / @ferruzzi, I'm awaiting final approval to proceed with the merge. Could you please take a look when you get a chance?