airflow icon indicating copy to clipboard operation
airflow copied to clipboard

Fix dag task scheduled and queued duration metrics

Open htpawel opened this issue 1 year ago • 30 comments


From statsd documentation (calling-timing-manually):


# Pass milliseconds directly

start = time.time()
time.sleep(3)
# You must convert to milliseconds:
dt = int((time.time() - start) * 1000)
statsd.timing('slept', dt)

dt int time must be in milliseconds rather than seconds. You can check other examples in airflow code also. Then statsd lib converts it to seconds before exporting metric.

htpawel avatar Mar 06 '24 12:03 htpawel

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst) Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits. Apache Airflow is a community-driven project and together we are making it better 🚀. In case of doubts contact the developers at: Mailing List: [email protected] Slack: https://s.apache.org/airflow-slack

boring-cyborg[bot] avatar Mar 06 '24 12:03 boring-cyborg[bot]

cc: @ferruzzi @Bowrna -> I remember there were discussions about it in the past . Did we decide to leave it in s for back-compatibility ?

potiuk avatar Mar 10 '24 19:03 potiuk

@potiuk @dirrao I searched entire airflow 2.7.1 code for Stats.timing and except this place only we are providing delta correctly. Only here we are calling .total_seconds() on delta which is incorrect. Current metrics documentation states that those two metrics are in seconds, but they are not. Right now to have useful data (in seconds) you need to multiply metric value by 1000 - which is obvious bug and mismatch between documentation and reality and should be fixed asap imo.

htpawel avatar Mar 11 '24 12:03 htpawel

@potiuk @dirrao I searched entire airflow 2.7.1 code for Stats.timing and except this place only we are providing delta correctly. Only here we are calling .total_seconds() on delta which is incorrect. Current metrics documentation states that those two metrics are in seconds, but they are not. Right now to have useful data (in seconds) you need to multiply metric value by 1000 - which is obvious bug and mismatch between documentation and reality and should be fixed asap imo.

@htpawel, I have checked the documentation. If so, then documentation needs to updated.

dirrao avatar Mar 11 '24 12:03 dirrao

@dirrao Nope, documentation is correct I believe, we want it to be seconds, but right now they are not seconds in result because of above bug

htpawel avatar Mar 11 '24 12:03 htpawel

This would break the dashboard of any user currently monitoring that metric and their timers will suddenly show 1000x longer durations, right?

ferruzzi avatar Mar 11 '24 18:03 ferruzzi

@ferruzzi If someone is using it right now and applied temporary workaround (added *1000 to metric value) with TODO comment like we did - yes, it might break their dashboard. But I don't think it's the reason to not fix this :D First of all, documentation says it is seconds as result, so documentation is lying to us right now. Second, it will be in release notes so if someone is updating airflow without reading them - it's their fault. Third, it's new metric so I doubt many people are using it. Some of them might find its result useless (like I did at the beginning) and drop, some of them might just applied it and be happy they don't have any queues because they believe in documentation that those are seconds. In general it does not look good if you don't want to fix your code to work properly just because people might have already made ugly workarounds themselves to fix your bug IMHO.

htpawel avatar Mar 12 '24 07:03 htpawel

I never said don't fix it. It's just a matter of if we call it a bugfix and fix it now, creating a breaking change and sending users scrambling to figure out why their queues are suddenly 1000 times longer, or if we have to go through the deprecation process so users are informed.

Since this is documented as seconds, I think we can call it a bugfix and just do it, but it's still something that needs to be decided.

ferruzzi avatar Mar 12 '24 17:03 ferruzzi

cc: @ferruzzi @Bowrna -> I remember there were discussions about it in the past . Did we decide to leave it in s for back-compatibility ?

timedelta object statsd handles internally by converting to milliseconds. But there are some case where we explicitly pass the seconds instead of timedelta object and it is stored in statsd as seconds. so I added a fix with different key name where the newly added key name persists the data in milliseconds while the older key stays for backward compatibility with a deprecation message.

Bowrna avatar Mar 22 '24 09:03 Bowrna

Hey @vandonr - since you are the original author of #30612 and likely will be easier to get all the context back - maybe you can comment on that one? Is that the right fix ? Or maybe @ferruzzi @o-nikolas - since you were around when it was implemented - is there any reason those durations should be kept as they are (and maybe there is a constructive proposal how to handle that case. It looks like we have somewhat long lasting issues that those new metrics introduced in #30612 caused and possibly we should find a way how to address it before 2.9.0 rc2 is out ?

potiuk avatar Mar 26 '24 20:03 potiuk

I don't have anything against converting this metric to milliseconds, I believe I wasn't aware of that statsd recommendation when I wrote that code.

However, there are plenty of other timers that are emitted in seconds in airflow, and I think if we make a migration effort, we might as well migrate all at once rather than little by little.

We must be aware of the high impact this can have for users as well: if they have threshold alerts on those metrics, they will certainly ring when they see the metric go x1000.

vandonr avatar Mar 27 '24 20:03 vandonr

@vandonr You probably didn't read the topic fully or misunderstood. Airflow indeed emits most of metrics (or all of them) in seconds, and those two should also be emitted in seconds like you wanted to achieve and like it is stated in documentation and like I also want it to be (and everyone I suppose). But unfortunately they are NOT right now. It is because you need to pass milliseconds or delta time object to Statsd timing (then it will emit metric in seconds). But you are passing seconds which is incorrect. Check all other places in Airflow code - delta time object is passed. This is obvious bug which we need to fix with above bugfix.

htpawel avatar Mar 27 '24 21:03 htpawel

@vandonr You probably didn't read the topic fully or misunderstood. Airflow indeed emits most of metrics (or all of them) in seconds, and those two should also be emitted in seconds like you wanted to achieve and like it is stated in documentation and like I also want it to be (and everyone I suppose). But unfortunately they are NOT right now. It is because you need to pass milliseconds or delta time object to Statsd timing (then it will emit metric in seconds). But you are passing seconds which is incorrect. Check all other places in Airflow code - delta time object is passed. This is obvious bug which we need to fix with above bugfix.

@htpawel https://github.com/apache/airflow/issues/33426#issuecomment-1790610192

Please check here. Statsd timing internally sends the stat in milliseconds. I am sharing this here as you told Statsd timing emit metric in seconds. Does statsd sets the data in milliseconds but emits in seconds?

Bowrna avatar Mar 28 '24 05:03 Bowrna

Ok now I am not 100% sure it emits it in seconds (it will require further investigation), but anyway it is 100% clear that it expects time delta or milliseconds as input, not seconds (it is even said in the code you sent in comment), to work properly.

htpawel avatar Mar 28 '24 07:03 htpawel

Ok, I did further investigation and now I know everything :D Statsd by convention should always emit milliseconds. (See ms in their code: self._send_stat(stat, '%0.6f|ms' % delta, rate)). And that is why they are requesting time delta object or value in milliseconds: (in their code: delta can be either a number of milliseconds or a timedelta.). But later you need to do something with those metrics, so we (and probably most of the people) use statsd-exporter from prometheus. And in statsd-exporter documentation you can read Timers will be accepted with the ms statsd type. Statsd timer data is transmitted in milliseconds, while Prometheus expects the unit to be seconds. The exporter converts all timer observations to seconds.. So that is why I thought it is emitted in seconds, I was just checking statsd-exporter /metrics endpoint. So it turns out that not this, but other metrics are wrong in documentation, eg. dagrun.schedule_delay.<dag_id> as it passes time delta to Stats.timing so emits milliseconds but airflow documentation says those are seconds. And above two from this PR are indeed emitted in seconds, but obviously it is not correct either as those should be milliseconds and are emitted with ms suffix. So there are more things f*cked up generally. Timing/Timer should always receive milliseconds value or time delta object and it emits milliseconds, which can be later consumed by statsd-exporter and exported in seconds for prometheus to scrape them.

htpawel avatar Mar 28 '24 08:03 htpawel

Sounds like a breaking bug-fix to me . I.e. yes - we know it will be breaking, but it was broken in the first place anyway. @ferruzzi - WDYT? I'd merge it and add significant note about it

potiuk avatar Apr 02 '24 18:04 potiuk

I has been using Airflow with statsd-exporter for quite a long time and to be honest, while I found the document about the timers sometimes inconsistent, the values exported by Statsd which we are using look good to me (except task scheduled and queued duration metrics as I have not used them yet) So +1 for this PR and a fix in the document 🙇

tanvn avatar Apr 03 '24 01:04 tanvn

So dag.<dag_id>.<task_id>.queued_duration documented as Seconds a task spends in the Queued state, before being Running ends up exported to Prometheus as a value in kiloseconds :D. You have to multiply it by 1000 to get a value in seconds.

HTRafal avatar Apr 04 '24 08:04 HTRafal

KILOSECONDS? Why the heck used KILOSECONDS for anything?

It looks like consensus is that we don't have to deal with deprecation, so that's great.

ferruzzi avatar Apr 04 '24 18:04 ferruzzi

It looks like consensus is that we don't have to deal with deprecation, so that's great.

Agree.

potiuk avatar Apr 04 '24 18:04 potiuk

@Bowrna @HTRafal @tanvn - Does this proposed change look like it will address your comments? If you all confirm, I'll approve and we can get it merged. The code looks fin to me, just want to confirm that it will fix what you are seeing.

@htpawel - Can you think of some kind of unit test on (Stats.timer, satsd_logger.timer and/or otel_logger.timer) that asserts the expected output format? Maybe create a timer, sleep(1), and assert the duration is how we want it to look? I know this was a really small thing and it really dragged out, sorry about that.

ferruzzi avatar Apr 04 '24 21:04 ferruzzi

@Bowrna @HTRafal @tanvn - Does this proposed change look like it will address your comments? If you all confirm, I'll approve and we can get it merged. The code looks fin to me, just want to confirm that it will fix what you are seeing.

@htpawel - Can you think of some kind of unit test on (Stats.timer, satsd_logger.timer and/or otel_logger.timer) that asserts the expected output format? Maybe create a timer, sleep(1), and assert the duration is how we want it to look? I know this was a really small thing and it really dragged out, sorry about that.

yes it does address the comments I shared.

Bowrna avatar Apr 05 '24 08:04 Bowrna

@Bowrna @HTRafal @tanvn - Does this proposed change look like it will address your comments? If you all confirm, I'll approve and we can get it merged. The code looks fin to me, just want to confirm that it will fix what you are seeing.

@htpawel - Can you think of some kind of unit test on (Stats.timer, satsd_logger.timer and/or otel_logger.timer) that asserts the expected output format? Maybe create a timer, sleep(1), and assert the duration is how we want it to look? I know this was a really small thing and it really dragged out, sorry about that.

Yes, this PR changes the duration to a timedelta, which is correct.

tanvn avatar Apr 05 '24 13:04 tanvn

Cool. @htpawel - If you can add that quick unit test, I'll approve and merge this.

ferruzzi avatar Apr 05 '24 16:04 ferruzzi

@ferruzzi @o-nikolas What test did you have in mind? I can't think of any to accomplish this:

Other than adding a unittest to ensure this doesn't happen again. I'm happy to approve the change.

Developers should use Stats.timer() context manager as much as they can, but there are scenarios like the one above when Stats.timing() is needed. Then developers should pass either timedelta object or milliseconds, but it is not possible to validate this input, you can't know if given number is milliseconds or seconds. That's why statsd third party module is not able to validate that also - it is in documentation.

htpawel avatar Apr 22 '24 16:04 htpawel

Developers should use Stats.timer() context manager as much as they can, but there are scenarios like the one above when Stats.timing() is needed. Then developers should pass either timedelta object or milliseconds, but it is not possible to validate this input, you can't know if given number is milliseconds or seconds. That's why statsd third party module is not able to validate that also - it is in documentation.

ACK, running the workflow again after the latest merge from main

o-nikolas avatar Apr 30 '24 20:04 o-nikolas

If we are not going to add any unit tests, I think it is acceptable to merge this PR now?

tanvn avatar May 11 '24 01:05 tanvn

What test did you have in mind? I can't think of any to accomplish this

If you are trying to make sure the metric is emitted in seconds and not milliseconds, maybe start a timer, sleep(1), and assert that Stats.timer() was called with a value of 1 and not 1000?

ferruzzi avatar May 14 '24 17:05 ferruzzi

What test did you have in mind? I can't think of any to accomplish this

If you are trying to make sure the metric is emitted in seconds and not milliseconds, maybe start a timer, sleep(1), and assert that Stats.timer() was called with a value of 1 and not 1000?

I've already made sure that metric is emitted in milliseconds (if you use it correctly, so pass milliseconds or delta) but if someone will pass seconds it will emit seconds obviously (with ms suffix..) and there is no test / mechanism to prevents this automatically in any way.. Developers just must know that Statsd expects milliseconds or delta object only, that's convention.

htpawel avatar May 15 '24 07:05 htpawel

Developers just must know that Statsd expects milliseconds or delta object only, that's convention.

That's horrible. So we don't have any way of knowing, catching, or preventing this from happening again? Alright. I'll approve it as-is then.

ferruzzi avatar May 15 '24 21:05 ferruzzi