airflow icon indicating copy to clipboard operation
airflow copied to clipboard

Set parallelism log messages to warning level for better visiblity

Open BasPH opened this issue 9 months ago • 9 comments

This PR raises the log level of several log messages related to parallelism to warning level. IMO it's currently not evident when Airflow reaches a parallelism limit so the purpose of this PR is to make it more obvious that a scalability limit is reached.


^ Add meaningful description above Read the Pull Request Guidelines for more information. In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed. In case of a new dependency, check compliance with the ASF 3rd Party License Policy. In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

BasPH avatar Apr 28 '24 19:04 BasPH

Why should we not warn the user?

BasPH avatar Apr 28 '24 20:04 BasPH

I think the question is why should we warn the user when the parallel limits are reached?

hussein-awala avatar Apr 28 '24 20:04 hussein-awala

To inform the user that a software limit has been reached. For example, unless you enable debug logging there's no way to tell from the logs that AIRFLOW__CORE__PARALLELISM was reached. The default limits are rather low in Airflow so this is a common situation and it's currently not self-evident.

BasPH avatar Apr 28 '24 20:04 BasPH

To inform the user that a software limit has been reached. For example, unless you enable debug logging there's no way to tell from the logs that AIRFLOW__CORE__PARALLELISM was reached. The default limits are rather low in Airflow so this is a common situation and it's currently not self-evident.

I think this is more sutaible for cluster activity dashboard to expose this in actionable manner. Logs are a bit problematic because what you get here is notification that parallesim is reach but you have no information if user set this specific value by design or not. If user chose the specific value then logging warnings is not very user freidnly and in fact not actionable. This can result in further user request to silent the messages. In my point of view If we can find a way to present the information in cluster activity dashboard that would be prefered.

eladkal avatar Apr 28 '24 21:04 eladkal

-0.9 for this change, reaching the parallelism limits is a normal and expected thing and we should not warn the user when it happens.

This is only true for users whose Airflow environment is somewhat optimized. There are a lot of configs to consider (parallelism, worker concurrency, pools) in order to limit task throughput -- surfacing more info to the user could only be helpful IMO.

I think this is more sutaible for cluster activity dashboard to expose this in actionable manner. Logs are a bit problematic because what you get here is notification that parallesim is reach but you have no information if user set this specific value by design or not. If user chose the specific value then logging warnings is not very user freidnly and in fact not actionable. This can result in further user request to silent the messages. In my point of view If we can find a way to present the information in cluster activity dashboard that would be prefered.

Why not both? I personally spend a lot more time debugging Airflow logs than I do using the cluster activity dashboard. If parallelism is reached unexpectedly such that it causes a problem, I'd definitely find it faster in the logs.

RNHTTR avatar Apr 29 '24 22:04 RNHTTR

My 2c, I think info is appropriate.

jedcunningham avatar Apr 29 '24 23:04 jedcunningham

This is only true for users whose Airflow environment is somewhat optimized. There are a lot of configs to consider (parallelism, worker concurrency, pools) in order to limit task throughput -- surfacing more info to the user could only be helpful IMO.

Exactly, in this case:

  1. It would be useless to warn the user when the pool limit or dag/run/task max concurrency are reached, because these limits are usually set according to the capacity of an external system
  2. Some users (including me) will find some of these warnings annoying especially that we can use the OTEL or statsd metrics to monitor the concurrency and create alerts.

hussein-awala avatar Apr 30 '24 19:04 hussein-awala

This is only true for users whose Airflow environment is somewhat optimized. There are a lot of configs to consider (parallelism, worker concurrency, pools) in order to limit task throughput -- surfacing more info to the user could only be helpful IMO.

Exactly, in this case:

  1. It would be useless to warn the user when the pool limit or dag/run/task max concurrency are reached, because these limits are usually set according to the capacity of an external system
  2. Some users (including me) will find some of these warnings annoying especially that we can use the OTEL or statsd metrics to monitor the concurrency and create alerts.

Yeah that's fair. Do you agree that at least an info log is reasonable?

RNHTTR avatar Apr 30 '24 22:04 RNHTTR

Okay I reverted a few logging statements so that now leaves an info log when AIRFLOW__CORE__PARALLELISM is reached, which should help users understand why Airflow isn't running any more tasks.

@hussein-awala Regarding:

It would be useless to warn the user when the pool limit or dag/run/task max concurrency are reached, because these limits are usually set according to the capacity of an external system

Disagree with this statement. Airflow has lots of knobs to tweak for parallelism which can be a blessing and a curse. You might have a perfect setup with OTEL integration, dashboards, and alerts, but I'm sure that knowledge and tooling wasn't developed overnight. My experience is that Airflow's default settings quite often limit the user before they start reaching hardware limits. This is a frustrating problem because as of today visibility (via UI/logs) on such limits is low. I'll investigate alternative options such as dashboards, but stating that it's "useless to warn the user" is not my experience since 9 out of 10 times I see people hitting a default limit instead of a user-defined limit.

Sidenote: we currently have one warning-level log regarding pool limits, should we downgrade that to INFO level? https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job_runner.py#L450

BasPH avatar May 02 '24 09:05 BasPH

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jul 17 '24 00:07 github-actions[bot]

@BasPH aaah, thanks for the merge of this PR. Saw the logs first time today and reminded me that I need to fix a bug in my PoC for Remote Executor/AIP-69 :-D

jscheffl avatar Jul 17 '24 19:07 jscheffl