airflow Add metrics about task CPU and memory usage

These metrics report CPU and memory usage for each task. They are sent as gauges every second.
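To illustrate what "sent as gauges every second" could look like, here is a minimal sketch of a sampling loop. The `read` and `emit` callables, the metric names `task.cpu_usage` / `task.mem_usage`, and the tag keys are hypothetical stand-ins for illustration, not the names used in this PR.

```python
import threading
from typing import Callable, Dict, Tuple

# Hypothetical signatures: the reader returns (cpu_percent, memory_percent);
# the emitter receives (metric_name, value, tags).
Reader = Callable[[], Tuple[float, float]]
Emitter = Callable[[str, float, Dict[str, str]], None]


def sample_task_utilization(
    read: Reader,
    emit: Emitter,
    stop: threading.Event,
    tags: Dict[str, str],
    interval: float = 1.0,
) -> int:
    """Emit one CPU gauge and one memory gauge per interval until `stop` is set.

    Returns the number of sampling rounds that were performed.
    """
    samples = 0
    while not stop.is_set():
        cpu_percent, mem_percent = read()
        # Metric names are illustrative assumptions.
        emit("task.cpu_usage", cpu_percent, tags)
        emit("task.mem_usage", mem_percent, tags)
        samples += 1
        # Event.wait() sleeps for `interval` seconds but returns early
        # as soon as the task finishes and `stop` is set.
        stop.wait(interval)
    return samples
```

In this sketch the loop would run in a background thread for the lifetime of the task, and the code that runs the task would set `stop` once the task completes.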


May 15 '24 18:05 vincbeck

This is good. However, isn't it a good idea to capture the utilization metrics of the entire pod (including sidecar containers) instead of just the base container?

It seems very related to Kubernetes? I am trying to come up with a solution compatible across all executor environments. If it is possible to have such a solution that is also compatible with other executors, I am all ears, but I don't have enough experience with Kubernetes to come up with one. Or maybe as a follow-up PR, if you want to add that?

May 16 '24 16:05 vincbeck

Very cool! Left some comments.

Also is it possible to unit test this?

The only way I could find to unit test it is to check that we call the function _read_task_utilization; I could not find a way to actually test _read_task_utilization itself.
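For reference, the "check that we call it" approach can be sketched with `unittest.mock`. `FakeTaskRunner` below is a hypothetical stand-in for whatever class owns `_read_task_utilization`; the real class and its `start` method are assumptions, not taken from this PR.

```python
from unittest import mock


class FakeTaskRunner:
    """Hypothetical stand-in for the class that owns _read_task_utilization."""

    def start(self) -> str:
        # In this sketch, starting the runner triggers utilization sampling.
        self._read_task_utilization()
        return "started"

    def _read_task_utilization(self) -> None:
        # Stand-in for the real sampling logic, which would inspect the process.
        raise RuntimeError("would sample the real process here")


def test_start_calls_read_task_utilization() -> None:
    runner = FakeTaskRunner()
    # Replace the method with a mock so that no real sampling happens.
    with mock.patch.object(FakeTaskRunner, "_read_task_utilization") as read_mock:
        assert runner.start() == "started"
    read_mock.assert_called_once_with()
```

This only verifies the call, not the sampling logic itself, which is the limitation described above.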

May 16 '24 17:05 vincbeck

Never mind! I found a solution!

May 16 '24 19:05 vincbeck

Any more concerns/comments?

May 21 '24 15:05 vincbeck

@Taragolis

May 22 '24 17:05 vincbeck

Holy cardinality batman!

May 22 '24 19:05 ashb

Question about this PR: Memory and CPU are reported as a percentage of the available memory/CPU on the system, so to understand actual memory/CPU consumption (expressed in bytes/# of cores) you additionally need metrics on how much memory/CPU is available to the system.

However... even if I have such metrics on available resources, since this PR only reports consumption on a DAG and task level (not task instance/mapped task instance), I'm unsure how useful it is to link those up. Additionally, with tasks that can run on different hardware, we could see different percentages while multiple instances of a task could consume the same amount of resources.

Wouldn't it be more useful to report on psutil.virtual_memory().total * psutil.Process().memory_percent() / 100 to get consumption in bytes/# of cores? That way we can compare apples with apples.

May 30 '24 09:05 BasPH

If that's really a need, I would say let's report both metrics (percentage and actual number). I am pretty sure some folks would rather have percentage metrics than actual numbers, because they will have the opposite argument (knowing that a task consumes X memory is not really useful unless I know how much memory I have).
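A rough sketch of reporting both values for memory, assuming psutil is installed. `Process.memory_percent()` returns a value on a 0-100 scale relative to total physical memory, so converting back to bytes needs a division by 100. The function names and dictionary keys here are hypothetical.

```python
from typing import Dict

try:
    import psutil  # third-party dependency; only needed by read_both_memory_metrics()
except ImportError:
    psutil = None


def memory_percent_to_bytes(total_bytes: int, percent: float) -> int:
    """Convert a 0-100 memory percentage into an absolute number of bytes."""
    return int(total_bytes * percent / 100.0)


def read_both_memory_metrics() -> Dict[str, float]:
    """Return the current process memory usage as a percentage and in bytes."""
    if psutil is None:
        raise RuntimeError("psutil is required to read process metrics")
    process = psutil.Process()
    mem_percent = process.memory_percent()
    total_bytes = psutil.virtual_memory().total
    return {
        "mem_percent": mem_percent,
        "mem_bytes": memory_percent_to_bytes(total_bytes, mem_percent),
    }
```

Reading `process.memory_info().rss` directly would give the byte count without the round trip through a percentage.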

May 30 '24 13:05 vincbeck

I think (@howardyoo - @ferruzzi can you confirm?) the addition of traces should make all the resource information automatically available if you enable it via OpenTelemetry (and traces will link the metrics about resources to tasks/DAGs automatically). From what I know, OTEL has a way to enable all the "system"/"python" etc. metrics out of the box, and the "traces" addition should (IMHO) label such metrics with appropriate labels for Airflow "logical" tags, i.e. DAGs/tasks etc.

See https://github.com/apache/airflow/pull/37948

But maybe I am too optimistic there :) ?

Jun 01 '24 10:06 potiuk

The OpenTelemetry SDK for Python does provide an 'auto-instrumentation' feature that can automatically detect and produce traces, but those will not include metrics like CPU, memory usage, fs I/O, net I/O, processes, etc. However, we can definitely implement those as additional instrumentation using the psutil package. It would also be very helpful if these metrics could become part of the trace attributes, so that a trace could contain them as either values or span events, as needed. When these metrics are produced, they are highly likely to be correlated with the task's execution, so it makes sense to have them exist for the task's duration.

My concern is that for some monitoring tools this may introduce high cardinality (as each individual task run can be treated as an independent source by some tools), so we might want to be able to turn this on or off through configuration.
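One possible shape for such a switch, as a sketch only. `UtilizationMetricsConfig`, its fields, and `build_tags` are hypothetical names for illustration, not existing Airflow configuration options.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class UtilizationMetricsConfig:
    """Hypothetical settings; these are not real Airflow config options."""

    enabled: bool = False       # master switch for the utilization gauges
    per_task_tags: bool = True  # set to False to drop task_id and limit cardinality


def build_tags(
    config: UtilizationMetricsConfig, dag_id: str, task_id: str
) -> Optional[Dict[str, str]]:
    """Return the tags to attach to a gauge, or None when reporting is disabled."""
    if not config.enabled:
        return None
    tags = {"dag_id": dag_id}
    if config.per_task_tags:
        tags["task_id"] = task_id
    return tags
```

With `per_task_tags` disabled, all tasks of a DAG share one series per metric, which trades detail for lower cardinality.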

Jun 01 '24 13:06 howardyoo