
Scheduler CPU usage

Open csepulveda opened this issue 2 years ago • 12 comments

Chart Version

8.5.3

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.11-eks-f17b81", GitCommit:"f17b810c9e5a82200d28b6210b458497ddfcf31b", GitTreeState:"clean", BuildDate:"2021-10-15T21:46:21Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

Helm Version

version.BuildInfo{Version:"v3.7.2", GitCommit:"663a896f4a815053445eec4153677ddc24a0a361", GitTreeState:"clean", GoVersion:"go1.17.3"}

Description

Hello guys. I don't know if this behaviour is normal, but the Airflow scheduler uses a lot of CPU. I use EKS with m5n.large instances, and the scheduler uses more than 30% of a CPU without any DAGs. (I thought it could be my DAGs, but I removed them all and the CPU usage is almost the same.)

kubectl top pod -n airflow --sort-by=cpu --use-protocol-buffers=true
NAME                                      CPU(cores)   MEMORY(bytes)
airflow-scheduler-5d5998697c-8vv7x        228m         194Mi           
airflow-pgbouncer-86578bd5b6-vfjgn        10m          11Mi            
airflow-flower-75c46d996f-n4pk7           9m           129Mi           
airflow-postgresql-0                      8m           39Mi            
airflow-redis-master-0                    8m           6Mi             
airflow-web-5cc6d98f65-9pkj6              5m           1026Mi          
airflow-worker-0                          3m           973Mi           
airflow-db-migrations-59fd758646-8qsrv    1m           218Mi           
airflow-sync-users-57989db584-2gn4h       1m           219Mi           
airflow-sync-variables-695d855c88-slvf4   1m           91Mi 

Relevant Logs

____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
[2022-01-27 14:10:19,518] {scheduler_job.py:662} INFO - Starting the scheduler
[2022-01-27 14:10:19,518] {scheduler_job.py:667} INFO - Processing each file at most -1 times
[2022-01-27 14:10:20,606] {manager.py:254} INFO - Launched DagFileProcessorManager with pid: 50
[2022-01-27 14:10:20,608] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
[2022-01-27 14:15:20,729] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
[2022-01-27 14:20:20,865] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
[2022-01-27 14:25:20,989] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
[2022-01-27 14:30:21,126] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
[2022-01-27 14:35:21,244] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
[2022-01-27 14:40:21,367] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
[2022-01-27 14:45:21,480] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
[2022-01-27 14:50:21,587] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs

Custom Helm Values

airflow:

  config:
    AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL: 120
    AIRFLOW__SCHEDULER__PROCESSOR_POLL_INTERVAL: 10
    AIRFLOW__SCHEDULER__PARSING_PROCESSES: 4
    AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: 300
    AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC: 10


dags:
  gitSync:
    enabled: true
    repo: "[email protected]:xxxxx/data_team/etl/airflow.git"
    branch: "xxxx"
    sshSecret: "airflow-ssh-git-secret"
    sshSecretKey: "id_rsa"
    sshKnownHosts: |-
      gitlab.com ssh-rsa xxxxx

webserverSecretKey: xxxxx

csepulveda avatar Jan 27 '22 14:01 csepulveda

OK, I found something. I modified the probe period to 300s and the CPU usage went from 250m-300m down to 20m-30m.

Why does this probe take so much CPU? Is there another way to execute this probe?

csepulveda avatar Jan 27 '22 18:01 csepulveda

Stats after changing scheduler.livenessProbe.periodSeconds to 300:

(screenshot: scheduler CPU usage after the change)

csepulveda avatar Jan 27 '22 19:01 csepulveda

@csepulveda thanks for raising this!

Just a warning, setting scheduler.livenessProbe.periodSeconds to 300 means your scheduler could be down for periodSeconds * failureThreshold seconds, which is 300 * 5 = 1500 seconds or 25 minutes!
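
To put numbers on that trade-off, here is a minimal sketch (the failureThreshold of 5 is the value used in the calculation above; the candidate periodSeconds values are only illustrative):

# Minimal sketch of the downtime math above; failureThreshold=5 is the value
# from the 300 * 5 calculation, and the periodSeconds candidates are illustrative.
failure_threshold = 5

for period_seconds in (30, 60, 300):
    worst_case = period_seconds * failure_threshold
    print(f"periodSeconds={period_seconds}: scheduler could be down for up to "
          f"{worst_case}s (~{worst_case / 60:.1f} min) before the failing probe triggers a restart")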

FYI: The code for the probe is found here.

Can you try running the probe manually on the scheduler pod to see what's slow?

  1. run kubectl exec -i -t --namespace MY_NAMESPACE SCHEDULER_POD_NAME -- /bin/bash
  2. start a Python shell
  3. run the probe line by line, to find out which part is taking up the CPU (or is slow); a timing sketch follows the snippet below
import sys
from typing import List
from airflow.jobs.scheduler_job import SchedulerJob
from airflow.utils.db import create_session
from airflow.utils.net import get_hostname
from airflow.utils.state import State

with create_session() as session:
    hostname = get_hostname()
    query = session \
        .query(SchedulerJob) \
        .filter_by(state=State.RUNNING, hostname=hostname) \
        .order_by(SchedulerJob.latest_heartbeat.desc())
    jobs: List[SchedulerJob] = query.all()
    alive_jobs = [job for job in jobs if job.is_alive()]
    count_alive_jobs = len(alive_jobs)

if count_alive_jobs == 1:
    # scheduler is healthy - we expect one SchedulerJob per scheduler
    pass
elif count_alive_jobs == 0:
    sys.exit(f"UNHEALTHY - 0 alive SchedulerJob for: {hostname}")
else:
    sys.exit(f"UNHEALTHY - {count_alive_jobs} (more than 1) alive SchedulerJob for: {hostname}")

thesuperzapper avatar Feb 01 '22 23:02 thesuperzapper

@csepulveda have you had a chance to investigate the high CPU usage further using my comment https://github.com/airflow-helm/charts/issues/512#issuecomment-1027381894?

I would love to figure out what the problem is to understand if it affects other users.

thesuperzapper avatar Feb 18 '22 06:02 thesuperzapper

Chart Version

8.5.3

Airflow Version

2.2.2

Kubernetes

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6+vmware.1", GitCommit:"088f01db2ffab397a290be443902918b59ee032c", GitTreeState:"clean", BuildDate:"2021-04-17T01:01:00Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}

@thesuperzapper hi,

we are experiencing similar issues; our dev environment is running one scheduler at 1000m+ with 15 simple DAGs that don't run very frequently.

NAME                                     CPU(cores)   MEMORY(bytes)
airflow-labs-scheduler-76785b597-6b4ll   1207m        328Mi

I have even increased MIN_FILE_PROCESS_INTERVAL from 60 to 300, but it's actually even worse.

AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: "300"
AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC: "30"
AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC: "15"
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL: "300"
AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD: "120"
AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD: "600"
AIRFLOW__SCHEDULER__CATCHUP_BY_DEFAULT: "False"
AIRFLOW__SCHEDULER__FILE_PARSING_SORT_MODE: "modified_time"
AIRFLOW__SCHEDULER__PARSING_PROCESSES: "4"

Production & Pre-Production environments are about

NAME                                     CPU(cores)   MEMORY(bytes)
airflow-sit-scheduler-68799489db-cddbg   4593m        297Mi
airflow-sit-scheduler-68799489db-pcdd9   4102m        246Mi
airflow-sit-scheduler-68799489db-9wp4z   3759m        307Mi

...and more, which is crazy. I am really struggling to understand what is using so much CPU (yes, there are 400+ DAGs, but even so...).

I tested the liveness probe line by line; the only line that took a bit longer (3-5s) is

from airflow.jobs.scheduler_job import SchedulerJob

Following the OP and setting the probe period to 300s, just out of interest, didn't affect this in any way.

jurovee avatar Mar 04 '22 10:03 jurovee

@juroVee just to be clear, are you also encountering high CPU usage when scheduler.livenessProbe.enabled is set to false, or just when it's true?

If the probe causes high CPU usage, I imagine it's because importing anything from airflow (like from airflow.jobs.scheduler_job import SchedulerJob) results in quite a bit of processing, possibly even parsing all your DAG files and loading any plugins.

You could confirm whether this is the case by running from airflow.jobs.scheduler_job import SchedulerJob in a Python shell with kubectl exec and observing whether the CPU usage spikes.
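
For example, something like this quick sketch (run in a fresh Python shell inside the scheduler pod, so the airflow modules are not already cached) would show how much time the import alone costs:

# Sketch: measure the wall-clock and CPU time spent on the single import.
import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

from airflow.jobs.scheduler_job import SchedulerJob  # the import suggested above

print(f"wall time: {time.perf_counter() - wall_start:.2f}s")
print(f"cpu time:  {time.process_time() - cpu_start:.2f}s")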

thesuperzapper avatar Mar 18 '22 07:03 thesuperzapper

@csepulveda @juroVee I would love to know if you still encounter high CPU usage on the scheduler in version 8.6.0 of the chart.

There were some changes to the liveness probe, and a log-cleanup sidecar was added to the scheduler, among other things.

thesuperzapper avatar Apr 19 '22 23:04 thesuperzapper

@thesuperzapper We are running 8.6.0 on DEV only with a minimal number of DAGs, so I will get back to you after we move to PROD in about 1-2 weeks' time. However, we managed to, let's say, optimize CPU usage by setting AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL: "300". It still spikes to 4000m+ sometimes, but in general it's more stable after that, as the file processing is more 'spread out'.

The DEV environment with 8.6.0 is pretty stable right now, seems good. I will get back to you once we look again at the mentioned spikes on PROD with 8.6.0.

jurovee avatar Apr 20 '22 10:04 jurovee

@thesuperzapper We have been running 8.6.0 in PROD for about a week and CPU usage has indeed decreased (even more than after increasing AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL).

Values are averaging around the following:

NAME                      CPU(cores)   MEMORY(bytes)
airflow-scheduler-xxx       200m         243Mi
airflow-scheduler-yyy       135m         246Mi
airflow-scheduler-zzz       128m         256Mi 

So it certainly helped; the difference is massive when using 8.6.0 + AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL: "300".

jurovee avatar May 04 '22 07:05 jurovee

@juroVee that's great to hear! Out of interest, did you leave the new log-cleanup sidecar enabled?

Also, I assume you are using the same airflow version?

thesuperzapper avatar May 04 '22 08:05 thesuperzapper

@thesuperzapper we use Airflow 2.2.5.

Well, regarding log-cleanup, you can find the answer to your question in #571 😃. So no, we don't have it enabled; we have actually deleted it from the chart itself due to that issue, unfortunately.

jurovee avatar May 04 '22 08:05 jurovee

Does anyone know if this is fixed after the changes to the SchedulerJob healthcheck (removed)? https://github.com/apache/airflow/releases/tag/2.6.1

edify42 avatar Oct 12 '23 05:10 edify42