Scheduler CPU usage
Checks
- [X] I have checked for existing issues.
- [X] This report is about the user-community Airflow Helm Chart.
Chart Version
8.5.3
Kubernetes Version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.11-eks-f17b81", GitCommit:"f17b810c9e5a82200d28b6210b458497ddfcf31b", GitTreeState:"clean", BuildDate:"2021-10-15T21:46:21Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
Helm Version
version.BuildInfo{Version:"v3.7.2", GitCommit:"663a896f4a815053445eec4153677ddc24a0a361", GitTreeState:"clean", GoVersion:"go1.17.3"}
Description
Hello guys. I don't know if this behaviour is normal, but the Airflow scheduler uses a lot of CPU. I use EKS with m5n.large instances, and the scheduler uses more than 30% of the CPU without any DAGs. (I thought it could be my DAGs, but I removed them all and the CPU usage is almost the same.)
kubectl top pod -n airflow --sort-by=cpu --use-protocol-buffers=true

NAME CPU(cores) MEMORY(bytes)
airflow-scheduler-5d5998697c-8vv7x 228m 194Mi
airflow-pgbouncer-86578bd5b6-vfjgn 10m 11Mi
airflow-flower-75c46d996f-n4pk7 9m 129Mi
airflow-postgresql-0 8m 39Mi
airflow-redis-master-0 8m 6Mi
airflow-web-5cc6d98f65-9pkj6 5m 1026Mi
airflow-worker-0 3m 973Mi
airflow-db-migrations-59fd758646-8qsrv 1m 218Mi
airflow-sync-users-57989db584-2gn4h 1m 219Mi
airflow-sync-variables-695d855c88-slvf4 1m 91Mi
Relevant Logs
____________ _____________
____ |__( )_________ __/__ /________ __
____ /| |_ /__ ___/_ /_ __ /_ __ \_ | /| / /
___ ___ | / _ / _ __/ _ / / /_/ /_ |/ |/ /
_/_/ |_/_/ /_/ /_/ /_/ \____/____/|__/
[2022-01-27 14:10:19,518] {scheduler_job.py:662} INFO - Starting the scheduler
[2022-01-27 14:10:19,518] {scheduler_job.py:667} INFO - Processing each file at most -1 times
[2022-01-27 14:10:20,606] {manager.py:254} INFO - Launched DagFileProcessorManager with pid: 50
[2022-01-27 14:10:20,608] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
[2022-01-27 14:15:20,729] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
[2022-01-27 14:20:20,865] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
[2022-01-27 14:25:20,989] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
[2022-01-27 14:30:21,126] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
[2022-01-27 14:35:21,244] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
[2022-01-27 14:40:21,367] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
[2022-01-27 14:45:21,480] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
[2022-01-27 14:50:21,587] {scheduler_job.py:1217} INFO - Resetting orphaned tasks for active dag runs
Custom Helm Values
airflow:
  config:
    AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL: 120
    AIRFLOW__SCHEDULER__PROCESSOR_POLL_INTERVAL: 10
    AIRFLOW__SCHEDULER__PARSING_PROCESSES: 4
    AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: 300
    AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC: 10

dags:
  gitSync:
    enabled: true
    repo: "[email protected]:xxxxx/data_team/etl/airflow.git"
    branch: "xxxx"
    sshSecret: "airflow-ssh-git-secret"
    sshSecretKey: "id_rsa"
    sshKnownHosts: |-
      gitlab.com ssh-rsa xxxxx

webserverSecretKey: xxxxx
OK, I found something. I modified the probe period to 300s and the CPU usage went from 250m-300m down to 20m-30m.
Why does this probe take so much CPU? Is there another way to execute this probe?
Stats after changing scheduler.livenessProbe.periodSeconds to 300
@csepulveda thanks for raising this!
Just a warning: setting scheduler.livenessProbe.periodSeconds to 300 means your scheduler could be down for periodSeconds * failureThreshold seconds, which is 300 * 5 = 1500 seconds, or 25 minutes!
FYI: The code for the probe is found here.
Can you try running the probe manually on the scheduler pod to see what's slow?
- run kubectl exec -i -t --namespace MY_NAMESPACE SCHEDULER_POD_NAME -- /bin/bash
- run a python terminal
- run the probe line by line, to find out which part is taking up the CPU (or is slow)
import sys
from typing import List
from airflow.jobs.scheduler_job import SchedulerJob
from airflow.utils.db import create_session
from airflow.utils.net import get_hostname
from airflow.utils.state import State
with create_session() as session:
    hostname = get_hostname()
    query = session \
        .query(SchedulerJob) \
        .filter_by(state=State.RUNNING, hostname=hostname) \
        .order_by(SchedulerJob.latest_heartbeat.desc())
    jobs: List[SchedulerJob] = query.all()
    alive_jobs = [job for job in jobs if job.is_alive()]
    count_alive_jobs = len(alive_jobs)

    if count_alive_jobs == 1:
        # scheduler is healthy - we expect one SchedulerJob per scheduler
        pass
    elif count_alive_jobs == 0:
        sys.exit(f"UNHEALTHY - 0 alive SchedulerJob for: {hostname}")
    else:
        sys.exit(f"UNHEALTHY - {count_alive_jobs} (more than 1) alive SchedulerJob for: {hostname}")
@csepulveda have you had a chance to investigate the high CPU usage further using my comment https://github.com/airflow-helm/charts/issues/512#issuecomment-1027381894?
I would love to figure out what the problem is to understand if it affects other users.
Chart Version
8.5.3
Airflow Version
2.2.2
Kubernetes
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"windows/amd64"} Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6+vmware.1", GitCommit:"088f01db2ffab397a290be443902918b59ee032c", GitTreeState:"clean", BuildDate:"2021-04-17T01:01:00Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
@thesuperzapper hi,
we are experiencing similar issues; our dev environment is running one scheduler at 1000m+ with 15 simple DAGs that don't run very frequently.
NAME CPU(cores) MEMORY(bytes)
airflow-labs-scheduler-76785b597-6b4ll 1207m 328Mi
I have even increased MIN_FILE_PROCESS_INTERVAL from 60 to 300, but it's actually even worse.
AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: "300"
AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC: "30"
AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC: "15"
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL: "300"
AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD: "120"
AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD: "600"
AIRFLOW__SCHEDULER__CATCHUP_BY_DEFAULT: "False"
AIRFLOW__SCHEDULER__FILE_PARSING_SORT_MODE: "modified_time"
AIRFLOW__SCHEDULER__PARSING_PROCESSES: "4"
Production & Pre-Production environments are about
NAME CPU(cores) MEMORY(bytes)
airflow-sit-scheduler-68799489db-cddbg 4593m 297Mi
airflow-sit-scheduler-68799489db-pcdd9 4102m 246Mi
airflow-sit-scheduler-68799489db-9wp4z 3759m 307Mi
...and more, which is crazy. I am really struggling to understand what is using so much CPU (yes, there are 400+ DAGs, but even so...).
I tested the liveness probe line by line; the only part that took a bit longer (3-5s) is
from airflow.jobs.scheduler_job import SchedulerJob
Following the OP and setting the probe period to 300s, just out of interest, didn't affect this in any way.
@juroVee just to be clear, are you also encountering high CPU usage when scheduler.livenessProbe.enabled is set to false, or just when it's true?
If the probe causes high CPU usage, I imagine it's because importing anything from airflow (like from airflow.jobs.scheduler_job import SchedulerJob) results in quite a bit of processing, possibly even parsing all your DAG files and loading any plugins.
You could confirm if this is the case by running from airflow.jobs.scheduler_job import SchedulerJob in a python shell with kubectl exec and observing if the CPU usage spikes.
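As a rough sketch (just a debugging suggestion to run inside the scheduler pod, not something the chart ships), you could measure both the wall-clock and process CPU time that the import consumes:

import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

# the import under test (hypothesized above to trigger a lot of processing)
from airflow.jobs.scheduler_job import SchedulerJob  # noqa: F401

print(f"wall time: {time.perf_counter() - wall_start:.2f}s")
print(f"cpu time:  {time.process_time() - cpu_start:.2f}s")

If the CPU time is close to the 3-5s wall time reported above, the import itself is doing the work rather than waiting on the database.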
@csepulveda @juroVee I would love to know if you still encounter high CPU usage on the scheduler in version 8.6.0 of the chart.
There were some changes to the liveness probe, and a log-cleanup sidecar was added to the scheduler, among other things.
@thesuperzapper We are running 8.6.0 on DEV only, with a minimal number of DAGs, so I will get back to you after we move to PROD in about 1-2 weeks' time. However, we managed to... let's say, optimize CPU usage by setting AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL: "300". It still spikes to 4000m+ sometimes, but in general it's more stable after that, as the file processing is more 'spread out'.
The DEV environment with 8.6.0 is pretty stable right now, seems good. Will get back to you once we look again at the mentioned spikes on PROD with 8.6.0.
@thesuperzapper So we have run 8.6.0 in PROD for about a week, and CPU usage has indeed decreased (even more than after increasing AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL).
Values are averaging around the following:
NAME CPU(cores) MEMORY(bytes)
airflow-scheduler-xxx 200m 243Mi
airflow-scheduler-yyy 135m 246Mi
airflow-scheduler-zzz 128m 256Mi
So it certainly helped; the difference is massive when utilizing 8.6.0 + AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL: "300".
@juroVee that's great to hear! Out of interest, did you leave the new log-cleanup sidecar enabled?
Also, I assume you are using the same airflow version?
@thesuperzapper we use Airflow 2.2.5.
Well, regarding log-cleanup, you can find the answer to your question in #571 😃. So no, we don't have it enabled; we have actually deleted it from the chart itself due to that issue, unfortunately.
Anyone know if this is fixed after the changes to the SchedulerJob healthcheck (removed)? https://github.com/apache/airflow/releases/tag/2.6.1