Executor reports task instance (...) finished (failed) although the task says it's queued
Apache Airflow version
2.9.1
If "Other Airflow 2 version" selected, which one?
No response
What happened?
[2024-05-20T12:03:24.184+0000] {task_context_logger.py:91} ERROR - Executor reports task instance
<TaskInstance: (...) scheduled__2024-05-20T11:00:00+00:00 map_index=15 [queued]>
finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?
What you think should happen instead?
No response
How to reproduce
I am not sure, unfortunately. But every day I see my tasks being killed at random, with no clear reason for why they were killed or failed.
Operating System
Ubuntu 22.04.4 LTS
Versions of Apache Airflow Providers
apache-airflow==2.9.1
apache-airflow-providers-amazon==8.20.0
apache-airflow-providers-celery==3.6.2
apache-airflow-providers-cncf-kubernetes==8.1.1
apache-airflow-providers-common-io==1.3.1
apache-airflow-providers-common-sql==1.12.0
apache-airflow-providers-docker==3.10.0
apache-airflow-providers-elasticsearch==5.3.4
apache-airflow-providers-fab==1.0.4
apache-airflow-providers-ftp==3.8.0
apache-airflow-providers-google==10.17.0
apache-airflow-providers-grpc==3.4.1
apache-airflow-providers-hashicorp==3.6.4
apache-airflow-providers-http==4.10.1
apache-airflow-providers-imap==3.5.0
apache-airflow-providers-microsoft-azure==10.0.0
apache-airflow-providers-mongo==4.0.0
apache-airflow-providers-mysql==5.5.4
apache-airflow-providers-odbc==4.5.0
apache-airflow-providers-openlineage==1.7.0
apache-airflow-providers-postgres==5.10.2
apache-airflow-providers-redis==3.6.1
apache-airflow-providers-sendgrid==3.4.0
apache-airflow-providers-sftp==4.9.1
apache-airflow-providers-slack==8.6.2
apache-airflow-providers-smtp==1.6.1
apache-airflow-providers-snowflake==5.4.0
apache-airflow-providers-sqlite==3.7.1
apache-airflow-providers-ssh==3.10.1
Deployment
Docker-Compose
Deployment details
Client: Docker Engine - Community
Version: 26.1.3
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.14.0
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.27.0
Path: /usr/libexec/docker/cli-plugins/docker-compose
scan: Docker Scan (Docker Inc.)
Version: v0.23.0
Path: /usr/libexec/docker/cli-plugins/docker-scan
Server:
Containers: 30
Running: 25
Paused: 0
Stopped: 5
Images: 36
Server Version: 26.1.3
Storage Driver: overlay2
Backing Filesystem: btrfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: e377cd56a71523140ca6ae87e30244719194a521
runc version: v1.1.12-0-g51d5e94
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.15.0-107-generic
Operating System: Ubuntu 22.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 80
Total Memory: 62.33GiB
Name: troy
ID: UFMO:HODB:7MRE:7O2C:FLWN:HE4Y:EZDF:ZGNF:OZRW:BUTZ:DBQK:MFR2
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
OS: Ubuntu 22.04.4 LTS x86_64
Kernel: 5.15.0-107-generic
Uptime: 1 day, 23 hours, 12 mins
Packages: 847 (dpkg), 4 (snap)
Shell: fish 3.7.1
Resolution: 1024x768
Terminal: /dev/pts/0
CPU: Intel Xeon Silver 4316 (80) @ 3.400GHz
GPU: 03:00.0 Matrox Electronics Systems Ltd. Integrated
Memory: 24497MiB / 63830MiB
Anything else?
No response
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
I'm not sure there's an Airflow issue here.
My initial thought is that you are experiencing issues related to your workers and perhaps they are falling over due to resource issues, e.g. disk or RAM?
I can see that you are using dynamic task mapping which, depending on what you are asking the workers to do, how many parallel tasks and the number of workers you have, could be overloading your capacity.
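If mapped-task bursts do turn out to be the problem, one knob worth trying is max_active_tis_per_dag on the mapped task. A minimal sketch, assuming the TaskFlow API; the DAG and task names here are made up:
# throttled_mapping_example.py (hypothetical DAG; names are placeholders)
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def throttled_mapping_example():
    @task
    def make_batches():
        return list(range(100))

    # max_active_tis_per_dag caps how many mapped instances run concurrently,
    # so a burst of mapped tasks cannot saturate the Celery workers.
    @task(max_active_tis_per_dag=8)
    def process(batch: int):
        return batch * 2

    process.expand(batch=make_batches())

throttled_mapping_example()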
Not sure...it seems related to redis? I have seen other people report similar issues:
- https://github.com/apache/airflow/issues/26542#issuecomment-1913058540
- https://github.com/celery/celery/discussions/7276#discussioncomment-8160246
- https://github.com/apache/airflow/pull/31829
- https://github.com/celery/celery/issues/8030
- https://github.com/celery/celery/pull/8796
- https://github.com/celery/celery/issues/8845
Also, a lot of DAGs are failing for the same reason, so it's not entirely tied to task mapping at all. Some tasks fail very early. Also, this server has a lot of RAM, of which I've granted ~12 GB to each worker, and the tasks are very simple, just HTTP requests; all of them run in less than 2 minutes when they don't fail.
I think the log you shared (source) erroneously replaced the "stuck in queued" log somehow. Can you check your scheduler logs for "stuck in queued"?
@RNHTTR there's nothing stating "stuck in queued" on scheduler logs.
same issue here
I had the same issue when running hundreds of sensors on reschedule mode - a lot of the times they got stuck in the queued status and raised the same error that you posted. It turned out that our redis pod used by Celery restarted quite often and lost the info about queued tasks. Adding persistence to redis seems to have helped. Do you have persistence enabled?
> I had the same issue when running hundreds of sensors on reschedule mode - a lot of the times they got stuck in the queued status and raised the same error that you posted. It turned out that our redis pod used by Celery restarted quite often and lost the info about queued tasks. Adding persistence to redis seems to have helped. Do you have persistence enabled?
Can you help me how to add this persistence?
Hi @nghilethanh-atherlabs I've been experimenting with those configs as well:
# airflow.cfg
# https://airflow.apache.org/docs/apache-airflow-providers-celery/stable/configurations-ref.html#task-acks-late
# https://github.com/apache/airflow/issues/16163#issuecomment-1563704852
task_acks_late = False
# https://github.com/apache/airflow/blob/2b6f8ffc69b5f34a1c4ab7463418b91becc61957/airflow/providers/celery/executors/default_celery.py#L52
# https://github.com/celery/celery/discussions/7276#discussioncomment-8720263
# https://github.com/celery/celery/issues/4627#issuecomment-396907957
[celery_broker_transport_options]
visibility_timeout = 300
max_retries = 120
interval_start = 0
interval_step = 0.2
interval_max = 0.5
# sentinel_kwargs = {}
For Redis persistence, you can refer to their config file to enable it. Not sure it will sort things out, but let's keep trying, folks.
# redis.conf
bind 0.0.0.0
protected-mode no
requirepass REDACTED
maxmemory 6gb
# https://redis.io/docs/manual/eviction/
maxmemory-policy noeviction
port 6379
tcp-backlog 511
timeout 0
tcp-keepalive 300
daemonize no
supervised no
pidfile /var/run/redis.pid
loglevel notice
logfile ""
databases 16
always-show-logo no
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /bitnami/redis/data
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
# appendfsync no
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
aof-use-rdb-preamble no
aof-rewrite-incremental-fsync yes
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-size -2
list-compress-depth 0
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
# docker-compose.yml
redis:
image: bitnami/redis:7.2.5
container_name: redis
environment:
- REDIS_DISABLE_COMMANDS=CONFIG
# The password will come from the config file, but we need to bypass the validation
- ALLOW_EMPTY_PASSWORD=yes
ports:
- 6379:6379
# command: /opt/bitnami/scripts/redis/run.sh --maxmemory 2gb
command: /opt/bitnami/scripts/redis/run.sh
volumes:
- ./redis/redis.conf:/opt/bitnami/redis/mounted-etc/redis.conf
- redis:/bitnami/redis/data
restart: always
healthcheck:
test:
- CMD
- redis-cli
- ping
interval: 5s
timeout: 30s
retries: 10
Seeing this issue on 2.9.1 as well, also only with sensors.
We've found that the DAG is timing out trying to fill up the Dagbag on the worker. Even with debug logs enabled I don't have a hint about where in the import it's hanging.
[2024-05-31 18:00:01,335: INFO/ForkPoolWorker-63] Filling up the DagBag from <redacted dag file path>
[2024-05-31 18:00:01,350: DEBUG/ForkPoolWorker-63] Importing <redacted dag file path>
[2024-05-31 18:00:31,415: ERROR/ForkPoolWorker-63] Process timed out, PID: 314
On the scheduler the DAG imports in less than a second.
And not all the tasks from this DAG fail to import; many import just fine, at the same time, on the same Celery worker. Below is the same DAG file as above, importing fine:
[2024-05-31 18:01:52,911: INFO/ForkPoolWorker-3] Filling up the DagBag from <redacted dag file path>
[2024-05-31 18:01:52,913: DEBUG/ForkPoolWorker-3] Importing <redacted dag file path>
[2024-05-31 18:01:54,232: WARNING/ForkPoolWorker-3] /usr/local/lib/python3.11/site-packages/airflow/models/baseoperator.py:484: RemovedInAirflow3Warning: The 'task_concurrency' parameter is deprecated. Please use 'max_active_tis_per_dag'.
result = func(self, **kwargs, default_args=default_args)
[2024-05-31 18:01:54,272: DEBUG/ForkPoolWorker-3] Loaded DAG <DAG: redacted dag>
One caveat/note: it looks like it's the 2nd run/retry of each sensor that runs just fine.
We've also confirmed this behavior was not present on Airflow 2.7.3, and only started occurring since upgrading to 2.9.1.
@andreyvital thank you so much for your response. I have set it up and it works really great :)
I was working on the issue with @seanmuth, and increasing the parsing timeout solved it. It does not fix the root cause, but as a workaround it can save your night...
AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT = 120
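To check whether worker-side parsing is really the slow part (and whether 120 seconds is enough), here is a rough sketch that times DagBag loading for a single file; run it on the worker and compare with the scheduler. The file path is a placeholder:
# measure_parse_time.py (the DAG file path is a placeholder)
import time
from airflow.models.dagbag import DagBag

start = time.monotonic()
bag = DagBag(dag_folder="/path/to/suspect_dag.py", include_examples=False)
elapsed = time.monotonic() - start

# If this gets anywhere near core.dagbag_import_timeout, the worker will hit
# "Process timed out" while filling the DagBag, as in the logs above.
print(f"parsed {len(bag.dags)} DAG(s) in {elapsed:.2f}s")
print("import errors:", bag.import_errors)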
Hello everyone,
I'm currently investigating this issue, but I haven't been able to replicate it yet. Could you please try setting AIRFLOW__CORE__EXECUTE_TASKS_NEW_PYTHON_INTERPRETER=True [1] to see if we can generate more error logs? It seems that _execute_in_subprocess generates more error logs compared to _execute_in_fork, which might provide us with some additional clues.
https://github.com/apache/airflow/blob/2d53c1089f78d8d1416f51af60e1e0354781c661/airflow/providers/celery/executors/celery_executor_utils.py#L187-L188
[1] https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#execute-tasks-new-python-interpreter
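For anyone wondering why the subprocess path surfaces more output, here is a simplified illustration (not the provider's actual code) of the difference between forking the task command and running it through subprocess.check_output:
# fork_vs_subprocess.py (simplified illustration, not the provider's code)
import os
import subprocess

def run_in_fork(cmd: list[str]) -> int:
    # The parent only learns the child's exit status; whatever the child wrote
    # to stderr ends up in the worker's own stdio, not in the failure message.
    pid = os.fork()
    if pid == 0:  # child
        os.execvp(cmd[0], cmd)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)

def run_in_subprocess(cmd: list[str]) -> str:
    # check_output captures stdout and stderr, so a non-zero exit raises
    # CalledProcessError carrying the command's output, which can be logged.
    return subprocess.check_output(cmd, stderr=subprocess.STDOUT, text=True)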
Spotted same problem with Airflow 2.9.1 - problem didn't occur earlier so it's strictly related with this version. It happens randomly on random task execution. Restarting scheduler and triggerer helps - but this is our temp workaround.
> Spotted same problem with Airflow 2.9.1 - problem didn't occur earlier so it's strictly related with this version. It happens randomly on random task execution. Restarting scheduler and triggerer helps - but this is our temp workaround.
We've released apache-airflow-providers-celery 3.7.2 with enhanced logging. Could you please update the provider version and check the debug log for any clues? Additionally, what I mentioned in https://github.com/apache/airflow/issues/39717#issuecomment-2148697763 might give us some clue as well. Thanks!
Following... and adding some spice.
We have just upgraded to Airflow 2.9.2 and also have the (same) issue. We had already seen the problem in Airflow 2.8: in our case the Celery task failed with airflow.exceptions.AirflowException: Celery command failed on host: slautop02 with celery_task_id 5d7f577d-3e89-4867-8481-24df778346ae (PID: 815333, Return Code: 256), but the Airflow tasks did not fail.
After reading this issue I also caught this in the scheduler logs:
[2024-06-20T17:45:58.167+0100] {processor.py:161} INFO - Started process (PID=830424) to work on /home/ttauto/airflow/dags/ITSM_DAILY_INFORM_CLOSE_TASKS.py
[2024-06-20T17:45:58.169+0100] {processor.py:830} INFO - Processing file /home/ttauto/airflow/dags/ITSM_DAILY_INFORM_CLOSE_TASKS.py for tasks to queue
[2024-06-20T17:45:58.170+0100] {logging_mixin.py:188} INFO - [2024-06-20T17:45:58.170+0100] {dagbag.py:545} INFO - Filling up the DagBag from /home/ttauto/airflow/dags/ITSM_DAILY_INFORM_CLOSE_TASKS.py
[2024-06-20T17:46:28.174+0100] {logging_mixin.py:188} INFO - [2024-06-20T17:46:28.173+0100] {timeout.py:68} ERROR - Process timed out, PID: 830424
Despite these timeouts appearing on several DAGs, we see no errors in the Airflow UI nor on the Airflow tasks. We also cannot match the PID in these logs with the PID mentioned in the Celery tasks (PID XXX, return code 256).
We are experiencing Celery task failures with the following stack trace:
Traceback (most recent call last):
File "/opt/tkapp/env_airflow/lib64/python3.9/site-packages/celery/app/trace.py", line 453, in trace_task
R = retval = fun(*args, **kwargs)
File "/opt/tkapp/env_airflow/lib64/python3.9/site-packages/celery/app/trace.py", line 736, in __protected_call__
return self.run(*args, **kwargs)
File "/opt/tkapp/env_airflow/lib64/python3.9/site-packages/airflow/providers/celery/executors/celery_executor_utils.py", line 136, in execute_command
_execute_in_fork(command_to_exec, celery_task_id)
File "/opt/tkapp/env_airflow/lib64/python3.9/site-packages/airflow/providers/celery/executors/celery_executor_utils.py", line 151, in _execute_in_fork
raise AirflowException(msg)
airflow.exceptions.AirflowException: Celery command failed on host: slautop02 with celery_task_id 5d7f577d-3e89-4867-8481-24df778346ae (PID: 815333, Return Code: 256)
Most of the time this does not raise any issues and the DAG tasks complete successfully: even if the Celery task is marked as failed, the Airflow task completes successfully. Today, however, we had a DAG failure on the very first task (an EmptyOperator) with the exact same problem in the Celery task. So the problem is now a real issue for us.
*** Found local files:
*** * /opt/tkapp/airflow/logs/dag_id=CSDISPATCHER_SIMPLES/run_id=scheduled__2024-06-20T16:34:00+00:00/task_id=Start/attempt=1.log.SchedulerJob.log
[2024-06-20, 17:39:30 WEST] {scheduler_job_runner.py:843} ERROR - Executor reports task instance <TaskInstance: CSDISPATCHER_SIMPLES.Start scheduled__2024-06-20T16:34:00+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?
[2024-06-20, 17:50:47 WEST] {event_scheduler.py:40} WARNING - Marking task instance <TaskInstance: CSDISPATCHER_SIMPLES.Start scheduled__2024-06-20T16:34:00+00:00 [queued]> stuck in queued as failed. If the task instance has available retries, it will be retried.
[2024-06-20, 17:50:48 WEST] {scheduler_job_runner.py:843} ERROR - Executor reports task instance <TaskInstance: CSDISPATCHER_SIMPLES.Start scheduled__2024-06-20T16:34:00+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?
We have investigated the (Return Code: 256) but without success; the "best" explanation would be memory contention on the server, but we do not observe that either.
Our server status shows no resource exhaustion.
version in use:
apache-airflow==2.9.2
apache-airflow-providers-celery==3.7.2
apache-airflow-providers-common-io==1.3.2
apache-airflow-providers-common-sql==1.14.0
apache-airflow-providers-fab==1.1.1
apache-airflow-providers-ftp==3.9.1
apache-airflow-providers-hashicorp==3.7.1
apache-airflow-providers-http==4.11.1
apache-airflow-providers-imap==3.6.1
apache-airflow-providers-postgres==5.11.1
apache-airflow-providers-sftp==4.10.1
apache-airflow-providers-smtp==1.7.1
apache-airflow-providers-sqlite==3.8.1
apache-airflow-providers-ssh==3.11.1
We have just changed the AIRFLOW__CORE__EXECUTE_TASKS_NEW_PYTHON_INTERPRETER=True to try to get some more info.
Can you try to set https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#schedule-after-task-execution to False and see if it helps @trlopes1974 ?
I see the same issue, with dynamic task mapping in multiple instances of a DAG. All the pods have enough CPU and memory.
Executor: CeleryKubernetes
Airflow version: 2.9.1
Redis persistence: enabled
DAG: Dynamic task group with dynamic tasks; multiple instances of the DAG may run at a time
When I re-run the failed tasks with this error, they go through and finish successfully.
@vizeit and anyone looking here and tempted to report "I have the same issue": PLEASE, before doing it, upgrade to 2.9.2 and the latest Celery provider. And when you do, report here whether things are fixed, and if not, add logs from the Celery executor.
If you actually look at the discussion, some of the related issues were fixed in 2.9.2, and Celery logging has been improved in the latest provider to add more information. So the best thing you can do is not to post "I have the same issue" but to upgrade and let us know if it helped; the second best thing is to upgrade the Celery provider and post relevant logs.
Just posting "I have the same issue in 2.9.1" does not move the needle when it comes to investigating and fixing such a problem.
Sure, I can upgrade and check. I believe others here already tested on 2.9.2 reporting the same issue
> We have just upgraded to Airflow 2.9.2 and also have the (same) issue. [...] We have just changed the AIRFLOW__CORE__EXECUTE_TASKS_NEW_PYTHON_INTERPRETER=True to try to get some more info.
Does setting the log level to debug help? We might be able to get the log here.
> Sure, I can upgrade and check. I believe others here already tested on 2.9.2 reporting the same issue
Sometimes similar issues are not the same issues, and upgrading to the latest version of Airflow and checking there saves a lot of effort for the volunteers who help investigate, in case the issue has already been solved; so this is the least effort you can do to help with it.
Not to mention that the latest versions (including 2.9.2) have the latest fixes (including security fixes), so it's in your best interest to upgrade.
Some more info:
We did not set schedule_after_task_execution=False, as we had altered the setting execute_tasks_new_python_interpreter = True and wanted to see if it helped.
We had this info for dag run CS00007002_Correcao_Dados_Oracle_WO0000000808061_2024-06-27T16:19:17.211167+01:00.
Error in Celery (Flower):
Traceback (most recent call last):
File "/opt/tkapp/env_airflow/lib64/python3.9/site-packages/airflow/providers/celery/executors/celery_executor_utils.py", line 192, in _execute_in_subprocess
subprocess.check_output(command_to_exec, stderr=subprocess.STDOUT, close_fds=True, env=env)
File "/usr/lib64/python3.9/subprocess.py", line 424, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib64/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['airflow', 'tasks', 'run', 'CS00007002_Correcao_Dados_Oracle', 'remote_actions.ssh_command_remove_operator', 'CS00007002_Correcao_Dados_Oracle_WO0000000808061_2024-06-27T16:19:17.211167+01:00', '--local', '--subdir', 'DAGS_FOLDER/CS00007002_Correcao_Dados_Oracle.py']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/tkapp/env_airflow/lib64/python3.9/site-packages/celery/app/trace.py", line 453, in trace_task
R = retval = fun(*args, **kwargs)
File "/opt/tkapp/env_airflow/lib64/python3.9/site-packages/celery/app/trace.py", line 736, in __protected_call__
return self.run(*args, **kwargs)
File "/opt/tkapp/env_airflow/lib64/python3.9/site-packages/airflow/providers/celery/executors/celery_executor_utils.py", line 134, in execute_command
_execute_in_subprocess(command_to_exec, celery_task_id)
File "/opt/tkapp/env_airflow/lib64/python3.9/site-packages/airflow/providers/celery/executors/celery_executor_utils.py", line 197, in _execute_in_subprocess
raise AirflowException(msg)
airflow.exceptions.AirflowException: Celery command failed on host: slautop02 with celery_task_id d5489483-fbfc-4943-868d-f058c8c0d8d0
DAG status is OK, no failure, and all tasks completed successfully.
The corresponding Airflow Task log: remote_actions.ssh_command_remove_operator
Task Instance: remote_actions.ssh_command_remove_operator at 2024-06-27, 16:19:17
*** Found local files:
*** * /opt/tkapp/airflow/logs/dag_id=CS00007002_Correcao_Dados_Oracle/run_id=CS00007002_Correcao_Dados_Oracle_WO0000000808061_2024-06-27T16:19:17.211167+01:00/task_id=remote_actions.ssh_command_remove_operator/attempt=1.log
[2024-06-27, 16:21:07 WEST] {local_task_job_runner.py:120} ▼ Pre task execution logs
[2024-06-27, 16:21:07 WEST] {taskinstance.py:2076} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: CS00007002_Correcao_Dados_Oracle.remote_actions.ssh_command_remove_operator CS00007002_Correcao_Dados_Oracle_WO0000000808061_2024-06-27T16:19:17.211167+01:00 [queued]>
[2024-06-27, 16:21:07 WEST] {taskinstance.py:2076} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: CS00007002_Correcao_Dados_Oracle.remote_actions.ssh_command_remove_operator CS00007002_Correcao_Dados_Oracle_WO0000000808061_2024-06-27T16:19:17.211167+01:00 [queued]>
[2024-06-27, 16:21:07 WEST] {taskinstance.py:2306} INFO - Starting attempt 1 of 1
[2024-06-27, 16:21:07 WEST] {taskinstance.py:2330} INFO - Executing <Task(SSHOperator): remote_actions.ssh_command_remove_operator> on 2024-06-27 15:19:17.213772+00:00
[2024-06-27, 16:21:07 WEST] {standard_task_runner.py:63} INFO - Started process 2545260 to run task
[2024-06-27, 16:21:07 WEST] {standard_task_runner.py:90} INFO - Running: ['airflow', 'tasks', 'run', 'CS00007002_Correcao_Dados_Oracle', 'remote_actions.ssh_command_remove_operator', 'CS00007002_Correcao_Dados_Oracle_WO0000000808061_2024-06-27T16:19:17.211167+01:00', '--job-id', '369573', '--raw', '--subdir', 'DAGS_FOLDER/CS00007002_Correcao_Dados_Oracle.py', '--cfg-path', '/tmp/tmpsp8huiqu']
[2024-06-27, 16:21:07 WEST] {standard_task_runner.py:91} INFO - Job 369573: Subtask remote_actions.ssh_command_remove_operator
[2024-06-27, 16:21:07 WEST] {task_command.py:426} INFO - Running <TaskInstance: CS00007002_Correcao_Dados_Oracle.remote_actions.ssh_command_remove_operator CS00007002_Correcao_Dados_Oracle_WO0000000808061_2024-06-27T16:19:17.211167+01:00 [running]> on host SERVERNAME
[2024-06-27, 16:21:07 WEST] {taskinstance.py:2648} INFO - Exporting env vars: AIRFLOW_CTX_DAG_EMAIL='TESTEMAIL@EMAILTEST>' AIRFLOW_CTX_DAG_OWNER='ttauto' AIRFLOW_CTX_DAG_ID='CS00007002_Correcao_Dados_Oracle' AIRFLOW_CTX_TASK_ID='remote_actions.ssh_command_remove_operator' AIRFLOW_CTX_EXECUTION_DATE='2024-06-27T15:19:17.213772+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='CS00007002_Correcao_Dados_Oracle_WO0000000808061_2024-06-27T16:19:17.211167+01:00'
[2024-06-27, 16:21:07 WEST] {taskinstance.py:430} ▲▲▲ Log group end
[2024-06-27, 16:21:07 WEST] {ssh.py:151} INFO - Creating ssh_client
[2024-06-27, 16:21:07 WEST] {ssh.py:302} WARNING - No Host Key Verification. This won't protect against Man-In-The-Middle attacks
[2024-06-27, 16:21:07 WEST] {transport.py:1909} INFO - Connected (version 2.0, client OpenSSH_8.0)
[2024-06-27, 16:21:07 WEST] {transport.py:1909} INFO - Auth banner: b"################################### AVISO #########################################\n\nOs sistemas internos TARGET SYSTEM so' devem ser usados para realizar atividades de \nnegocio do TARGET SYSTEM ou outros fins autorizados pela Direcao do TARGET SYSTEM \xe2\x80\x93 \nDepartamento de Sistemas de Informacao (DSI)\n\n===============================================================================\n= O DIREITO DE ACESSO A ESTE SISTEMA E' RESERVADO !! =\n===============================================================================\n= =\n= Este sistema deve ser utilizado apenas em actividades de negocio =\n= autorizadas pela Gestao do TARGET SYSTEM . =\n= =\n===============================================================================\n= Este sistema esta' sujeito a auditorias efectuadas a qualquer momento. =\n===============================================================================\n"
[2024-06-27, 16:21:07 WEST] {transport.py:1909} INFO - Authentication (publickey) successful!
[2024-06-27, 16:21:07 WEST] {ssh.py:483} INFO - Running command: sudo -s -- eval 'su - SOMEUSER -c "/home/SOMEUSER/correcao_dados/correcao_dados.sh WO0000000808061 REMOVE"'
[2024-06-27, 16:21:07 WEST] {ssh.py:529} INFO - a remover /tmp/correcao_dados/WO0000000808061
[2024-06-27, 16:21:07 WEST] {ssh.py:529} INFO - ACCAO 'REMOVE' EXECUTADA PARA A WORKORDERID:'WO0000000808061'
[2024-06-27, 16:21:07 WEST] {taskinstance.py:441} ▼ Post task execution logs
[2024-06-27, 16:21:08 WEST] {taskinstance.py:1206} INFO - Marking task as SUCCESS. dag_id=CS00007002_Correcao_Dados_Oracle, task_id=remote_actions.ssh_command_remove_operator, run_id=CS00007002_Correcao_Dados_Oracle_WO0000000808061_2024-06-27T16:19:17.211167+01:00, execution_date=20240627T151917, start_date=20240627T152107, end_date=20240627T152108
[2024-06-27, 16:21:08 WEST] {local_task_job_runner.py:240} INFO - Task exited with return code 0
[2024-06-27, 16:21:08 WEST] {local_task_job_runner.py:222} ▲▲▲ Log group end
details:
In the same dag run another celery task also has a failed state with the exact same error...
Traceback (most recent call last):
File "/opt/tkapp/env_airflow/lib64/python3.9/site-packages/airflow/providers/celery/executors/celery_executor_utils.py", line 192, in _execute_in_subprocess
subprocess.check_output(command_to_exec, stderr=subprocess.STDOUT, close_fds=True, env=env)
File "/usr/lib64/python3.9/subprocess.py", line 424, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib64/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['airflow', 'tasks', 'run', 'CS00007002_Correcao_Dados_Oracle', 'remote_actions.get_files_from_sftp', 'CS00007002_Correcao_Dados_Oracle_WO0000000808061_2024-06-27T16:19:17.211167+01:00', '--local', '--subdir', 'DAGS_FOLDER/CS00007002_Correcao_Dados_Oracle.py']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/tkapp/env_airflow/lib64/python3.9/site-packages/celery/app/trace.py", line 453, in trace_task
R = retval = fun(*args, **kwargs)
File "/opt/tkapp/env_airflow/lib64/python3.9/site-packages/celery/app/trace.py", line 736, in __protected_call__
return self.run(*args, **kwargs)
File "/opt/tkapp/env_airflow/lib64/python3.9/site-packages/airflow/providers/celery/executors/celery_executor_utils.py", line 134, in execute_command
_execute_in_subprocess(command_to_exec, celery_task_id)
File "/opt/tkapp/env_airflow/lib64/python3.9/site-packages/airflow/providers/celery/executors/celery_executor_utils.py", line 197, in _execute_in_subprocess
raise AirflowException(msg)
airflow.exceptions.AirflowException: Celery command failed on host: slautop02 with celery_task_id d97fac72-aed7-49a5-8fe0-b4696d418431
We are observing something interesting, which might be a coincidence or not, but all these errors seem to be related to direct usage of SSHOperator or SFTPOperator. We have other DAGs where the SSH/SFTP work happens inside a PythonOperator, and we see no errors on those tasks (in Celery; remember that Airflow is not complaining about these).
Anything else I should look for?
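For context, a minimal sketch of the "SSH work wrapped in a Python task" pattern described above, assuming the provider's SSHHook; the connection id and command are placeholders:
# wrapped_ssh_task.py (hypothetical; conn id and command are placeholders)
from airflow.decorators import task
from airflow.providers.ssh.hooks.ssh import SSHHook

@task
def run_remote_cleanup() -> str:
    # Same SSH work an SSHOperator would do, but executed inside the task's
    # own Python callable instead of as a dedicated operator.
    hook = SSHHook(ssh_conn_id="my_ssh_conn")
    with hook.get_conn() as client:
        _, stdout, _ = client.exec_command("echo cleanup done")
        return stdout.read().decode()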
> Hello everyone,
>
> I'm currently investigating this issue, but I haven't been able to replicate it yet. Could you please try setting AIRFLOW__CORE__EXECUTE_TASKS_NEW_PYTHON_INTERPRETER=True [1] to see if we can generate more error logs? It seems that _execute_in_subprocess generates more error logs compared to _execute_in_fork, which might provide us with some additional clues.
>
> https://github.com/apache/airflow/blob/2d53c1089f78d8d1416f51af60e1e0354781c661/airflow/providers/celery/executors/celery_executor_utils.py#L187-L188
>
> [1] https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#execute-tasks-new-python-interpreter
Man, you saved my day. I don't know how, but your recommendation to set AIRFLOW__CORE__EXECUTE_TASKS_NEW_PYTHON_INTERPRETER to True fixed my problem. I am running Airflow locally on my Apple M3 Pro machine. I noticed that when I use boto3.client(...) in one of the tasks within my DAG, even a very simple PythonOperator that just prints to the console throws the error:
[2024-06-30T15:16:06.022+0200] {scheduler_job_runner.py:843} ERROR - Executor reports task instance <TaskInstance: my_dag.get_params manual__2024-06-30T13:15:42+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?
I am using LocalExecutor with Postgres. That boto3.client code is used in a downstream task, but the upstream task, which is a very simple print statement, failed with that error. I had no meaningful logs, and ChatGPT started to hate me for my questions. Then I started searching the web and landed on your comment. After applying your suggestion I don't have that issue anymore. Thank you!
This is driving me nuts...
Airflow Task Status: Success
CS00007002_Correcao_Dados_Oracle | 2024-07-01T10:54:20.981538+00:00 | remote_actions.ssh_command_remove_operator | success | 2024-07-01T10:55:56.350026+00:00 | 2024-07-01T10:55:57.277247+00:00
Celery Task Status: Failure
How can the celery task be marked as failed but the airflow task has a success status?
> How can the celery task be marked as failed but the airflow task has a success status?
There are a few things that happen in the Celery process AFTER the task is marked as successful. One of them is controlled by https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#schedule-after-task-execution, which is the "mini-scheduling" that happens in this process. So one way to see if this is the cause is to disable it. Another: there was a lock problem detected and fixed in one of the most recent versions of Airflow that was starting to happen more for dynamically mapped tasks, so upgrading Airflow might help as well.
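If it is unclear which value is actually in effect (env var vs airflow.cfg), here is a quick way to check from within the Airflow environment; a minimal sketch using the standard configuration API:
# check_mini_scheduler.py
from airflow.configuration import conf

# True means the worker runs the "mini scheduler" pass right after the task
# finishes; setting [scheduler] schedule_after_task_execution = False skips it.
print(conf.getboolean("scheduler", "schedule_after_task_execution"))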
@potiuk I'll try this on our dev environment. We are already on Airflow 2.9.2...
We do see a few errors of this kind too, with an Airflow v2.9.2 in Kubernetes + Celery workers + Redis OSS 7.0.7 (AWS Elasticache).
> We do see a few errors of this kind too, with an Airflow v2.9.2 in Kubernetes + Celery workers + Redis OSS 7.0.7 (AWS Elasticache).
Does it help if you disable "schedule after task execution"?
> Does it help if you disable "schedule after task execution"?
Unfortunately, in our case we rely on the feature for some DAGs with many sequential tasks, and the tradeoff would not be welcomed by our IT teams (schedule_after_task_execution was a v1 -> v2 migration seller, on top of the security incentive, for our IT teams :) ).
> Unfortunately, in our case we rely on the feature for some DAGs with many sequential tasks, and the tradeoff would not be welcomed by our IT teams (schedule_after_task_execution was a v1 -> v2 migration seller, on top of the security incentive, for our IT teams :) ).
Any particularities/findings/correlated logs and events that happen around the failures then? Just knowing it happens does not bring us any closer to diagnosing it.
