EcsRunTaskOperator does not send logs if task times out
Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==8.7.1
Apache Airflow version
2.7.2
Operating System
Amazon Linux AMI
Deployment
Amazon (AWS) MWAA
Deployment details
We are deploying MWAA using CDK. DAGs are located in S3.
What happened
We are running several Airflow tasks that start ECS containers and wait for their completion before starting the following tasks. I noticed that the ECS logs only appear in the Airflow task log once the task completes. However, if a task runs into a timeout (AirflowTaskTimeout) in Airflow, no task logs are created in CloudWatch at all.
I have been in contact with AWS Enterprise Support, but they say the issue lies in the operator and not in MWAA.
What you think should happen instead
ECS task logs are written to CloudWatch even if the Airflow task fails. Even better: task logs are written continuously to CloudWatch.
How to reproduce
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

CONTAINER_NAME = "ecs-airflow-container"  # placeholder: name of the container in the task definition

DEFAULT_ARGS = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2024, 1, 1),
}

with DAG(
    "sample",
    default_args=DEFAULT_ARGS,
    description="A test DAG to try out functionality",
    schedule=None,
) as dag:
    EcsRunTaskOperator(
        task_id="ecs_operator",
        dag=dag,
        execution_timeout=timedelta(minutes=2),
        retries=3,
        aws_conn_id="aws_default",
        cluster="ecs-airflow-cluster",
        task_definition="ecs-airflow-task:1",
        launch_type="FARGATE",
        overrides={
            "containerOverrides": [
                {
                    "name": CONTAINER_NAME,
                    # Print a line every 10 seconds for ~5 minutes so the 2-minute
                    # execution_timeout is guaranteed to fire mid-run.
                    "command": ["python", "-c", "import time\nfor i in range(30): print(i); time.sleep(10)"],
                },
            ]
        },
        network_configuration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-xxxxx", "subnet-xxx"],
                "securityGroups": ["sg-xxxx"],
            }
        },
        awslogs_group="ecs-airflow",
        awslogs_region="eu-central-1",
        awslogs_stream_prefix=f"ecs/{CONTAINER_NAME}",
        awslogs_fetch_interval=timedelta(seconds=5),
    )
Anything else
Logs example:
ip-xxxxx.eu-central-1.compute.internal
*** Reading remote log from Cloudwatch log_group: airflow-my_env_name-Task log_stream: dag_id=my_dag_id/run_id=scheduled__2024-04-02T05_30_00+00_00/task_id=mytaskid/attempt=1.log.
[2024-04-03, 07:30:19 CEST] {{taskinstance.py:1159}} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: my_dag_id.mytaskid scheduled__2024-04-02T05:30:00+00:00 [queued]>
[2024-04-03, 07:30:20 CEST] {{taskinstance.py:1159}} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: my_dag_id.mytaskid scheduled__2024-04-02T05:30:00+00:00 [queued]>
[2024-04-03, 07:30:20 CEST] {{taskinstance.py:1361}} INFO - Starting attempt 1 of 2
[2024-04-03, 07:30:20 CEST] {{taskinstance.py:1382}} INFO - Executing <Task(EcsRunTaskOperator): mytaskid> on 2024-04-02 05:30:00+00:00
[2024-04-03, 07:30:20 CEST] {{standard_task_runner.py:57}} INFO - Started process 24629 to run task
[2024-04-03, 07:30:20 CEST] {{standard_task_runner.py:84}} INFO - Running: ['airflow', 'tasks', 'run', 'my_dag_id', 'mytaskid', 'scheduled__2024-04-02T05:30:00+00:00', '--job-id', '3107', '--raw', '--subdir', 'DAGS_FOLDER/my_dag_file.py', '--cfg-path', '/tmp/tmpw6wsfwwu']
[2024-04-03, 07:30:20 CEST] {{standard_task_runner.py:85}} INFO - Job 3107: Subtask mytaskid
[2024-04-03, 07:30:20 CEST] {{task_command.py:416}} INFO - Running <TaskInstance: my_dag_id.mytaskid scheduled__2024-04-02T05:30:00+00:00 [running]> on host ip-10-123-136-202.eu-central-1.compute.internal
[2024-04-03, 07:30:21 CEST] {{taskinstance.py:1662}} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='di' AIRFLOW_CTX_DAG_ID='my_dag_id' AIRFLOW_CTX_TASK_ID='mytaskid' AIRFLOW_CTX_EXECUTION_DATE='2024-04-02T05:30:00+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2024-04-02T05:30:00+00:00'
[2024-04-03, 07:30:21 CEST] {{ecs.py:531}} INFO - Running ECS Task - Task definition: my_task_definition_name - on cluster my-cluster-name
[2024-04-03, 07:30:21 CEST] {{ecs.py:534}} INFO - EcsOperator overrides: {'containerOverrides': []}
[2024-04-03, 07:30:22 CEST] {{base.py:73}} INFO - Using connection ID 'my_aws_account_connection_id' for task execution.
[2024-04-03, 07:30:24 CEST] {{ecs.py:644}} INFO - ECS Task started: {'tasks': [{'attachments': [{'id': 'eab65079-d61e-4c16-9984-cec51e01cae4', 'type': 'ElasticNetworkInterface', 'status': 'PRECREATED', 'details': [{'name': 'subnetId', 'value': 'subnet-12345'}]}], 'attributes': [{'name': 'ecs.cpu-architecture', 'value': 'x86_64'}], 'availabilityZone': 'eu-central-1b', 'clusterArn': 'arn:aws:ecs:eu-central-1:123456789:cluster/my-cluster-name', 'containers': [{'containerArn': 'arn:aws:ecs:eu-central-1:123456789:container/my-cluster-name/d6907e6e16c54ff0ba6bc477ef4f5c6e/10c5ac9b-9623-418b-afd7-10d2d0b762f6', 'taskArn': 'arn:aws:ecs:eu-central-1:123456789:task/my-cluster-name/d6907e6e16c54ff0ba6bc477ef4f5c6e', 'name': 'my_task_definition_name', 'image': '123456789.dkr.ecr.eu-central-1.amazonaws.com/my_task_definition_name:0.0.2', 'lastStatus': 'PENDING', 'networkInterfaces': [], 'cpu': '0'}], 'cpu': '512', 'createdAt': datetime.datetime(2024, 4, 3, 5, 30, 24, 868000, tzinfo=tzlocal()), 'desiredStatus': 'RUNNING', 'enableExecuteCommand': False, 'group': 'family:my_task_definition_name', 'lastStatus': 'PROVISIONING', 'launchType': 'FARGATE', 'memory': '1024', 'overrides': {'containerOverrides': [{'name': 'my_task_definition_name'}], 'inferenceAcceleratorOverrides': []}, 'platformVersion': '1.4.0', 'platformFamily': 'Linux', 'startedBy': 'di', 'tags': [], 'taskArn': 'arn:aws:ecs:eu-central-1:123456789:task/my-cluster-name/d6907e6e16c54ff0ba6bc477ef4f5c6e', 'taskDefinitionArn': 'arn:aws:ecs:eu-central-1:123456789:task-definition/my_task_definition_name:2', 'version': 1, 'ephemeralStorage': {'sizeInGiB': 20}}], 'failures': [], 'ResponseMetadata': {'RequestId': '02377efa-1c03-443e-aad8-c6d0d6157f06', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '02377efa-1c03-443e-aad8-c6d0d6157f06', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1692', 'date': 'Wed, 03 Apr 2024 05:30:24 GMT'}, 'RetryAttempts': 0}}
[2024-04-03, 07:30:24 CEST] {{ecs.py:647}} INFO - ECS task ID is: d6907e6e16c54ff0ba6bc477ef4f5c6e
[2024-04-03, 07:30:25 CEST] {{ecs.py:573}} INFO - Starting ECS Task Log Fetcher
[2024-04-03, 07:30:55 CEST] {{base.py:73}} INFO - Using connection ID 'my_aws_account_connection_id' for task execution.
[2024-04-03, 07:50:21 CEST] {{timeout.py:68}} ERROR - Process timed out, PID: 24629
[2024-04-03, 07:50:21 CEST] {{ecs.py:757}} INFO - {'task': {'attachments': [{'id': 'eab65079-d61e-4c16-9984-cec51e01cae4', 'type': 'ElasticNetworkInterface', 'status': 'ATTACHED', 'details': [{'name': 'subnetId', 'value': 'subnet-12345'}, {'name': 'networkInterfaceId', 'value': 'eni-08793a79328aa8a69'}, {'name': 'macAddress', 'value': '06:08:01:5a:69:fb'}, {'name': 'privateDnsName', 'value': 'ip-10-123-41-14.eu-central-1.compute.internal'}, {'name': 'privateIPv4Address', 'value': 'xxxxx'}]}], 'attributes': [{'name': 'ecs.cpu-architecture', 'value': 'x86_64'}], 'availabilityZone': 'eu-central-1b', 'clusterArn': 'arn:aws:ecs:eu-central-1:123456789:cluster/my-cluster-name', 'connectivity': 'CONNECTED', 'connectivityAt': datetime.datetime(2024, 4, 3, 5, 30, 28, 546000, tzinfo=tzlocal()), 'containers': [{'containerArn': 'arn:aws:ecs:eu-central-1:123456789:container/my-cluster-name/d6907e6e16c54ff0ba6bc477ef4f5c6e/10c5ac9b-9623-418b-afd7-10d2d0b762f6', 'taskArn': 'arn:aws:ecs:eu-central-1:123456789:task/my-cluster-name/d6907e6e16c54ff0ba6bc477ef4f5c6e', 'name': 'my_task_definition_name', 'image': '123456789.dkr.ecr.eu-central-1.amazonaws.com/my_task_definition_name:0.0.2', 'imageDigest': 'sha256:9d8af306540177ee5faef1dea1f2c72c3e16a02aedcb0c85fdd42cf96f8eb7d9', 'runtimeId': 'd6907e6e16c54ff0ba6bc477ef4f5c6e-873919162', 'lastStatus': 'RUNNING', 'networkBindings': [], 'networkInterfaces': [{'attachmentId': 'eab65079-d61e-4c16-9984-cec51e01cae4', 'privateIpv4Address': 'xxxxx'}], 'cpu': '0'}], 'cpu': '512', 'createdAt': datetime.datetime(2024, 4, 3, 5, 30, 24, 868000, tzinfo=tzlocal()), 'desiredStatus': 'STOPPED', 'enableExecuteCommand': False, 'group': 'family:my_task_definition_name', 'lastStatus': 'RUNNING', 'launchType': 'FARGATE', 'memory': '1024', 'overrides': {'containerOverrides': [{'name': 'my_task_definition_name'}], 'inferenceAcceleratorOverrides': []}, 'platformVersion': '1.4.0', 'platformFamily': 'Linux', 'pullStartedAt': datetime.datetime(2024, 4, 3, 5, 30, 37, 967000, tzinfo=tzlocal()), 'pullStoppedAt': datetime.datetime(2024, 4, 3, 5, 31, 9, 202000, tzinfo=tzlocal()), 'startedAt': datetime.datetime(2024, 4, 3, 5, 31, 9, 595000, tzinfo=tzlocal()), 'startedBy': 'di', 'stopCode': 'UserInitiated', 'stoppedReason': 'Task killed by the user', 'stoppingAt': datetime.datetime(2024, 4, 3, 5, 50, 21, 287000, tzinfo=tzlocal()), 'tags': [], 'taskArn': 'arn:aws:ecs:eu-central-1:123456789:task/my-cluster-name/d6907e6e16c54ff0ba6bc477ef4f5c6e', 'taskDefinitionArn': 'arn:aws:ecs:eu-central-1:123456789:task-definition/my_task_definition_name:2', 'version': 4, 'ephemeralStorage': {'sizeInGiB': 20}}, 'ResponseMetadata': {'RequestId': '67522c9e-82c3-45e9-b0b1-2c71a9ce1fc0', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '67522c9e-82c3-45e9-b0b1-2c71a9ce1fc0', 'content-type': 'application/x-amz-json-1.1', 'content-length': '2470', 'date': 'Wed, 03 Apr 2024 05:50:20 GMT'}, 'RetryAttempts': 0}}
[2024-04-03, 07:50:21 CEST] {{taskinstance.py:1937}} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 578, in execute
self._wait_for_task_ended()
File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 669, in _wait_for_task_ended
waiter.wait(
File "/usr/local/airflow/.local/lib/python3.11/site-packages/botocore/waiter.py", line 55, in wait
Waiter.wait(self, **kwargs)
File "/usr/local/airflow/.local/lib/python3.11/site-packages/botocore/waiter.py", line 393, in wait
time.sleep(sleep_amount)
File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/utils/timeout.py", line 69, in handle_timeout
raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: Timeout, PID: 24629
[2024-04-03, 07:50:21 CEST] {{taskinstance.py:1400}} INFO - Marking task as UP_FOR_RETRY. dag_id=my_dag_id, task_id=mytaskid, execution_date=20240402T053000, start_date=20240403T053019, end_date=20240403T055021
[2024-04-03, 07:50:21 CEST] {{standard_task_runner.py:104}} ERROR - Failed to execute job 3107 for task mytaskid (Timeout, PID: 24629; 24629)
[2024-04-03, 07:50:21 CEST] {{local_task_job_runner.py:228}} INFO - Task exited with return code 1
[2024-04-03, 07:50:21 CEST] {{taskinstance.py:2778}} INFO - 0 downstream tasks scheduled from follow-on schedule check
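To make the expectation above concrete: the traceback shows the timeout firing while the operator is blocked in waiter.wait() inside _wait_for_task_ended(), after the log fetcher was started. The sketch below is purely illustrative (class and function names are made up, not the provider's actual code); it shows the kind of final flush I would expect, so that whatever the container already printed still ends up in the task log even when AirflowTaskTimeout interrupts the wait.

```python
import threading


class IllustrativeLogFetcher(threading.Thread):
    """Stand-in for the background thread that re-logs CloudWatch events.
    Purely a sketch: names and behaviour are illustrative, not the provider's code."""

    def __init__(self, fetch_interval: float = 30.0):
        super().__init__(daemon=True)
        self.fetch_interval = fetch_interval
        self._stop_event = threading.Event()

    def run(self) -> None:
        # Wake up every fetch_interval and re-log new CloudWatch events;
        # after being told to stop, poll one last time so nothing is lost.
        while not self._stop_event.wait(self.fetch_interval):
            self._emit_new_events()
        self._emit_new_events()

    def _emit_new_events(self) -> None:
        # A real implementation would call CloudWatch Logs here and forward
        # each event to the Airflow task logger.
        print("fetching and re-logging new CloudWatch events ...")

    def stop(self) -> None:
        self._stop_event.set()


def wait_with_final_flush(wait_for_task_ended, fetcher: IllustrativeLogFetcher) -> None:
    """Block until the ECS task ends, but drain the fetcher even if the wait
    is interrupted by AirflowTaskTimeout, so already-produced logs survive."""
    try:
        wait_for_task_ended()  # in the operator this is where waiter.wait(...) blocks
    finally:
        fetcher.stop()
        fetcher.join(timeout=10)
```

In our run nothing at all was emitted between "Starting ECS Task Log Fetcher" and the timeout, so there also seems to be no periodic emission while the task is running.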
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
May I handle this issue?
Feel free.
Hi @slycyberguy , are you still handling this issue? I am also interested in it, in case you need any help.
Hi. Yeah, I'm working on it. I should be done in a few more days. I'll let you know if anything changes. Thanks for offering.
Hi @mateuslatrova. It looks like I'm getting overwhelmed at work with a lot of things that require high priority at the moment. I think I might be free to work on it next week, but I don't want this sitting around for too long. So if you still want, you can take over the issue. I'm unassigning myself.
Hi @slycyberguy !
Right now I am focused on another issue. If I finish it before anyone takes this one over, I will let everyone know that I will work on it.
Thanks for letting me know!
Hi @anneadb, in your logs example you are reading logs from the log group airflow-my_env_name-Task, but in your DAG you specify ecs-airflow as the log group. Did you check whether you can find logs in ecs-airflow?
Hi @vincbeck, the original ECS logs were empty as well. I believe that Airflow fetches the logs from the other log group at some point and integrates them into the task log.
For example the successful ECS log looks like this:
2024-05-31T06:07:01.798Z | Package Version
2024-05-31T06:07:01.799Z | ------------------- -----------
2024-05-31T06:07:01.800Z | aiobotocore 2.11.2
2024-05-31T06:07:01.801Z | aiohttp 3.9.5
2024-05-31T06:07:01.803Z | aioitertools 0.11.0
2024-05-31T06:07:01.804Z | aiosignal 1.3.1
And then the successful task log looks like this:
2024-05-31T06:06:13.197Z [2024-05-31T06:06:13.197+0000] {{ecs.py:647}} INFO - ECS task ID is: e4bf16d0600b5cccaed7912ed8444c9c
2024-05-31T06:06:13.230Z [2024-05-31T06:06:13.230+0000] {{ecs.py:573}} INFO - Starting ECS Task Log Fetcher
2024-05-31T06:06:43.271Z [2024-05-31T06:06:43.271+0000] {{base.py:73}} INFO - Using connection ID 'aws_12345678910' for task execution.
2024-05-31T06:07:14.155Z [2024-05-31T06:07:14.155+0000] {{task_log_fetcher.py:63}} INFO - [2024-05-31 06:07:01,798] Package Version
2024-05-31T06:07:14.155Z [2024-05-31T06:07:14.155+0000] {{task_log_fetcher.py:63}} INFO - [2024-05-31 06:07:01,799] ------------------- -----------
2024-05-31T06:07:14.155Z [2024-05-31T06:07:14.155+0000] {{task_log_fetcher.py:63}} INFO - [2024-05-31 06:07:01,800] aiobotocore 2.11.2
2024-05-31T06:07:14.155Z [2024-05-31T06:07:14.155+0000] {{task_log_fetcher.py:63}} INFO - [2024-05-31 06:07:01,801] aiohttp 3.9.5
2024-05-31T06:07:14.155Z [2024-05-31T06:07:14.155+0000] {{task_log_fetcher.py:63}} INFO - [2024-05-31 06:07:01,803] aioitertools 0.11.0
2024-05-31T06:07:14.155Z [2024-05-31T06:07:14.155+0000] {{task_log_fetcher.py:63}} INFO - [2024-05-31 06:07:01,804] aiosignal 1.3.1
I see. So even with a successful task, the logs are only pushed to CloudWatch once the task is done.
In both examples (successful task and timed-out task) there is always a delay after the two lines below (a rough sketch of that polling behaviour follows the quoted lines). I guess this is the issue; I'll take a look.
2024-05-31T06:06:13.230Z [2024-05-31T06:06:13.230+0000] {{ecs.py:573}} INFO - Starting ECS Task Log Fetcher
2024-05-31T06:06:43.271Z [2024-05-31T06:06:43.271+0000] {{base.py:73}} INFO - Using connection ID 'aws_12345678910' for task execution.
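To illustrate the delay I mean: a fetcher that polls on a fixed interval only surfaces events at its next wake-up, so anything the container prints between two polls shows up late, and the very first events cannot appear before the first interval has elapsed. Rough standalone sketch with boto3 (not the provider's code; names are illustrative):

```python
import time

import boto3


def tail_stream(group: str, stream: str, fetch_interval: float = 30.0) -> None:
    """Re-log CloudWatch events the way a fixed-interval poller would:
    sleep first, then pull everything that appeared since the last poll."""
    logs = boto3.client("logs")
    kwargs = {"logGroupName": group, "logStreamName": stream, "startFromHead": True}
    while True:
        # Events written by the container during this sleep only surface now.
        time.sleep(fetch_interval)
        resp = logs.get_log_events(**kwargs)
        for event in resp["events"]:
            print(event["timestamp"], event["message"])
        # Continue from where the previous poll stopped.
        kwargs["nextToken"] = resp["nextForwardToken"]


# e.g. tail_stream("ecs-airflow", "ecs/<container>/<ecs-task-id>", fetch_interval=5)
```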
Hey @anneadb, I cannot reproduce the issue. When I run the same DAG as the one in the description, I get the logs in real time in my CloudWatch log group. The task does time out, but I get the logs:
Mon Jun 3 21:13:30 UTC 2024
Mon Jun 3 21:13:40 UTC 2024
Mon Jun 3 21:13:50 UTC 2024
Mon Jun 3 21:14:00 UTC 2024
Mon Jun 3 21:14:10 UTC 2024
Mon Jun 3 21:14:20 UTC 2024
Mon Jun 3 21:14:30 UTC 2024
Mon Jun 3 21:14:40 UTC 2024
Mon Jun 3 21:14:50 UTC 2024
Mon Jun 3 21:15:00 UTC 2024
Mon Jun 3 21:15:10 UTC 2024
Mon Jun 3 21:15:20 UTC 2024
Mon Jun 3 21:15:30 UTC 2024
Mon Jun 3 21:15:40 UTC 2024
[2024-06-03, 21:20:21 UTC] {ecs.py:638} INFO - ECS Task started: {'tasks': [{'attachments': [{'id': '0fbe5b10-140f-4f0e-a784-d2926a536039', 'type': 'ElasticNetworkInterface', 'status': 'PRECREATED', 'details': [{'name': 'subnetId', 'value': 'subnet-03fe35179d25a0036'}]}], 'attributes': [{'name': 'ecs.cpu-architecture', 'value': 'x86_64'}], 'availabilityZone': 'us-east-1b', 'clusterArn': 'arn:aws:ecs:us-east-1:722404908466:cluster/SysTestCluster_example_ecs', 'containerInstanceArn': 'arn:aws:ecs:us-east-1:722404908466:container-instance/SysTestCluster_example_ecs/2c5a500f118248869ab01a84f9ece973', 'containers': [{'containerArn': 'arn:aws:ecs:us-east-1:722404908466:container/SysTestCluster_example_ecs/b392893f198842678e63d2258de21bd8/40baa9f8-fc4f-4709-b7f2-d48275d18e0f', 'taskArn': 'arn:aws:ecs:us-east-1:722404908466:task/SysTestCluster_example_ecs/b392893f198842678e63d2258de21bd8', 'name': 'enve15c32d4-container', 'image': 'ubuntu', 'lastStatus': 'PENDING', 'networkInterfaces': [], 'cpu': '0'}], 'cpu': '256', 'createdAt': datetime.datetime(2024, 6, 3, 21, 20, 21, 47000, tzinfo=tzlocal()), 'desiredStatus': 'RUNNING', 'enableExecuteCommand': False, 'group': 'family:enve15c32d4-task-definition', 'lastStatus': 'PROVISIONING', 'launchType': 'EC2', 'memory': '512', 'overrides': {'containerOverrides': [{'name': 'enve15c32d4-container', 'command': ['for i in $(seq 1 10); do date; sleep 10; done']}], 'inferenceAcceleratorOverrides': []}, 'startedBy': '***', 'tags': [], 'taskArn': 'arn:aws:ecs:us-east-1:722404908466:task/SysTestCluster_example_ecs/b392893f198842678e63d2258de21bd8', 'taskDefinitionArn': 'arn:aws:ecs:us-east-1:722404908466:task-definition/enve15c32d4-task-definition:1', 'version': 1}], 'failures': [], 'ResponseMetadata': {'RequestId': '094dc144-7408-4e5c-b64c-1171f80064ed', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '094dc144-7408-4e5c-b64c-1171f80064ed', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1550', 'date': 'Mon, 03 Jun 2024 21:20:20 GMT'}, 'RetryAttempts': 0}}
[2024-06-03, 21:20:21 UTC] {ecs.py:641} INFO - ECS task ID is: b392893f198842678e63d2258de21bd8
[2024-06-03, 21:20:21 UTC] {ecs.py:564} INFO - Starting ECS Task Log Fetcher
[2024-06-03, 21:20:26 UTC] {base.py:84} INFO - Using connection ID 'aws_default' for task execution.
[2024-06-03, 21:20:26 UTC] {credentials.py:1278} INFO - Found credentials in shared credentials file: ~/.aws/credentials
[2024-06-03, 21:20:26 UTC] {task_log_fetcher.py:76} INFO - Cannot find log stream yet, it can take a couple of seconds to show up. If this error persists, check that the log group and stream are correct: group: /ecs_test/enve15c32d4 stream: ecs/enve15c32d4-container/b392893f198842678e63d2258de21bd8
[2024-06-03, 21:20:32 UTC] {task_log_fetcher.py:76} INFO - Cannot find log stream yet, it can take a couple of seconds to show up. If this error persists, check that the log group and stream are correct: group: /ecs_test/enve15c32d4 stream: ecs/enve15c32d4-container/b392893f198842678e63d2258de21bd8
[2024-06-03, 21:20:52 UTC] {task_log_fetcher.py:63} INFO - [2024-06-03 21:20:44,439] Mon Jun 3 21:20:44 UTC 2024
[2024-06-03, 21:21:03 UTC] {task_log_fetcher.py:63} INFO - [2024-06-03 21:20:54,441] Mon Jun 3 21:20:54 UTC 2024
[2024-06-03, 21:21:13 UTC] {task_log_fetcher.py:63} INFO - [2024-06-03 21:21:04,443] Mon Jun 3 21:21:04 UTC 2024
[2024-06-03, 21:21:24 UTC] {task_log_fetcher.py:63} INFO - [2024-06-03 21:21:14,445] Mon Jun 3 21:21:14 UTC 2024
[2024-06-03, 21:21:29 UTC] {task_log_fetcher.py:63} INFO - [2024-06-03 21:21:24,447] Mon Jun 3 21:21:24 UTC 2024
[2024-06-03, 21:21:40 UTC] {task_log_fetcher.py:63} INFO - [2024-06-03 21:21:34,449] Mon Jun 3 21:21:34 UTC 2024
[2024-06-03, 21:21:50 UTC] {task_log_fetcher.py:63} INFO - [2024-06-03 21:21:44,451] Mon Jun 3 21:21:44 UTC 2024
[2024-06-03, 21:22:01 UTC] {task_log_fetcher.py:63} INFO - [2024-06-03 21:21:54,453] Mon Jun 3 21:21:54 UTC 2024
[2024-06-03, 21:22:11 UTC] {task_log_fetcher.py:63} INFO - [2024-06-03 21:22:04,455] Mon Jun 3 21:22:04 UTC 2024
[2024-06-03, 21:22:20 UTC] {timeout.py:68} ERROR - Process timed out, PID: 2131
[2024-06-03, 21:22:20 UTC] {taskinstance.py:741} INFO - ::group::Post task execution logs
[2024-06-03, 21:22:20 UTC] {ecs.py:751} INFO - {'task': {'attachments': [], 'availabilityZone': 'us-east-1b', 'clusterArn': 'arn:aws:ecs:us-east-1:722404908466:cluster/SysTestCluster_example_ecs', 'connectivity': 'CONNECTED', 'connectivityAt': datetime.datetime(2024, 6, 3, 21, 20, 21, 47000, tzinfo=tzlocal()), 'containerInstanceArn': 'arn:aws:ecs:us-east-1:722404908466:container-instance/SysTestCluster_example_ecs/2c5a500f118248869ab01a84f9ece973', 'containers': [], 'cpu': '256', 'createdAt': datetime.datetime(2024, 6, 3, 21, 20, 21, 47000, tzinfo=tzlocal()), 'desiredStatus': 'STOPPED', 'enableExecuteCommand': False, 'executionStoppedAt': datetime.datetime(2024, 6, 3, 21, 22, 14, 528000, tzinfo=tzlocal()), 'group': 'family:enve15c32d4-task-definition', 'lastStatus': 'DEPROVISIONING', 'launchType': 'EC2', 'memory': '512', 'overrides': {'containerOverrides': [], 'inferenceAcceleratorOverrides': []}, 'pullStartedAt': datetime.datetime(2024, 6, 3, 21, 20, 34, 432000, tzinfo=tzlocal()), 'pullStoppedAt': datetime.datetime(2024, 6, 3, 21, 20, 34, 579000, tzinfo=tzlocal()), 'startedAt': datetime.datetime(2024, 6, 3, 21, 20, 34, 471000, tzinfo=tzlocal()), 'startedBy': '***', 'stopCode': 'EssentialContainerExited', 'stoppedReason': 'Essential container in task exited', 'stoppingAt': datetime.datetime(2024, 6, 3, 21, 22, 15, tzinfo=tzlocal()), 'tags': [], 'taskArn': 'arn:aws:ecs:us-east-1:722404908466:task/SysTestCluster_example_ecs/b392893f198842678e63d2258de21bd8', 'taskDefinitionArn': 'arn:aws:ecs:us-east-1:722404908466:task-definition/enve15c32d4-task-definition:1', 'version': 4}, 'ResponseMetadata': {'RequestId': 'd60743a6-6116-4b48-99e7-df188833aab5', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'd60743a6-6116-4b48-99e7-df188833aab5', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1148', 'date': 'Mon, 03 Jun 2024 21:22:19 GMT'}, 'RetryAttempts': 0}}
[2024-06-03, 21:22:20 UTC] {taskinstance.py:3137} ERROR - Task failed with exception
Traceback (most recent call last):
File "/opt/airflow/airflow/models/taskinstance.py", line 760, in _execute_task
result = _execute_callable(context=context, **execute_callable_kwargs)
File "/opt/airflow/airflow/models/taskinstance.py", line 728, in _execute_callable
return ExecutionCallableRunner(
File "/opt/airflow/airflow/utils/operator_helpers.py", line 250, in run
return self.func(*args, **kwargs)
File "/opt/airflow/airflow/models/baseoperator.py", line 406, in wrapper
return func(self, *args, **kwargs)
File "/opt/airflow/airflow/providers/amazon/aws/operators/ecs.py", line 569, in execute
self._wait_for_task_ended()
File "/opt/airflow/airflow/providers/amazon/aws/operators/ecs.py", line 663, in _wait_for_task_ended
waiter.wait(
File "/usr/local/lib/python3.8/site-packages/botocore/waiter.py", line 55, in wait
Waiter.wait(self, **kwargs)
File "/usr/local/lib/python3.8/site-packages/botocore/waiter.py", line 393, in wait
time.sleep(sleep_amount)
File "/opt/airflow/airflow/utils/timeout.py", line 69, in handle_timeout
raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: Timeout, PID: 2131
Hi @vincbeck , thank you very much for checking. Now I really don't know why we do not get the logs.
Could you share your logging configuration? Maybe I can find a difference there.
Ours looks like this:
| Section | Key | Value |
| --- | --- | --- |
| logging | base_log_folder | /usr/local/airflow/logs |
| logging | remote_logging | True |
| logging | remote_log_conn_id | aws_default |
| logging | delete_local_logs | False |
| logging | google_key_path | |
| logging | remote_base_log_folder | cloudwatch://arn:aws:logs:::log-group:airflow-logs:* |
| logging | remote_task_handler_kwargs | < hidden > |
| logging | encrypt_s3_logs | False |
| logging | logging_level | INFO |
| logging | celery_logging_level | |
| logging | fab_logging_level | WARNING |
| logging | logging_config_class | log_config.LOGGING_CONFIG |
| logging | colored_console_log | True |
| logging | colored_log_format | [%(blue)s%(asctime)s%(reset)s] {{%(blue)s%(filename)s:%(reset)s%(lineno)d}} %(log_color)s%(levelname)s%(reset)s - %(log_color)s%(message)s%(reset)s |
| logging | colored_formatter_class | airflow.utils.log.colored_log.CustomTTYColoredFormatter |
| logging | log_format | [%(asctime)s] {{%(filename)s:%(lineno)d}} %(levelname)s - %(message)s |
| logging | simple_log_format | %(asctime)s %(levelname)s - %(message)s |
| logging | dag_processor_log_target | file |
| logging | dag_processor_log_format | [%(asctime)s] [SOURCE:DAG_PROCESSOR] {%(filename)s:%(lineno)d} %(levelname)s - %(message)s |
| logging | log_formatter_class | airflow.utils.log.timezone_aware.TimezoneAware |
| logging | secret_mask_adapter | |
| logging | task_log_prefix_template | |
| logging | log_filename_template | dag_id={{ ti.dag_id }}/run_id={{ ti.run_id }}/task_id={{ ti.task_id }}/{% if ti.map_index >= 0 %}map_index={{ ti.map_index }}/{% endif %}attempt={{ try_number }}.log |
| logging | log_processor_filename_template | {{ filename }}.log |
| logging | dag_processor_manager_log_location | /usr/local/airflow/logs/dag_processor_manager/dag_processor_manager.log |
| logging | task_log_reader | task |
| logging | extra_logger_names | |
| logging | worker_log_server_port | 8793 |
| logging | trigger_log_server_port | 8794 |
| logging | file_task_handler_new_folder_permissions | 0o775 |
| logging | file_task_handler_new_file_permissions | 0o664 |
| logging | celery_stdout_stderr_separation | False |
I am using Breeze, so I guess I have the default configuration. But it should not even matter, because I am looking at the logs inside the log group I specified through the awslogs_group parameter; the operator reads from that log group regardless of the Airflow logging configuration. Maybe, somehow, it is a question of the Amazon provider package version. Could you try with the latest Amazon provider package version?
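In the meantime, a quick standalone way to double-check whether the container wrote anything at all into that group (just a boto3 sketch; the group name, region and prefix are taken from your DAG example):

```python
import boto3

logs = boto3.client("logs", region_name="eu-central-1")

# The ECS awslogs driver names streams "<prefix>/<container-name>/<ecs-task-id>".
streams = logs.describe_log_streams(
    logGroupName="ecs-airflow",
    logStreamNamePrefix="ecs/",
)["logStreams"]

for stream in streams:
    events = logs.get_log_events(
        logGroupName="ecs-airflow",
        logStreamName=stream["logStreamName"],
        startFromHead=True,
        limit=5,
    )["events"]
    print(stream["logStreamName"], "->", len(events), "events")
```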
Since we're currently on MWAA 2.7.2, according to https://docs.aws.amazon.com/mwaa/latest/userguide/connections-packages.html we are using apache-airflow-providers-amazon[aiobotocore]==8.7.1.
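(For completeness, this is how I would double-check the versions actually installed on the workers; just a sketch, run for instance from a throwaway PythonOperator:)

```python
from importlib.metadata import version

# Run on a worker to confirm which distributions MWAA actually installed.
print("apache-airflow:", version("apache-airflow"))
print("apache-airflow-providers-amazon:", version("apache-airflow-providers-amazon"))
```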
I found some hints that the issue may have to do with the "remote_logging" configuration, which is why I was curious to know what your config looks like.
I'm converting this to a discussion because, for the moment, it is not reproducible and needs more troubleshooting.