aws-airflow-stack
Long-running tasks killed before completion
Hi vilasv,
First off - huge fan of your work.
I don't have many running DAGs and need only between 0 and 1 worker running at any time, but when I set the default to 0, some workers seem to get killed before they can complete their current task.
Looking at the lambda logs I can see this:
[INFO] 2020-06-10T02:26:06.898Z ee872f38-87da-41fc-b81e-2d2e9388f6b5 evaluating at [2020-06-10 02:15:00+00:00]
[INFO] 2020-06-10T02:26:07.71Z ee872f38-87da-41fc-b81e-2d2e9388f6b5 ANOMV=0.0 NOER=29.0 GISI=1.0
[INFO] 2020-06-10T02:26:07.72Z ee872f38-87da-41fc-b81e-2d2e9388f6b5 L=0.018054257581298416
So the load seems to drop below the low threshold while a task is still running, and the worker is killed before the task has a chance to complete.
Doing a bit of research: it seems the issue is that the graceful shutdown Lifecycle Hook gives the EC2 instance a 3-minute countdown, defined in the template:
GracefulShutdownLifecycleHook:
  Type: AWS::AutoScaling::LifecycleHook
  Properties:
    AutoScalingGroupName: !Ref AutoScalingGroup
    DefaultResult: CONTINUE
    HeartbeatTimeout: 180
    LifecycleTransition: autoscaling:EC2_INSTANCE_TERMINATING
A possible solution is to raise that timeout so it is at least as long as the longest task (not ideal), or to run a Lambda function that records a heartbeat to keep extending the hook; see https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html#lifecycle-hook-considerations.
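As a rough sketch of that second option (an illustration only, not this stack's actual code): whatever is notified of the pending termination could keep recording heartbeats while the worker is still busy, then let the termination proceed. The record_lifecycle_action_heartbeat and complete_lifecycle_action calls are real boto3 APIs, and the hook name comes from the template above; the ASG name and the worker_is_busy check are hypothetical placeholders:

import time

import boto3

autoscaling = boto3.client("autoscaling")

HOOK_NAME = "GracefulShutdownLifecycleHook"  # from the template above
ASG_NAME = "airflow-workers"                 # hypothetical ASG name

def worker_is_busy(instance_id):
    # Hypothetical placeholder: poll the Celery worker on the instance
    # (e.g. via the worker API or a custom CloudWatch metric) to check
    # whether it is still executing a task.
    return False

def handle_termination(instance_id):
    # Each heartbeat restarts the HeartbeatTimeout countdown (180 s in
    # the template), keeping the instance in Terminating:Wait until the
    # current task finishes.
    while worker_is_busy(instance_id):
        autoscaling.record_lifecycle_action_heartbeat(
            LifecycleHookName=HOOK_NAME,
            AutoScalingGroupName=ASG_NAME,
            InstanceId=instance_id,
        )
        time.sleep(60)  # comfortably under the 180 s timeout
    # The worker has drained; allow the termination to continue.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=HOOK_NAME,
        AutoScalingGroupName=ASG_NAME,
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",
    )

Per the linked considerations page, heartbeats can keep an instance in the wait state for at most 48 hours or 100 times the heartbeat timeout, whichever is smaller, so very long tasks still need a drain strategy.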
Hm, I believe there's probably a template substitution bug in the lifecycle hook helper daemon, which would normally extend the timeout until the service is done. But you're right, if that service is not working, all you've got are those 3 minutes. I'm gonna go ahead and convert this issue to a bug report about the heartbeat service.
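For reference, the pattern such a helper daemon implements is roughly the following (a sketch only, not the stack's actual daemon). It assumes the autoscaling/target-lifecycle-state instance metadata path, which an instance can poll to detect that the ASG has started terminating it; once triggered, it would run the same heartbeat-then-complete loop sketched above:

import time
import urllib.request

IMDS = "http://169.254.169.254/latest/meta-data"

def imds(path):
    # IMDSv1 for brevity; a hardened daemon would use IMDSv2 tokens.
    with urllib.request.urlopen(f"{IMDS}/{path}", timeout=2) as resp:
        return resp.read().decode()

def wait_for_termination_signal():
    # The target lifecycle state flips to "Terminated" when the ASG
    # begins terminating this instance; the lifecycle hook then holds
    # it in Terminating:Wait for HeartbeatTimeout seconds.
    while imds("autoscaling/target-lifecycle-state") != "Terminated":
        time.sleep(15)
    # From here, run handle_termination(imds("instance-id")) from the
    # sketch above to heartbeat until the local worker drains.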