aws-airflow-stack

Long running tasks killed before completion

Open RafaelAMello opened this issue 5 years ago • 2 comments

Hi villasv,

First off, huge fan of your work.

I don't have many DAGs running and only need between 0 and 1 workers at any time, but when I set the default to 0, some workers seem to get killed before they can complete their current task.

Looking at the lambda logs I can see this:

[INFO]	2020-06-10T02:26:06.898Z	ee872f38-87da-41fc-b81e-2d2e9388f6b5	evaluating at [2020-06-10 02:15:00+00:00]

[INFO]	2020-06-10T02:26:07.71Z	ee872f38-87da-41fc-b81e-2d2e9388f6b5	ANOMV=0.0 NOER=29.0 GISI=1.0

[INFO]	2020-06-10T02:26:07.72Z	ee872f38-87da-41fc-b81e-2d2e9388f6b5	L=0.018054257581298416

So the load seems to drop below the low threshold while the task is still running, and the worker is killed before the task has a chance to complete.

RafaelAMello avatar Jun 10 '20 02:06 RafaelAMello

After a bit of research: it seems the issue is that the graceful shutdown lifecycle hook defined in the template gives the EC2 instance only a 3-minute countdown:

  GracefulShutdownLifecycleHook:
    Type: AWS::AutoScaling::LifecycleHook
    Properties:
      AutoScalingGroupName: !Ref AutoScalingGroup
      DefaultResult: CONTINUE
      HeartbeatTimeout: 180
      LifecycleTransition: autoscaling:EC2_INSTANCE_TERMINATING

A possible solution is to change that timeout so that it is as long as the longest task (not ideal), or potentially to run a Lambda function that records a heartbeat (see https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html#lifecycle-hook-considerations).
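To make the heartbeat idea concrete, here is a minimal sketch in Python, not the stack's actual helper daemon. It assumes a boto3 Auto Scaling client is passed in, and the function name `keep_instance_alive`, the `is_busy` callback (for example, a check for whether the Airflow worker still has a running task) and the 60-second interval are all hypothetical choices for illustration. The two client calls, `record_lifecycle_action_heartbeat` and `complete_lifecycle_action`, are the boto3 operations for extending and ending a lifecycle hook wait.

```python
import time
from typing import Callable


def keep_instance_alive(
    client,
    group_name: str,
    hook_name: str,
    instance_id: str,
    is_busy: Callable[[], bool],
    interval_seconds: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send heartbeats while is_busy() is true, then release the instance.

    `client` is expected to behave like boto3.client("autoscaling").
    `is_busy` is a hypothetical callback supplied by the caller.
    Returns the number of heartbeats sent.
    """
    beats = 0
    while is_busy():
        # Each heartbeat restarts the hook's HeartbeatTimeout countdown,
        # so the interval must stay below that timeout (180 s in the template).
        client.record_lifecycle_action_heartbeat(
            AutoScalingGroupName=group_name,
            LifecycleHookName=hook_name,
            InstanceId=instance_id,
        )
        beats += 1
        sleep(interval_seconds)

    # The work is done: let the terminating transition proceed.
    client.complete_lifecycle_action(
        AutoScalingGroupName=group_name,
        LifecycleHookName=hook_name,
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",
    )
    return beats


# Example wiring (assumes boto3 is installed and credentials are configured):
#   import boto3
#   client = boto3.client("autoscaling")
#   keep_instance_alive(client, "my-asg", "my-hook", "i-0123456789abcdef0",
#                       is_busy=lambda: False)
```

Note that heartbeats cannot extend the wait indefinitely: per the AWS documentation, an instance can stay in the wait state for at most 48 hours or 100 times the heartbeat timeout, whichever is smaller, so with a 180-second timeout the ceiling would be 5 hours.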

RafaelAMello avatar Jun 10 '20 03:06 RafaelAMello

Hm, I believe there's probably a template substitution bug in the lifecycle hook helper daemon, which would normally extend the timeout until the service is done. But you're right, if that service is not working, all you've got are those 3 minutes. I'm gonna go ahead and convert this issue to a bug report about the heartbeat service.

villasv avatar Jun 23 '20 21:06 villasv