
Malformed scheduled tasks cause stuck rollback

Open ejholmes opened this issue 8 years ago • 1 comments

This needs further investigation with a consistent repro, but it looks like a malformed scheduled task can cause the stack to enter a ROLLBACK state, which causes some Custom::ECSService UPDATE requests to fail with:

Failed to update resource. ClientException: TaskDefinition is inactive status code: 400, request id: 48b61c90-f179-11e6-ab2f-657723f95720

which leaves the stack in the dreaded failed rollback state. https://github.com/remind101/empire/pull/1040 worked as a temporary fix to allow the stack to continue the rollback.

ejholmes avatar Feb 13 '17 20:02 ejholmes

Ok, updating with my findings so far. At the moment, this appears to be a bug/race in how CloudFormation handles cancellation of custom resources. It isn't directly related to malformed scheduled tasks, just to successive rollbacks.

I've put together a timeline here from a reproduction of the bug. Basically, I perform 4 steps, and the problem happens in the third step on the emails process:

  1. Deploy a Procfile with a malformed scheduled job. This updates the stack fine.
  2. Scale up the malformed scheduled job. As expected, this only updates the ECS service/task definition for the scheduled job process and rolls back properly when it encounters the malformed trigger. At this point, the task definition in use is gCWylPD6gCA.
  3. Deploy the same ref again. Here's where the problem happens. If a TaskDefinition update is "cancelled", it enters an "UPDATE_FAILED" state. However, when the stack begins rolling back, CloudFormation tries to roll back the task definition for some reason (it never updated, so why?). This causes a new task definition to be created (new physical resource id), but the service is never actually updated with this new task definition. The stack successfully rolls back. At this point emailsTD is eliQ3lSP7mJ, but the properties for emailsService still specify gCWylPD6gCA as the task definition.
  4. Deploy the same ref again. Now CloudFormation attempts to create a new task definition eY97vimi28m and update the emailsService with it, but the bad trigger still causes a rollback. CloudFormation starts rolling back, and because it still thinks gCWylPD6gCA is the active task definition, it tries to roll back to that task definition, which was deleted in step 3 as part of the rollback.

So in a nutshell, the state between emailsService and emailsTD gets out of sync, causing CloudFormation to get confused.
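
To make the drift concrete, here is a toy model of steps 3 and 4. It is only a sketch of the bookkeeping described in the timeline, not CloudFormation's real state machine: the `StackModel` class and its method names are hypothetical, and the only values taken from the reproduction are the task definition ids and the error text.

```python
from dataclasses import dataclass, field


@dataclass
class StackModel:
    """Hypothetical model of emailsTD and emailsService bookkeeping."""

    td_physical_id: str  # physical resource id recorded for emailsTD
    service_td: str      # task definition recorded in emailsService's properties
    active: set = field(default_factory=set)  # task definitions that still exist

    def rollback_cancelled_td_update(self, new_id: str) -> None:
        # Step 3: rolling back the cancelled TD update mints a new physical
        # id and removes the old one. The service's properties are untouched.
        self.active.discard(self.td_physical_id)
        self.active.add(new_id)
        self.td_physical_id = new_id

    def rollback_service(self) -> str:
        # Step 4: the service rolls back to whatever its properties recorded.
        if self.service_td not in self.active:
            raise RuntimeError("TaskDefinition is inactive")
        return self.service_td


# State after step 2: both resources agree on gCWylPD6gCA.
stack = StackModel("gCWylPD6gCA", "gCWylPD6gCA", {"gCWylPD6gCA"})

# Step 3: the TD gets a new physical id, the service does not follow.
stack.rollback_cancelled_td_update("eliQ3lSP7mJ")
in_sync = stack.td_physical_id == stack.service_td

# Step 4: rolling the service back targets a task definition that is gone.
error = None
try:
    stack.rollback_service()
except RuntimeError as exc:
    error = str(exc)
```

In this model `in_sync` ends up false after step 3, and the step 4 rollback raises because the service still points at the id that step 3 removed.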

I'll work on getting an isolated test case that reproduces the issue consistently.

I think if the postfix that we add to task definitions were consistent, it would solve the problem, since the task definition rollback in step 3 wouldn't produce a new physical resource id and the state wouldn't get out of sync. That said, I feel like that shouldn't be necessary.
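
A rough sketch of what a consistent postfix could look like, assuming it can be derived from the task definition's properties instead of being generated fresh on each update. The function name, the property keys, and the 11-character length are all hypothetical and are not taken from Empire's code.

```python
import hashlib
import json


def task_definition_suffix(properties: dict, length: int = 11) -> str:
    """Derive a stable suffix from the task definition properties (hypothetical helper)."""
    # Serialize with sorted keys so key order doesn't change the result.
    canonical = json.dumps(properties, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return digest[:length]


# Hypothetical properties for the emails process.
props = {"Family": "emails", "Command": ["bundle", "exec", "emails"]}
reordered = {"Command": ["bundle", "exec", "emails"], "Family": "emails"}
changed = {"Family": "emails", "Command": ["bundle", "exec", "emails", "-v"]}

same_suffix = task_definition_suffix(props) == task_definition_suffix(reordered)
different_suffix = task_definition_suffix(props) != task_definition_suffix(changed)
```

With a scheme like this, rolling back to unchanged properties would map to the same suffix, so the rollback in step 3 would not mint a new physical resource id. Whether that fits how the custom resource names task definitions would still need checking.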

ejholmes avatar Feb 25 '17 06:02 ejholmes