Malformed scheduled tasks cause stuck rollback
This needs further investigation with a consistent repro, but it looks like a malformed scheduled task can cause the stack to enter a ROLLBACK state, which causes some Custom::ECSService UPDATE requests to fail with:
Failed to update resource. ClientException: TaskDefinition is inactive status code: 400, request id: 48b61c90-f179-11e6-ab2f-657723f95720
which leaves the stack in the dreaded failed rollback state. https://github.com/remind101/empire/pull/1040 worked as a temporary fix to allow the stack to continue rollback.
OK, updating with my findings so far. At the moment, this appears to be a bug/race in how CloudFormation handles cancellation of custom resources. It isn't directly related to malformed scheduled tasks, just to successive rollbacks.
I've put together a timeline here from a reproduction of the bug. Basically, I perform 4 steps, and the problem happens in the third step on the emails process:
- Deploy a Procfile with a malformed scheduled job. This updates the stack fine.
- Scale up the malformed scheduled job. As expected, this only updates the ECS service/task definition for the scheduled job process and rolls back properly when it encounters the malformed trigger. At this point, the task definition in use is `gCWylPD6gCA`.
- Deploy the same ref again. Here's where the problem happens. If a TaskDefinition update is "cancelled", it enters an "UPDATE_FAILED" state. However, when the stack begins rolling back, CloudFormation tries to roll back the task definition for some reason (it never updated, so why?). This causes a new task definition to be created (new physical resource id), but the service is never actually updated with this new task definition. The stack successfully rolls back. At this point emailsTD is `eliQ3lSP7mJ`, but the properties for emailsService still specify `gCWylPD6gCA` as the task definition.
- Deploy the same ref again. Now CloudFormation attempts to create a new task definition `eY97vimi28m` and update `emailsService` with it, but the bad trigger still causes a rollback. CloudFormation starts rolling back, and because it still thinks `gCWylPD6gCA` is the active task definition, it tries to roll back to that task definition, which was deleted in step 3 as part of the rollback.
So in a nutshell, the state between emailsService and emailsTD gets out of sync, causing CloudFormation to get confused.
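To make the drift concrete, here is a toy model of steps 3 and 4. It is only a sketch of the behavior observed in the timeline above, not CloudFormation's actual logic; the `ToyStack` class and its method names are made up for illustration.

```python
class ToyStack:
    """Toy model of emailsTD and emailsService (hypothetical, for illustration only)."""

    def __init__(self, task_definition):
        self.active = {task_definition}    # task definitions that still exist
        self.emails_td = task_definition   # physical id recorded for emailsTD
        self.service_td = task_definition  # task definition recorded on emailsService

    def rollback_after_cancelled_update(self, new_td):
        """Step 3: the emailsTD update is cancelled, but rollback still replaces it."""
        old_td = self.emails_td
        self.active.add(new_td)        # rollback creates a new physical resource
        self.emails_td = new_td
        self.active.discard(old_td)    # the old task definition is deleted
        # emailsService is never updated, so service_td still points at old_td.

    def rollback_after_failed_update(self, new_td):
        """Step 4: the update fails on the bad trigger and the service is rolled back."""
        self.active.add(new_td)        # the update creates a new task definition
        self.active.discard(new_td)    # rollback removes it again
        if self.service_td not in self.active:
            raise RuntimeError("ClientException: TaskDefinition is inactive")


stack = ToyStack("gCWylPD6gCA")
stack.rollback_after_cancelled_update("eliQ3lSP7mJ")
try:
    stack.rollback_after_failed_update("eY97vimi28m")
except RuntimeError as err:
    print(err)
```

After step 3 the model has `emails_td` at `eliQ3lSP7mJ` while `service_td` is still `gCWylPD6gCA`, so the step 4 rollback targets a task definition that no longer exists.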
I'll work on getting an isolated test case that reproduces the issue consistently.
I think a consistent postfix on task definitions would solve the problem: the task definition rollback in step 3 wouldn't produce a new physical resource id, so state wouldn't get out of sync. That said, I feel like that shouldn't be necessary.
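One way a consistent postfix could work is to derive it from the task definition's properties, so the same properties always map to the same physical resource id. This is a minimal sketch under that assumption; the `task_definition_postfix` function, the hashing scheme, and the example properties are all hypothetical and are not how the postfix is generated today.

```python
import hashlib
import json


def task_definition_postfix(properties, length=11):
    """Return a deterministic postfix for a dict of task definition properties.

    Hypothetical helper: serializes the properties canonically (sorted keys)
    and keeps the first `length` hex characters of the SHA-256 digest.
    """
    canonical = json.dumps(properties, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return digest[:length]


# Made-up properties; key order does not affect the result.
first = task_definition_postfix({"Family": "emails", "Memory": 256})
second = task_definition_postfix({"Memory": 256, "Family": "emails"})
print(first == second)
```

With something like this, rolling back to unchanged properties would resolve to the existing id instead of minting a new one.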