amazon-ecs-deploy-task-definition icon indicating copy to clipboard operation
amazon-ecs-deploy-task-definition copied to clipboard

GitHub action should fail if deployment circuit breaker / rollback is triggered

Open spyoungtech opened this issue 3 years ago • 14 comments

Problem statement

Right now, this action only waits for service stabilization (without regard to how that happens or what deployment actually becomes stable)

With the new deployment circuit breaker feature, it's possible that a deployment will fail, but the service will eventually become stable when the rollback deployment occurs.

This means that workflows may continue on to subsequent steps, even though the service deployment actually failed

What is the bug/problem behavior?

The GitHub action reports success in the case that the service stabilizes after a rollback is triggered by the circuit breaker

This is present in aws-actions/amazon-ecs-deploy-task-definition@v1

What is the desired/correct behavior

The GitHub action should be considered a failure if the deployment circuit breaker is triggered

Steps to reproduce

  1. Create an ECS service with deployment circuit breaker with rollbacks enabled
  2. Deploy a change with this action that's expected to fail and trigger the circuit breaker
  3. Observe that the action job is successful, despite the deployment failing.

Notes

Possible implementation details

Perhaps the action could, upon stabilization, describe the service and see that the expected task revision is contained in the active deployment.

Workarounds

A possible workaround is to add an additional step in your GitHub actions workflow to check the describe-services API response to see if the circuit breaker was triggered on the deployment and the expected taskDefinition revision (using the task definition ARN output) is present in the active deployment.

spyoungtech avatar Apr 05 '21 17:04 spyoungtech

How is the deployment circuit breaker enabled? I do not see the option being submitted to ecs.updateService. https://github.com/aws-actions/amazon-ecs-deploy-task-definition/blob/master/index.js#L26

CalebC-RW avatar Feb 07 '22 21:02 CalebC-RW

@CalebC-RW the circuit breaker is an ECS service feature. You can see it referenced in the link above or in the AWS CLI ECS update-service documentation for example.

spyoungtech avatar Feb 08 '22 06:02 spyoungtech

@spyoungtech, thanks for replying! I see that the circuit breaker feature can be enabled on the service (using the old ECS experience on the console or when creating the service). We have tried to enable it on the service and then run this action, but it seems to not be taking effect (failing deployments are being retried repeatedly and not being rolled back). Does it need to be enabled with the "updateService" call (during deployment) as well as be enabled for the service? If it does, this action does not seem to enable the flag when it calls updateService.

Or, am I completely misunderstanding this feature.

CalebC-RW avatar Feb 08 '22 14:02 CalebC-RW

I believe I have addressed my questions. The rollback process does seem to work (without specifying it on the update-service instruction), though it does take time. I have not determined how many attempts that the deployment makes before the circuit breaker kicks in and performs the rollback. My services are such that if the deployment fails, they will almost certainly fail after the first deployment attempt and consistently thereafter. After many tests, we have determined that setting the wait-time to a little longer than a successful deployment but shorter than a few failures lets the action fail when the deployment fails and succeed when it succeeds.

I wish that this action/deployment process let us more explicitly specify the number of deployment attempts before rollback and to let the action fail when a rollback occurs (as @spyoungtech originally requested).

CalebC-RW avatar Feb 17 '22 13:02 CalebC-RW

We built into our pipeline a post-deployment check that verifies that the new registered revision is the current task definition in ECS.

erikahansen avatar Mar 06 '23 11:03 erikahansen

Like @erikahansen pointed out, we also ended up adding an extra step doing a post deployment check to verify this. Here's how we did it:

...
- name: Deploy to ECS
  uses: aws-actions/amazon-ecs-deploy-task-definition@v1
  id: ecs-deploy
  with:
    task-definition: ${{ steps.task-def.outputs.task-definition }}
    service: ${{ matrix.service }}
    cluster: ${{ env.CLUSTER_NAME }}
- name: Check if deployment was successful
  id: check-deployment
  run: |
     CURRENT_TASK_DEF_ARN=$(aws ecs describe-services --cluster ${{ env.CLUSTER_NAME }} --services ${{ matrix.service }} --query services[0].deployments[0].taskDefinition | jq -r ".")
     NEW_TASK_DEF_ARN=${{ steps.ecs-deploy.outputs.task-definition-arn }}
     echo "Current task arn: $CURRENT_TASK_DEF_ARN"
     echo "New task arn: $NEW_TASK_DEF_ARN"
     if [ "$CURRENT_TASK_DEF_ARN" != "$NEW_TASK_DEF_ARN" ]; then
       echo "Deployment failed."
       exit 1
     fi
...

However, it would be good to have this action checking that already... I will try to find the time to work on it.

bmbferreira avatar Jul 09 '23 01:07 bmbferreira

+1 for this, please make this workaround part of the core functionality of this Action.

chrispetsos avatar Aug 03 '23 07:08 chrispetsos

+1 we need this feature too, currently failures in prod are hidden without this workaround

eagleeyec avatar Aug 03 '23 08:08 eagleeyec

+1 for this. I need this workaround to be a part of the core functionality.

lexabug avatar Aug 14 '23 07:08 lexabug

+1 critically important feature

jdjkelly avatar Apr 18 '24 00:04 jdjkelly

Very important to fix

healthbjk avatar Apr 18 '24 00:04 healthbjk

+1 really should be part of core functionality now

larry-flexpa avatar Apr 18 '24 01:04 larry-flexpa

+1 for this feature

raf-stx avatar Jun 10 '24 07:06 raf-stx

Hi, Is there any update on this?

vitalykarasik avatar Jun 27 '24 09:06 vitalykarasik