[ECS] 503 errors occur during ECS_PRIMARY_ROLLOUT because it deletes the canary task set too early

Open t-kikuc opened this issue 2 years ago • 2 comments

What happened:

The ECS_PRIMARY_ROLLOUT stage deleted all old task sets, including the canary task set. As a result, the target group that was receiving traffic had no targets left to route to. In the Blue/Green case, requests then failed with 503 Service Temporarily Unavailable until the following ECS_TRAFFIC_ROUTING stage finished.

What you expected to happen:

The ECS_PRIMARY_ROLLOUT stage should delete only the old PRIMARY task sets and keep the canary task set alive.

We need to fix the code below: https://github.com/pipe-cd/pipecd/blob/301e3673f448b6a4d2e86921827b84c937d09002/pkg/app/piped/executor/ecs/ecs.go#L242-L249
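
The expected behavior can be sketched as a filter over the service's task sets. This is only an illustration, not the actual piped code: the `TaskSet` type, the `oldPrimaryTaskSets` function, and the idea of recognizing the canary by the target group it is attached to are all assumptions made for the example.

```go
package main

import "fmt"

// TaskSet is a hypothetical, simplified stand-in for an ECS task set.
// Only the fields needed for this sketch are modeled.
type TaskSet struct {
	ID             string
	TargetGroupArn string // target group the task set is registered with
}

// oldPrimaryTaskSets returns the task sets that could be deleted once the new
// primary task set exists: everything except the new primary itself and any
// task set attached to the canary target group (assumed way to spot the canary).
func oldPrimaryTaskSets(all []TaskSet, newPrimaryID, canaryTargetGroupArn string) []TaskSet {
	var out []TaskSet
	for _, ts := range all {
		if ts.ID == newPrimaryID {
			continue // never delete the task set that was just rolled out
		}
		if ts.TargetGroupArn == canaryTargetGroupArn {
			continue // keep the canary alive; it may still be receiving traffic
		}
		out = append(out, ts)
	}
	return out
}

func main() {
	all := []TaskSet{
		{ID: "ts-old-primary", TargetGroupArn: "tg-primary"},
		{ID: "ts-canary", TargetGroupArn: "tg-canary"},
		{ID: "ts-new-primary", TargetGroupArn: "tg-primary"},
	}
	for _, ts := range oldPrimaryTaskSets(all, "ts-new-primary", "tg-canary") {
		fmt.Println("delete:", ts.ID)
	}
}
```

With a filter like this, the canary task set would survive until ECS_CANARY_CLEAN removes it.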

How to reproduce it:

This happens whenever ECS_PRIMARY_ROLLOUT is used in an ECS deployment. (The 503 occurs in the Blue/Green case.)

Environment:

  • piped version: v0.45.4
  • control-plane version: v0.46.0-rc0-11-g301e367
  • Others: app.pipecd.yaml was as below (masked):
apiVersion: pipecd.dev/v1beta1
kind: ECSApp
spec:
  name: ecs-elb-bg-issue
  input:
    serviceDefinitionFile: servicedef.yaml
    taskDefinitionFile: taskdef.yaml
    targetGroups:
      primary:
        targetGroupArn: arn:aws:elasticloadbalancing:ap-northeast-1:<account-id>:targetgroup/t-kikuc-ecs-simple-target-group/326xxxxxxxxxxxxx
        containerName: web
        containerPort: 80
      canary:
        targetGroupArn: arn:aws:elasticloadbalancing:ap-northeast-1:<account-id>:targetgroup/t-kikuc-ecs-simple-target-group2/a0bxxxxxxxxxxxxx
        containerName: web
        containerPort: 80
  pipeline:
    stages:
      - name: ECS_CANARY_ROLLOUT
        with:
          scale: 100
      - name: WAIT_APPROVAL
      - name: ECS_TRAFFIC_ROUTING
        with:
          canary: 100
      - name: WAIT_APPROVAL
      - name: ECS_PRIMARY_ROLLOUT
      - name: WAIT_APPROVAL
      - name: ECS_TRAFFIC_ROUTING
        with:
          primary: 100
      - name: WAIT_APPROVAL
      - name: ECS_CANARY_CLEAN

Diagram

Current: [image: current]

Desired: [image: desired]

t-kikuc • Dec 08 '23 09:12

In the ECS_CANARY_CLEAN stage that runs after ECS_PRIMARY_ROLLOUT, deleting the canary task set failed because it had already been removed.

The logs in Control Plane:

Failed to clean CANARY task set : failed to delete ECS task set : operation error ECS: DeleteTaskSet, https response error StatusCode: 400, RequestID: , TaskSetNotFoundException: Unable to find task set with id on service ecs-bg-broken-service-1.

t-kikuc • Dec 08 '23 10:12

It would be easy to change ECS_PRIMARY_ROLLOUT so that it does not delete the canary task set, but it would be even easier to change it so that it does not delete any old task sets at all.

I wonder which is better...
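
To make the trade-off concrete, here is a rough sketch of what each option would delete during ECS_PRIMARY_ROLLOUT. The `cleanupPolicy` type, the role labels, and the function name are hypothetical and exist only for this comparison; they are not taken from the piped code.

```go
package main

import "fmt"

// cleanupPolicy is a hypothetical switch between the two options discussed.
type cleanupPolicy int

const (
	keepCanaryOnly cleanupPolicy = iota // delete old primaries, keep the canary
	keepAllOld                          // delete nothing during primary rollout
)

// role is a hypothetical label for what a task set is used for.
type role int

const (
	oldPrimary role = iota
	newPrimary
	canary
)

type taskSet struct {
	ID   string
	Role role
}

// deletedDuringPrimaryRollout lists the IDs that would be deleted by the
// primary rollout stage under the given policy.
func deletedDuringPrimaryRollout(p cleanupPolicy, sets []taskSet) []string {
	ids := []string{}
	if p == keepAllOld {
		return ids // simplest change: skip the deletion step entirely
	}
	for _, ts := range sets {
		if ts.Role == oldPrimary {
			ids = append(ids, ts.ID)
		}
	}
	return ids
}

func main() {
	sets := []taskSet{
		{ID: "ts-old", Role: oldPrimary},
		{ID: "ts-canary", Role: canary},
		{ID: "ts-new", Role: newPrimary},
	}
	fmt.Println(deletedDuringPrimaryRollout(keepCanaryOnly, sets))
	fmt.Println(deletedDuringPrimaryRollout(keepAllOld, sets))
}
```

Under either option the canary survives the primary rollout. With the second option the old primary task sets also remain, so some later step would presumably have to clean them up.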

t-kikuc • Dec 22 '23 03:12