[ECS] 503 errors occur during ECS_PRIMARY_ROLLOUT because it deletes the canary task set too early
What happened:
The ECS_PRIMARY_ROLLOUT stage deleted the old task sets including the canary task set.
-> The target group that was receiving traffic had no targets left to route to.
-> In a Blue/Green deployment, 503 Service Temporarily Unavailable was returned until the following ECS_TRAFFIC_ROUTING stage finished.
What you expected to happen:
The ECS_PRIMARY_ROLLOUT stage should delete only the old PRIMARY task sets and keep the canary task set alive.
The code that needs to be fixed: https://github.com/pipe-cd/pipecd/blob/301e3673f448b6a4d2e86921827b84c937d09002/pkg/app/piped/executor/ecs/ecs.go#L242-L249
How to reproduce it:
It happens whenever ECS_PRIMARY_ROLLOUT is used in an ECS deployment.
(The 503 occurs in the Blue/Green case.)
Environment:
- piped version: v0.45.4
- control-plane version: v0.46.0-rc0-11-g301e367
- Others:
app.pipecd.yaml was as below (masked):
apiVersion: pipecd.dev/v1beta1
kind: ECSApp
spec:
  name: ecs-elb-bg-issue
  input:
    serviceDefinitionFile: servicedef.yaml
    taskDefinitionFile: taskdef.yaml
    targetGroups:
      primary:
        targetGroupArn: arn:aws:elasticloadbalancing:ap-northeast-1:<account-id>:targetgroup/t-kikuc-ecs-simple-target-group/326xxxxxxxxxxxxx
        containerName: web
        containerPort: 80
      canary:
        targetGroupArn: arn:aws:elasticloadbalancing:ap-northeast-1:<account-id>:targetgroup/t-kikuc-ecs-simple-target-group2/a0bxxxxxxxxxxxxx
        containerName: web
        containerPort: 80
  pipeline:
    stages:
      - name: ECS_CANARY_ROLLOUT
        with:
          scale: 100
      - name: WAIT_APPROVAL
      - name: ECS_TRAFFIC_ROUTING
        with:
          canary: 100
      - name: WAIT_APPROVAL
      - name: ECS_PRIMARY_ROLLOUT
      - name: WAIT_APPROVAL
      - name: ECS_TRAFFIC_ROUTING
        with:
          primary: 100
      - name: WAIT_APPROVAL
      - name: ECS_CANARY_CLEAN
Diagram (current vs. desired behavior; images not available)
The ECS_CANARY_CLEAN stage that runs after ECS_PRIMARY_ROLLOUT
then failed to delete the canary task set because it had already been removed.
The logs in Control Plane:
Failed to clean CANARY task set
: failed to delete ECS task set : operation error ECS: DeleteTaskSet, https response error StatusCode: 400, RequestID: , TaskSetNotFoundException: Unable to find task set with id on service ecs-bg-broken-service-1.
It would be easy to change ECS_PRIMARY_ROLLOUT so that it does not delete the canary task set,
but it would be even easier to change it so that it does not delete any of the old task sets.
I wonder which is better...
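
To make the two options concrete, here is a minimal Go sketch of how the set of task sets to delete could be chosen. It is not the actual piped code: `TaskSet`, `CleanupPolicy`, `KeepCanary`, `KeepAll`, and `taskSetsToDelete` are hypothetical names, and it assumes the canary task set can be identified by its ID.

```go
package main

import "fmt"

// TaskSet is a hypothetical, simplified stand-in for an ECS task set.
type TaskSet struct {
	ID     string
	Status string // informational only in this sketch, e.g. "PRIMARY" or "ACTIVE"
}

// CleanupPolicy is a hypothetical switch between the two options above.
type CleanupPolicy int

const (
	// KeepCanary deletes the old task sets but skips the canary.
	KeepCanary CleanupPolicy = iota
	// KeepAll deletes nothing during the primary rollout.
	KeepAll
)

// taskSetsToDelete returns the previous task sets that the primary rollout
// would be allowed to delete under the given policy.
func taskSetsToDelete(prev []TaskSet, canaryID string, policy CleanupPolicy) []TaskSet {
	if policy == KeepAll {
		return nil
	}
	var out []TaskSet
	for _, ts := range prev {
		if ts.ID == canaryID {
			// The canary may still be receiving traffic, so leave it
			// for ECS_CANARY_CLEAN.
			continue
		}
		out = append(out, ts)
	}
	return out
}

func main() {
	prev := []TaskSet{
		{ID: "ts-old-primary", Status: "PRIMARY"},
		{ID: "ts-canary", Status: "ACTIVE"},
	}
	for _, ts := range taskSetsToDelete(prev, "ts-canary", KeepCanary) {
		fmt.Println("delete:", ts.ID)
	}
	fmt.Println("deleted under KeepAll:", len(taskSetsToDelete(prev, "ts-canary", KeepAll)))
}
```

With the second option, the old PRIMARY task set would presumably have to be removed by some later stage, which this sketch does not cover.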