amazon-ecs-agent

essential container is pending after enabling service connect for ECS on EC2

zxkane opened this issue 1 year ago • 20 comments

Summary

Everything works well. However, the essential container is always pending after enabling Service Connect for ECS on EC2.

After updating the ECS service to Fargate, it works again with Service Connect enabled.

I found that the ECS agent lost its connection to the ECS service when "sending state change to ECS".

Description

I'm using the code snippet below to create an ECS service on EC2 with Service Connect via CDK:

        const ec2Service = new Ec2Service(this, 'DifyEcsClusterEC2SandboxService', {
            cluster: this.cluster,
            taskDefinition: this.sandboxTaskDefinition,
            desiredCount: 1,
            serviceName: 'ec2-dify-sandbox',
            vpcSubnets: this.cluster.vpc.selectSubnets({ subnetType: SubnetType.PRIVATE_WITH_EGRESS }),
            securityGroups: [this.securityGroup],
            capacityProviderStrategies: [{ capacityProvider: this.clusterDefaultCapacityProviderName, weight: 1 }]
        })

        ec2Service.addPlacementStrategies(
            PlacementStrategy.spreadAcross('attribute:ecs.availability-zone'),
            PlacementStrategy.spreadAcrossInstances(),
        )

        ec2Service.enableServiceConnect({
            namespace: this.cluster.defaultCloudMapNamespace?.namespaceArn,
            services: [{
                portMappingName: this.sandboxTaskDefinition.defaultContainer?.portMappings[0].name || "serverless-dify-sandbox-8194-tcp",
                dnsName: "dify-sandbox-on-ec2.serverless",
                discoveryName: "serverless-dify-sandbox-on-ec2",
            }],
            logDriver: LogDriver.awsLogs({
                streamPrefix: 'serverless-dify/service-connect/',
                logGroup: new LogGroup(this, 'service-connect-log-on-ec2', {
                    retention: RetentionDays.ONE_WEEK,
                  }),
            })
        })

If Service Connect is disabled, the container starts and runs well. However, it is always pending after enabling Service Connect.

After inspecting the logs of the ecs-agent container on EC2, I found the output below.

level=warn time=2024-10-15T05:19:31Z msg="Received an unrecognized attachment property" attachmentProperty="{\n  Name: \"EcsTaskSetArn\",\n  Value: \"arn:aws:ecs:ap-northeast-1:845861764576:task-set/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/serverless-dify-sandbox/ecs-svc/7792931705797793743\"\n}"
level=info time=2024-10-15T05:19:31Z msg="Received task payload from ACS" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" taskVersion="3" desiredStatus=RUNNING
level=info time=2024-10-15T05:19:31Z msg="Found application credentials for task" taskVersion="3" roleARN="arn:aws:iam::845861764576:role/ServerlessDifyStack-EcsCl-ServerlessDifyClusterSand-UHkUiKUZcERq" roleType="TaskApplication" credentialsID="50e0b19f-38e1-45c6-a580-07a7ab5c6cbb" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Found execution credentials for task" roleType="TaskExecution" credentialsID="85db1dd5-4c11-4e98-8345-7bbe668de81c" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" taskVersion="3" roleARN="arn:aws:iam::845861764576:role/ServerlessDifyStack-EcsCl-ServerlessDifyClusterSand-UHkUiKUZcERq"
level=info time=2024-10-15T05:19:31Z msg="Resources successfully consumed, continue to task creation" taskArn="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Host resources consumed, progressing task" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Digest resolution not required" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" containerName="~internal~ecs~pause" image="amazon/amazon-ecs-pause:0.1.0"
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" knownStatus="NONE" desiredStatus="RESOURCES_PROVISIONED" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="" changeEventStatus="MANIFEST_PULLED"
level=info time=2024-10-15T05:19:31Z msg="Creating task cgroup taskARN=arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db cgroupPath=/ecs/2700c02483254aef95cbc43f497f16db cgroupV2=false" module=cgroup.go
level=info time=2024-10-15T05:19:31Z msg="Transitioned resource" status="CREATED" task="2700c02483254aef95cbc43f497f16db" resource="cgroup"
level=info time=2024-10-15T05:19:31Z msg="Managed task got resource" resource="cgroup" status="CREATED" task="2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="" changeEventStatus="PULLED" knownStatus="MANIFEST_PULLED" desiredStatus="RESOURCES_PROVISIONED"
level=info time=2024-10-15T05:19:31Z msg="Creating container" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause"
level=info time=2024-10-15T05:19:31Z msg="Created container name mapping for task" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" dockerContainerName="ecs-serverless-dify-sandbox-3-internalecspause-b4edaff8a49797bb4500"
level=info time=2024-10-15T05:19:31Z msg="Created docker container for task" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" dockerId="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" elapsed=74.224358ms
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" changeEventStatus="CREATED" knownStatus="PULLED" desiredStatus="RESOURCES_PROVISIONED" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause"
level=info time=2024-10-15T05:19:31Z msg="Starting container" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause"
level=info time=2024-10-15T05:19:31Z msg="Started container" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" elapsed=171.265595ms
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" changeEventStatus="RUNNING" knownStatus="CREATED" desiredStatus="RESOURCES_PROVISIONED" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9"
level=info time=2024-10-15T05:19:31Z msg="Start streaming metrics for container" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9"
level=info time=2024-10-15T05:19:31Z msg="Setting up container resources for container" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause"
level=info time=2024-10-15T05:19:31Z msg="Setting up CNI config for task" cniContainerNetNs="/host/proc/24031/ns/net" task="2700c02483254aef95cbc43f497f16db" cniContainerID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" cniPluginPath="" cniID="06:d5:89:0d:2f:2d" cniBridgeName=""
level=info time=2024-10-15T05:19:31Z msg="Task associated with ip address" task="2700c02483254aef95cbc43f497f16db" ip="169.254.172.4"
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" changeEventStatus="RESOURCES_PROVISIONED" knownStatus="RUNNING" desiredStatus="RESOURCES_PROVISIONED"
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" changeEventStatus="MANIFEST_PULLED" knownStatus="NONE" desiredStatus="RUNNING" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf" runtimeID=""
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" container="sandbox" runtimeID="" changeEventStatus="MANIFEST_PULLED" knownStatus="NONE" desiredStatus="RUNNING" task="2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Container change also resulted in task change" runtimeID="" desiredStatus="RUNNING" knownStatus="MANIFEST_PULLED" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf"
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" container="ecs-service-connect-nconQf" runtimeID="" changeEventStatus="PULLED" knownStatus="MANIFEST_PULLED" desiredStatus="RUNNING" task="2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Creating container" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf"
level=info time=2024-10-15T05:19:31Z msg="Applying execution role credentials to container log auth" awslogs-credentials-endpoint="/v2/credentials/85db1dd5-4c11-4e98-8345-7bbe668de81c" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" roleType="TaskExecution" roleARN="arn:aws:iam::845861764576:role/ServerlessDifyStack-EcsCl-ServerlessDifyClusterSand-UHkUiKUZcERq" credentialsID="85db1dd5-4c11-4e98-8345-7bbe668de81c"
level=info time=2024-10-15T05:19:31Z msg="Created container name mapping for task" dockerContainerName="ecs-serverless-dify-sandbox-3-ecs-service-connect-nconQf-94b4e0ccf1d3fdbf9d01" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf"
level=info time=2024-10-15T05:19:32Z msg="Created docker container for task" container="ecs-service-connect-nconQf" dockerId="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b" elapsed=70.437094ms task="2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:32Z msg="Handling container change event" knownStatus="PULLED" desiredStatus="RUNNING" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf" runtimeID="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b" changeEventStatus="CREATED"
level=info time=2024-10-15T05:19:32Z msg="Starting container" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf" runtimeID="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b"
2024-10-15T05:19:32Z 200 10.0.169.54:59412 "/v2/credentials" "Go-http-client/1.1" arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db GetCredentialsExecutionRole 2 ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq arn:aws:ecs:ap-northeast-1:845861764576:container-instance/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/3929aad2e3374aa79d0480c7eb68919d
level=info time=2024-10-15T05:19:32Z msg="Processing credential request, credentialType=TaskExecution taskARN=arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" module=credentials_handler.go
level=info time=2024-10-15T05:19:32Z msg="Started container" runtimeID="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b" elapsed=170.946791ms task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf"
level=info time=2024-10-15T05:19:32Z msg="Handling container change event" knownStatus="CREATED" desiredStatus="RUNNING" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf" runtimeID="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b" changeEventStatus="RUNNING"
level=info time=2024-10-15T05:19:32Z msg="Start streaming metrics for container" runtimeID="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b"
level=info time=2024-10-15T05:19:32Z msg="Writing response for v4 task metadata" tmdsEndpointContainerID="e1ab4a8e-3b73-4f70-9d72-85eba94ca012" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:32Z msg="Writing response for v4 container metadata" tmdsEndpointContainerID="e1ab4a8e-3b73-4f70-9d72-85eba94ca012" container="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b"
level=info time=2024-10-15T05:19:32Z msg="Writing response for v4 task metadata" tmdsEndpointContainerID="e1ab4a8e-3b73-4f70-9d72-85eba94ca012" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:50Z msg="Sending state change to ECS" eventType="TaskStateChange" taskArn="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" taskStatus="MANIFEST_PULLED" taskReason="" taskPullStartedAt="0001-01-01T00:00:00Z" taskPullStoppedAt="0001-01-01T00:00:00Z" taskKnownSentStatus="NONE" taskExecutionStoppedAt="0001-01-01T00:00:00Z" containerChange-0="containerName=ecs-service-connect-nconQf containerStatus=RUNNING containerKnownSentStatus=NONE containerRuntimeID=fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b containerIsEssential=true"
level=info time=2024-10-15T05:21:12Z msg="TCS Websocket connection closed for a valid reason"
level=info time=2024-10-15T05:21:12Z msg="Using cached DiscoverPollEndpoint" endpoint="https://ecs-a-5.ap-northeast-1.amazonaws.com/" telemetryEndpoint="https://ecs-t-5.ap-northeast-1.amazonaws.com/" serviceConnectEndpoint="https://ecs-sc.ap-northeast-1.api.aws" containerInstanceARN="arn:aws:ecs:ap-northeast-1:845861764576:container-instance/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/3929aad2e3374aa79d0480c7eb68919d"
level=info time=2024-10-15T05:21:12Z msg="Establishing a Websocket connection" url="https://ecs-t-5.ap-northeast-1.amazonaws.com/ws?agentHash=cf8c7a6b&agentVersion=1.87.0&cluster=ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq&containerInstance=arn%3Aaws%3Aecs%3Aap-northeast-1%3A845861764576%3Acontainer-instance%2FServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq%2F3929aad2e3374aa79d0480c7eb68919d&dockerVersion=25.0.6"
level=info time=2024-10-15T05:21:12Z msg="Websocket connection established." URL="https://ecs-t-5.ap-northeast-1.amazonaws.com/ws?agentHash=cf8c7a6b&agentVersion=1.87.0&cluster=ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq&containerInstance=arn%3Aaws%3Aecs%3Aap-northeast-1%3A845861764576%3Acontainer-instance%2FServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq%2F3929aad2e3374aa79d0480c7eb68919d&dockerVersion=25.0.6" ConnectTime="2024-10-15 05:21:12" ExpectedDisconnectTime="2024-10-15 05:51:12"
level=info time=2024-10-15T05:21:12Z msg="Connected to TCS endpoint"

You can see abnormal connectivity between ECS and the agent on EC2:

level=info time=2024-10-15T05:19:50Z msg="Sending state change to ECS" eventType="TaskStateChange" taskArn="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" taskStatus="MANIFEST_PULLED" taskReason="" taskPullStartedAt="0001-01-01T00:00:00Z" taskPullStoppedAt="0001-01-01T00:00:00Z" taskKnownSentStatus="NONE" taskExecutionStoppedAt="0001-01-01T00:00:00Z" containerChange-0="containerName=ecs-service-connect-nconQf containerStatus=RUNNING containerKnownSentStatus=NONE containerRuntimeID=fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b containerIsEssential=true"
level=info time=2024-10-15T05:21:12Z msg="TCS Websocket connection closed for a valid reason"

After updating the above ECS service to use Fargate, it works well.

        const service = new FargateService(this, 'DifyEcsClusterSandboxService', {
            cluster: this.cluster,
            taskDefinition: this.sandboxTaskDefinition,
            desiredCount: 1,
            serviceName: 'fargate-dify-sandbox',
            vpcSubnets: this.cluster.vpc.selectSubnets({ subnetType: SubnetType.PRIVATE_WITH_EGRESS }),
            securityGroups: [this.securityGroup],
        })

        service.enableServiceConnect({
            namespace: this.cluster.defaultCloudMapNamespace?.namespaceArn,
            services: [{
                portMappingName: this.sandboxTaskDefinition.defaultContainer?.portMappings[0].name || "serverless-dify-sandbox-8194-tcp",
                dnsName: "serverles-dify-sandbox",
                discoveryName: "serverless-dify-sandbox",
            }],
            logDriver: LogDriver.awsLogs({
                streamPrefix: 'serverless-dify/service-connect/',
                logGroup: new LogGroup(this, 'service-connect-on-fargate', {
                    retention: RetentionDays.ONE_WEEK,
                  }),
            })
        })

Expected Behavior

The essential container should be started.

Observed Behavior

The ECS agent does not start the essential container after the Service Connect container has started.

Environment Details

ECS agent: 1.87.0
EC2 AMI: amzn2-ami-ecs-hvm-2.0.20241010-x86_64-ebs

Supporting Log Snippets

See the description above.

zxkane · Oct 15 '24

Hi @zxkane, thanks for opening up this issue.

You can see abnormal connectivity between ECS and the agent on EC2:

level=info time=2024-10-15T05:19:50Z msg="Sending state change to ECS" eventType="TaskStateChange" taskArn="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" taskStatus="MANIFEST_PULLED" taskReason="" taskPullStartedAt="0001-01-01T00:00:00Z" taskPullStoppedAt="0001-01-01T00:00:00Z" taskKnownSentStatus="NONE" taskExecutionStoppedAt="0001-01-01T00:00:00Z" containerChange-0="containerName=ecs-service-connect-nconQf containerStatus=RUNNING containerKnownSentStatus=NONE containerRuntimeID=fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b containerIsEssential=true"
level=info time=2024-10-15T05:21:12Z msg="TCS Websocket connection closed for a valid reason"

What you see here is actually expected behavior: the agent disconnecting from the ECS telemetry connection (see the ref where this log statement comes from). This is also not the same ECS endpoint that we send state changes over (that's the ACS endpoint); the TCS endpoint is where we send metrics. The agent will periodically disconnect from and then reconnect to the telemetry endpoint, and you should see a corresponding log statement a bit later.

Could you clarify a bit more what you mean by the essential container being stuck in pending? It looks like the container did transition to a running state.

level=info time=2024-10-15T05:19:31Z msg="Handling container change event" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" changeEventStatus="RESOURCES_PROVISIONED" knownStatus="RUNNING" desiredStatus="RESOURCES_PROVISIONED"

If possible, could you share a bit more about how the task definition is configured?

mye956 · Oct 29 '24

@mye956 thanks for looking at this.

I uploaded the task definition for your reference: serverless-dify-sandbox-revision1.json.

In my latest attempt, I configured both EC2-based and Fargate-based ECS services with service connect enabled, using the same task definition.

From the screenshot below, you can see that the EC2-based service is stuck pending.

[image]

Digging into the task of the EC2-based service, the essential container dify-sandbox is always pending start. This also makes the ecs-service-connect container become unhealthy after running for a while.

[image]

After disabling the Service Connect configuration for the EC2-based service, the task runs successfully.

[image]

Any suggestions for addressing the availability issue when enabling Service Connect for EC2-based services?

zxkane · Nov 07 '24

Hi @zxkane,

Are you still having this issue? We've been having a lot of ecs-service-connect issues on our ECS cluster, which is backed by EC2.

We have a scheduled setup where the cluster is shut down over the weekend and brought up on Monday morning, but since upgrading the agent it has been failing to come up due to ecs-service-connect failing health checks.

haphazardeous · Dec 04 '24

Hi @zxkane, we're having exactly the same issue. By any chance, did you find any solution?

tinchogon34 · Dec 10 '24

Hi @zxkane, we're having exactly the same issue. By any chance, did you find any solution?

No solution for EC2. Using Fargate works like a charm.

zxkane · Dec 11 '24

I just realized today that EC2 with Amazon Linux instead of Bottlerocket works too.

tinchogon34 · Dec 11 '24

Hi @zxkane. We will need more logs from your container instance to investigate this issue. Can you please enable debug logs by adding ECS_LOGLEVEL=debug to your ECS configuration file (located at /etc/ecs/ecs.config by default)? Then can you please run a new task with the same configuration, collect logs from your container instance using the ECS Logs Collector, and send the logs bundle to [email protected]? Please reference this Github issue in your email subject.

A PENDING task container is usually due to an unresolved container dependency.
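
As a rough sketch of how the debug setting could be wired up from CDK, assuming the cluster capacity comes from an AutoScalingGroup you control (the autoScalingGroup handle below is a placeholder for whatever backs the capacity provider), the config line can be appended via user data; it only takes effect on instances launched afterwards:

    import { AutoScalingGroup } from 'aws-cdk-lib/aws-autoscaling';

    // Placeholder for the AutoScalingGroup backing the cluster's capacity provider.
    declare const autoScalingGroup: AutoScalingGroup;

    // Appending to user data only affects instances launched after this change;
    // on existing instances, edit /etc/ecs/ecs.config and restart the agent by hand.
    autoScalingGroup.addUserData('echo "ECS_LOGLEVEL=debug" >> /etc/ecs/ecs.config');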

amogh09 · Dec 19 '24

@amogh09 We are also experiencing a similar issue. I sent an email with debug logs to [email protected]; the subject is "Github Issue 4397 - ECS container instance DEBUG logs".

alessiogaldy · Dec 22 '24

We've started periodically experiencing the perpetual pending status issue on EC2-backed ECS clusters. I believe we first noticed the strange behaviour around November. Only services with Service Connect enabled are affected. When a container instance starts showing the issue we set it to drain. New EC2 instances that are spun up by the ASG don't seem to exhibit the issue immediately. The age of the instance doesn't seem to be related, as we've seen instances start to show the issue that are newer than ones that don't. The ECS agent version varies.

rwohleb · Jan 08 '25

What @rwohleb described is exactly what we are experiencing when using Service Connect on EC2 instances.

Every 2-3 days we find 1 or 2 instances in a broken state. Looking at the logs on the instance, all we can see is that the AppNet Agent failed with an error like the following:

[2024-12-17 19:55:26.759][7][error] [AppNet Agent] Envoy readiness check failed with: Get "http://127.0.0.1:9901/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[2024-12-17 19:55:26.909][7][error] [AppNet Agent] Envoy connectivity check failed with: Get "http://127.0.0.1:9901/stats?filter=control_plane.connected_state&format=json": EOF
[2024-12-17 19:55:26.909][7][error] [AppNet Agent] Envoy readiness check failed with: Get "http://127.0.0.1:9901/ready": read unix @->/tmp/envoy_admin.sock: read: connection reset by peer
[2024-12-17 19:55:26.909][7][error] [AppNet Agent] Envoy connectivity check failed with: Get "http://127.0.0.1:9901/stats?filter=control_plane.connected_state&format=json": dial unix /tmp/envoy_admin.sock: connect: connection refused
[2024-12-17 19:55:26.942][7][warning] [AppNet Agent] [Envoy process 15] Exited with code [-1]
[2024-12-17 19:55:26.943][7][warning] [AppNet Agent] [Envoy process 15] Additional Exit data: [Core Dump: false][Normal Exit: false][Process Signalled: true]

Initially we were also terminating the instances and letting the ASG create new ones, but recently we noticed that just rebooting the instance brings it back to a normal state.
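
As an illustrative sketch only (placeholder names, AWS SDK for JavaScript v3), that reboot workaround could be scripted once monitoring flags a broken instance:

    import { EC2Client, RebootInstancesCommand } from '@aws-sdk/client-ec2';

    const ec2 = new EC2Client({});

    // Reboot a container instance stuck in the broken Service Connect state,
    // instead of terminating it and waiting for the ASG to replace it.
    export async function rebootBrokenInstance(instanceId: string): Promise<void> {
        await ec2.send(new RebootInstancesCommand({ InstanceIds: [instanceId] }));
    }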

alessiogaldy · Jan 09 '25

We're seeing this issue as well, and have a case open with AWS Support. Collecting what may be related issues:

  • https://github.com/aws/amazon-ecs-service-connect-agent/issues/41
  • https://github.com/aws/amazon-ecs-service-connect-agent/issues/91
  • https://github.com/aws/amazon-ecs-service-connect-agent/issues/104
  • https://github.com/aws/amazon-ecs-agent/issues/4459

mkleinsasser · Jan 10 '25

@tinchogon34 Correct to assume that you mean amazon-linux-2023?

We experience the same issue with the amzn2-ami-ecs-gpu-hvm-2.0.20250102-x86_64-ebs AMI for the EC2 instance...

timvw · Jan 13 '25

We are also affected by the issue. It started as soon as we updated from ami-05bc1b71d9fb7b17c (al2023-ami-ecs-hvm-2023.0.20240917-kernel-6.1-arm64 - ECS Agent: 1.86.3 - Docker version: 25.0.6) to ami-05c972f666f1ea926 (al2023-ami-ecs-hvm-2023.0.20250110-kernel-6.1-arm64 - ECS Agent: 1.89.2 - Docker version: 25.0.6) in eu-west-1. The AppNet Agent is killed, sometimes days after the instance started, and never restarts:

[2025-01-15 19:30:29.505][19][info][main] [source/server/server.cc:932] all clusters initialized. initializing init manager
[2025-01-15 19:30:29.505][19][info][config] [source/common/listener_manager/listener_manager_impl.cc:926] all dependencies initialized. starting workers
[2025-01-17 12:57:27.772][7][warning] [AppNet Agent] [Envoy process 19] Exited with code [-1]
[2025-01-17 12:57:27.772][7][warning] [AppNet Agent] [Envoy process 19] Additional Exit data: [Core Dump: false][Normal Exit: false][Process Signalled: true]

The Envoy process somehow dies at some point, in this case hours after the machine started (other times it fails right as the first containers get scheduled on the instance). It appears to have been triggered by some containers being newly scheduled on the instance a minute prior to Envoy "abnormally ending".

Jan 17 12:56:44 ip-10-40-52-59 audit[1864076]: ANOM_ABEND auid=4294967295 uid=20000 gid=0 ses=4294967295 subj=system_u:system_r:unconfined_service_t:s0 pid=1864076 comm="wrk:worker_1" exe="/usr/bin/envoy" sig=11 res=1
Jan 17 12:56:45 ip-10-40-52-59 audit[1864077]: ANOM_ABEND auid=4294967295 uid=20000 gid=0 ses=4294967295 subj=system_u:system_r:unconfined_service_t:s0 pid=1864077 comm="wrk:worker_1" exe="/usr/bin/envoy" sig=11 res=1


Jan 17 12:57:27 ip-10-40-52-59 systemd-coredump[1864535]: Process 1864076 (envoy) of user 20000 dumped core.
                                                                                           
                                                                                           Stack trace of thread 46:
                                                                                           #0  0x0000aaaac1c3cc8c n/a (/usr/bin/envoy + 0x14acc8c)
                                                                                           #1  0x0000aaaac30a85c8 n/a (/usr/bin/envoy + 0x29185c8)
                                                                                           #2  0x0000aaaac30a85c8 n/a (/usr/bin/envoy + 0x29185c8)
                                                                                           #3  0x0000aaaac30b5104 n/a (/usr/bin/envoy + 0x2925104)
                                                                                           #4  0x0000aaaac30b1e80 n/a (/usr/bin/envoy + 0x2921e80)
                                                                                           #5  0x0000aaaac3059098 n/a (/usr/bin/envoy + 0x28c9098)
                                                                                           #6  0x0000aaaac36bc87c n/a (/usr/bin/envoy + 0x2f2c87c)
                                                                                           #7  0x0000aaaac36bc7cc n/a (/usr/bin/envoy + 0x2f2c7cc)
                                                                                           #8  0x0000aaaac30774b0 n/a (/usr/bin/envoy + 0x28e74b0)
                                                                                           #9  0x0000aaaac38d556c n/a (/usr/bin/envoy + 0x314556c)
                                                                                           #10 0x0000ffff8f7cd230 n/a (/lib64/libpthread.so.0 + 0x7230)
                                                                                           #11 0x0000ffff8f71b7dc n/a (/lib64/libc.so.6 + 0xdb7dc)
                                                                                           ELF object binary architecture: AARCH64
Jan 17 12:57:27 ip-10-40-52-59 systemd-coredump[1864558]: Process 1864077 (envoy) of user 20000 dumped core.
                                                                                           
                                                                                           Stack trace of thread 45:
                                                                                           #0  0x0000aaaacc85cc8c n/a (/usr/bin/envoy + 0x14acc8c)
                                                                                           #1  0x0000aaaacdcc85c8 n/a (/usr/bin/envoy + 0x29185c8)
                                                                                           #2  0x0000aaaacdcc85c8 n/a (/usr/bin/envoy + 0x29185c8)
                                                                                           #3  0x0000aaaacdcd5104 n/a (/usr/bin/envoy + 0x2925104)
                                                                                           #4  0x0000aaaacdcd1e80 n/a (/usr/bin/envoy + 0x2921e80)
                                                                                           #5  0x0000aaaacdc79098 n/a (/usr/bin/envoy + 0x28c9098)
                                                                                           #6  0x0000aaaace2dc87c n/a (/usr/bin/envoy + 0x2f2c87c)
                                                                                           #7  0x0000aaaace2dc7cc n/a (/usr/bin/envoy + 0x2f2c7cc)
                                                                                           #8  0x0000aaaacdc974b0 n/a (/usr/bin/envoy + 0x28e74b0)
                                                                                           #9  0x0000aaaace4f556c n/a (/usr/bin/envoy + 0x314556c)
                                                                                           #10 0x0000ffffa2e7d230 n/a (/lib64/libpthread.so.0 + 0x7230)
                                                                                           #11 0x0000ffffa2dcb7dc n/a (/lib64/libc.so.6 + 0xdb7dc)
                                                                                           ELF object binary architecture: AARCH64

We have 10 clusters (5 in eu-west-1, 5 in us-west-2) using EC2 instances only and ECS Service Connect. We started experiencing the problem as soon as we upgraded, although for now only on one of our 10 clusters. They are all provisioned in the same way using Terraform code and run similar services deployed in the same way. We are rolling back to the last known working AMI we were using, but I have logs from affected instances to share privately if needed.

nicolas-nannoni · Jan 17 '25

@nicolas-nannoni it looks like you're having exactly the same issue as us. We have about 4 ECS clusters, all running on EC2. Out of the 4, one is using agent version 1.85.1 and has had no issues whatsoever.

However, the rest are constantly having issues where ecs-service-connect fails its health check, due to either a new task being deployed or an existing task taking a long time. Then all ecs-service-connect health checks in all tasks fail. Somehow the cluster is still running and working as expected, but it's stuck and cannot launch new tasks.

Is there any way the agent can be rolled back to an earlier version? There seem to be major issues with this.

haphazardeous · Feb 20 '25

I'm also getting this intermittently. Rolling in a new set of EC2 machines can sometimes fix the problem for a few deployments before it comes back again.

Rodeoclash · Mar 01 '25

Hello @zxkane, I have the same issue with the service stuck pending start. Do you know how to fix it now? Thanks

namnguyenhai · Mar 14 '25

Hello @zxkane, I have the same issue with the service stuck pending start. Do you know how to fix it now? Thanks

No; since it wasn't a production project, I didn't follow up.

zxkane · Mar 17 '25

Hi, we're still investigating this issue. Based on a few of the agent logs we've received where we're seeing this issue so far, it seems the problem lies with the Service Connect AppNet sidecar container not exiting properly and thus not restarting after it exits. For example, from the AppNet relay logs:

[2024-12-22 07:20:47.742][7][warning] [AppNet Agent] [Envoy process 15] Exited with code [-1]

From Docker's perspective:

74801ec10446967ff221b8279ee50a1fe8fec0e839b4d67b3b821fb6744fe6f5 ecs-service-connect-agent:interface-v1 "/usr/bin/tini -- /usr/bin/agent" 35 hours ago Exited (0) 5 hours ago ecs---instance-service-connect-relay-bacfa5e1f0fbaf930a00

We've notified our service connect team on this issue.

mye956 · Apr 17 '25

Hi, I spent 2 days investigating. Is there any fix?

abakermi · Apr 24 '25

For anyone not following other issues, #4602 was merged and is in release v1.93.1. Hopefully that helps solve this specific issue.

rwohleb · Jun 09 '25

I'm experiencing this same issue, and it happened again just a few days ago. This makes me believe this specific issue is still unsolved.

My "fix" is to drain the EC2 instance when the ecs-service-connect sidecar gets into an unhealthy state.
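
For anyone automating that drain step, here is a rough sketch with the AWS SDK for JavaScript v3; the cluster name and container instance ARN are placeholders, and detecting the unhealthy sidecar is left out:

    import { ECSClient, UpdateContainerInstancesStateCommand } from '@aws-sdk/client-ecs';

    const ecs = new ECSClient({});

    // Set the affected container instance to DRAINING so ECS reschedules its tasks
    // onto healthy instances before the instance is replaced or rebooted.
    export async function drainInstance(cluster: string, containerInstanceArn: string): Promise<void> {
        await ecs.send(new UpdateContainerInstancesStateCommand({
            cluster,
            containerInstances: [containerInstanceArn],
            status: 'DRAINING',
        }));
    }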

MartinPJones · Jun 30 '25

For what it's worth, I haven't seen the issue repeat since I moved onto a more modern AMI.

Rodeoclash · Jun 30 '25

We deployed the latest ECS optimized Amazon Linux 2023 AMI (al2023-ami-ecs-hvm-2023.0.20250626-kernel-6.1-arm64) and ECS Agent (1.95.0), and haven't seen the issue either for over a week across 2 regions and 10 clusters. It appears to be solved.

nicolas-nannoni · Jul 01 '25

Hi, the fix for this should have gone out in ECS Agent 1.93.1 as part of https://github.com/aws/amazon-ecs-agent/issues/4459 and https://github.com/aws/amazon-ecs-agent/pull/4602. We will be closing this issue. If the problem still persists, please either re-open this issue or create a new one. Thanks!

mye956 · Jul 15 '25

The issue still persists. I am using the image al2023-ami-ecs-hvm-2023.0.20250828-kernel-6.1-x86_64 with ECS agent version v1.98.0; the main container remains pending and the Service Connect container goes unhealthy after a few minutes.

[image]

AmrHossam902 · Sep 07 '25