Too Many Concurrent Attempts to Register Task Definition in ECS Work Pool
Bug summary
When multiple Prefect flows start simultaneously using an ECS work pool, they attempt to register a new AWS task definition at the same time. This leads to a ClientException with the message: "Too many concurrent attempts to create a new revision of the specified family."
I have 5 scheduled flows that start simultaneously each day. Each flow attempts to register a new task definition, likely because a new deployment version is created each night.
I've tried setting these variables in my worker Dockerfile: `ENV AWS_RETRY_MODE=adaptive` and `ENV AWS_MAX_ATTEMPTS=100`; however, this didn't resolve the issue.
I'm using the same Docker image for each flow, with only the input parameters differing. (I have enabled "Match Latest Revision In Family (Optional)", but it's not working.)
Version info
Version: 2.20.2
API version: 0.8.4
Python version: 3.11.9
Git commit: 51c3f290
Built: Wed, Aug 14, 2024 11:27 AM
OS/Arch: darwin/arm64
Profile: default
Server type: server
Additional context
I'm using prefect-aws: 0.4.2
Task Definitions Comparison:
- Only differences are revision, registeredAt, and registeredBy.
- The Docker image remains the same: prefect-ecs-flow-image:latest.
- Flow run names remain unchanged.
Attempts to Mitigate:
- Disabled AWS logging to prevent automatic task definition creation.
- Enabled "Match Latest Revision In Family" without success.
Related Issue: PrefectHQ/prefect#10102
Environment Variables Set:

```dockerfile
ENV AWS_RETRY_MODE=adaptive
ENV AWS_MAX_ATTEMPTS=100
```
Task definitions created for the same deployment (different flow runs): task_definition_1.json, task_definition_2.json
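As a stopgap while the root cause is investigated, collisions like this can sometimes be worked around client-side by retrying the failing call with exponential backoff and full jitter. A minimal, generic sketch (the callable and the matched error text are placeholders for whatever raises the `ClientException` in your setup, not Prefect or boto3 API):

```python
import random
import time


def retry_with_backoff(fn, *, retries=5, base_delay=1.0, max_delay=30.0,
                       retryable=("Too many concurrent attempts",)):
    """Call fn(), retrying with exponential backoff + jitter on matching errors."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Re-raise immediately if out of attempts or the error isn't retryable.
            if attempt == retries or not any(s in str(exc) for s in retryable):
                raise
            # Full jitter: sleep a random amount up to the capped exponential delay.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

This only papers over the symptom (each flow run still tries to register a revision); the real fix discussed below is getting the worker to reuse existing task definitions.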
+1
I'm also facing a similar issue.
+1
Same here. @torbiczuk did you find a workaround?
Not really using prefect features.
Some prior lit for posterity, though none of the discussions in them provide a solution or workaround:
- https://github.com/PrefectHQ/prefect/issues/4402
- https://github.com/PrefectHQ/prefect/issues/10102
- https://linen.prefect.io/t/2432300/hi-there-having-a-slight-issue-with-the-ecs-agent-startflowr
- https://linen.prefect.io/t/16602210/hi-all-i-get-this-error-when-running-say-large-amount-of-flo
- old discourse thread 1
For folks using the ECS hybrid work pool, I'm pretty sure this is why this field exists: it will reuse existing task definitions when possible.
For folks using the ECS push pool, it's not currently on the job configuration, so we will get this option ported as soon as we can.
Apologies for the double ping, @zzstoatzz; just trying to ensure this information is accessible outside of Slack. I don't believe the above works. I am setting `match_latest_revision_in_family` to `True` in the deployment, having confirmed that the ECS worker we're using (not the push pool) has this variable in its base job template:
```python
await ecs_flow.deploy(
    name=(name if config.name is None else config.name),
    schedules=[schedule] if schedule else None,  # type: ignore
    paused=not active,
    work_pool_name=config.work_pool_name,
    image=ecr_image,
    job_variables=base_job_variables
    | {
        "container_name": "flows",
        "family": config.flow.name,
        "cpu": 1024 * config.cpus,
        "memory": 1024 * config.memory_gb,
        "match_latest_revision_in_family": True,
    },
    build=False,
    push=False,
)
```
And have also tried overriding it explicitly when launching a flow on the ECS worker:
```python
job_variables: dict[str, Any] = {"match_latest_revision_in_family": True}
...
flow_run: FlowRun = await run_deployment(
    name=flow_name,
    parameters=kwargs,
    as_subflow=False,
    job_variables=job_variables,
    flow_run_name=flow_run_name,
    idempotency_key=idempotency_key,
    timeout=5,  # give 5s to create the flow run and get an id we can use to fetch logs
)
```
Despite this, new revisions are created each time:
Flow1:
```
Retrieving ECS task definition 'arn:aws:ecs:eu-west-2:663985622336:task-definition/telemetry-cloud-raw-to-source:524'...
Retrieving most recent active revision from ECS task family 'telemetry-cloud-raw-to-source'...
Registering ECS task definition...
Task definition request
{
  "cpu": "16384",
  "family": "telemetry-cloud-raw-to-source",
  "memory": "65536",
  "executionRoleArn": "arn:aws:iam::663985622336:role/prod-ecs-task-execution-role",
  "containerDefinitions": [
    {
      "image": "596302374988.dkr.ecr.eu-west-2.amazonaws.com/nimbus/prefect-flows-datalake:prod",
      "name": "prefect",
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-create-group": "true",
          "awslogs-group": "/prefect-prod",
          "awslogs-region": "eu-west-2",
          "awslogs-stream-prefix": "prefect-prod"
        }
      }
    }
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc"
}
Using ECS task definition 'arn:aws:ecs:eu-west-2:663985622336:task-definition/telemetry-cloud-raw-to-source:525'...
```
Flow2:
```
Retrieving ECS task definition 'arn:aws:ecs:eu-west-2:663985622336:task-definition/telemetry-cloud-raw-to-source:525'...
Retrieving most recent active revision from ECS task family 'telemetry-cloud-raw-to-source'...
Registering ECS task definition...
Task definition request
{
  "cpu": "16384",
  "family": "telemetry-cloud-raw-to-source",
  "memory": "65536",
  "executionRoleArn": "arn:aws:iam::663985622336:role/prod-ecs-task-execution-role",
  "containerDefinitions": [
    {
      "image": "596302374988.dkr.ecr.eu-west-2.amazonaws.com/nimbus/prefect-flows-datalake:prod",
      "name": "prefect",
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-create-group": "true",
          "awslogs-group": "/prefect-prod",
          "awslogs-region": "eu-west-2",
          "awslogs-stream-prefix": "prefect-prod"
        }
      }
    }
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc"
}
Using ECS task definition 'arn:aws:ecs:eu-west-2:663985622336:task-definition/telemetry-cloud-raw-to-source:526'...
```
I tried this with a different flow just to confirm, and yeah, new task definition each time:
thanks for all the detail @Samreay! from my understanding that's unexpected, so we'll take a look when we're back online
(cc @kevingrismore as domain expert)
I have the same issue as @Samreay. I've tried every solution I could find on the internet; to me it is obviously a bug. I also mentioned it on Slack, but no one was interested.
Finally, I built an AWS event-driven mechanism to retry jobs that crashed, but it was painful to discover that the bug was not on my side.
Apologies for the ping, I just want to make sure this isn't lost in the New Year holiday break. @zzstoatzz or @kevingrismore - have you had any chance to look into this?
Alternatively, if someone could point me in the right direction in the code base, I'd be happy to do a little digging myself.
+1
+1, same issue here
hey folks! agreed, this is pretty annoying. I think we take create-invoke-destroy a little too literally, and there's some room for improvement here. Will update as we take this on, but should be in the next month or so (🤞 ) thanks for your patience here.
Hey @aaazzam @kevingrismore , any updates here? We are still having this issue. We need to spin up multiple quasi-concurrent ECS tasks to do a large backfill, and we are being blocked by this.
thanks @AndreaPiccione for the bump.
So my suspicion so far is kinda boring: match_latest_revision_in_family is supported as of prefect-aws==0.4.10.
The only version I see in this thread is prefect-aws==0.4.2. Is anyone on anything newer and still experiencing this?
If you modify the base job template in the UI but your worker is using an older version of prefect-aws, the new kwargs are simply ignored. Upgrading prefect-aws on each worker should make them respect match_latest_revision_in_family, which I suspect will solve this concurrent registration problem for most folks.
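One quick way to sanity-check a worker is to compare its installed prefect-aws version against 0.4.10, the version the thread identifies as the first to support this option. A minimal sketch with naive dotted-version parsing (no pre-release or post-release handling):

```python
def version_tuple(version: str) -> tuple[int, ...]:
    """Parse a simple dotted version string like '0.4.10' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def supports_match_latest_revision(installed: str) -> bool:
    # match_latest_revision_in_family landed in prefect-aws 0.4.10 (per this thread).
    return version_tuple(installed) >= (0, 4, 10)
```

On a worker, the installed version can be read with the standard library, e.g. `importlib.metadata.version("prefect-aws")`. Note that naive string comparison gets this wrong: `"0.4.2" < "0.4.10"` is False lexicographically, which is exactly why the tuple comparison is needed.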
I think there may be something else going on here. The ECS worker has an in-memory caching mechanism that maintains a map of deployment ids and task definition arns. When a deployment is run, the task definition matching the cached arn is retrieved, and if it's found to be equal to the definition constructed from the deployment's config, registration is skipped.
match_latest_revision_in_family is effectively an out-of-memory caching fallback, constrained in that it will only fetch and run an equality check against exactly one task definition: the most recent revision of the specified family (or default family). It should really only be useful in the case of a cache miss where a task definition matching the deployment's config already exists, which most commonly happens after a worker restart.
If there's been no worker restart and you're repeatedly rerunning the same deployment, it's not match_latest_revision_in_family that's failing, but the equality check between the generated task def and the fetched task def used by both the caching mechanism and match_latest_revision_in_family.
We've just released prefect-aws==0.5.6, which highlights the differences between the generated and fetched task definitions when they don't match. If anyone facing this issue could upgrade their ECS worker to this release and add PREFECT_LOGGING_LEVEL=DEBUG to the env where your worker is running, seeing the diff would help us understand why this is happening.
Hi @kevingrismore , thanks for the quick update and for releasing a new version of prefect-aws. We have followed your suggestions, but unfortunately this is what's happening:
```
Retrieving ECS task definition 'arn:aws:ecs:eu-west-2:175854679451:task-definition/prefect-ecs-flow-telemetry-source:9'...
Retrieving most recent active revision from ECS task family 'prefect-ecs-flow-telemetry-source'...
Registering ECS task definition...
Task definition request
{
  "cpu": "16384",
  "family": "prefect-ecs-flow-telemetry-source",
  "memory": "65536",
  "taskRoleArn": "arn:aws:iam::175854679451:role/dev-ecs-task-datalake-prefect-worker",
  "runtimePlatform": {
    "cpuArchitecture": "ARM64",
    "operatingSystemFamily": "LINUX"
  },
  "executionRoleArn": "arn:aws:iam::175854679451:role/dev-ecs-task-execution-role",
  "containerDefinitions": [
    {
      "name": "prefect",
      "image": "175854679451.dkr.ecr.eu-west-2.amazonaws.com/nimbus/prefect-flows-combined:dev",
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-create-group": "true",
          "awslogs-group": "/prefect-dev",
          "awslogs-region": "eu-west-2",
          "awslogs-stream-prefix": "prefect-dev"
        }
      }
    }
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc"
}
Using ECS task definition 'arn:aws:ecs:eu-west-2:175854679451:task-definition/prefect-ecs-flow-telemetry-source:10'...
Task definition
{
  "cpu": "16384",
  "family": "prefect-ecs-flow-telemetry-source",
  "memory": "65536",
  "taskRoleArn": "arn:aws:iam::175854679451:role/dev-ecs-task-datalake-prefect-worker",
  "runtimePlatform": {
    "cpuArchitecture": "ARM64",
    "operatingSystemFamily": "LINUX"
  },
  "executionRoleArn": "arn:aws:iam::175854679451:role/dev-ecs-task-execution-role",
  "containerDefinitions": [
    {
      "name": "prefect",
      "image": "175854679451.dkr.ecr.eu-west-2.amazonaws.com/nimbus/prefect-flows-combined:dev",
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-create-group": "true",
          "awslogs-group": "/prefect-dev",
          "awslogs-region": "eu-west-2",
          "awslogs-stream-prefix": "prefect-dev"
        }
      }
    }
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc"
}
```
No logs about differences between the fetched and generated task definitions.
I am trying to debug myself what is happening here.
Based on the logs, it seems like it is finding a cached_task_definition_arn in the cache and retrieving a cached_task_definition on line 827, otherwise we wouldn't see the log:
Retrieving ECS task definition 'arn:aws:ecs:eu-west-2:175854679451:task-definition/prefect-ecs-flow-telemetry-source:9'...
But then it is also running line 845 otherwise we wouldn't be seeing the log
Retrieving most recent active revision from ECS task family 'prefect-ecs-flow-telemetry-source'...
In both cases, it means that cached_task_definition_arn ends up being None, which is very odd.
@AndreaPiccione The next most likely explanation is that an exception is being raised and consumed without any kind of logging, so cached_task_definition_arn ends up None in both cases. I'm going to add more logging elsewhere in the ECS worker today, so that if an exception is happening, we surface it too.
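The failure mode described here, where an exception is consumed silently and the caller just sees a cache miss, looks roughly like the following pattern; the fix is to surface the exception before falling through. This is illustrative code, not the actual worker source:

```python
import logging

logger = logging.getLogger("ecs-worker-sketch")


def retrieve_task_definition_silently(fetch, arn):
    """Anti-pattern: any failure silently degrades to a cache miss (None)."""
    try:
        return fetch(arn)
    except Exception:
        return None  # caller re-registers a revision with no trace of why


def retrieve_task_definition_logged(fetch, arn):
    """Same fallback behavior, but the exception is logged for debugging."""
    try:
        return fetch(arn)
    except Exception:
        logger.exception("Failed to retrieve task definition %r", arn)
        return None
```

With the silent variant, a transient AWS error (e.g. throttling) during retrieval is indistinguishable from a genuine cache miss, which would explain logs that show both the cache lookup and the family fallback running before a fresh registration.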
prefect-aws 0.5.11 just came out and this fixed the issue for me, thank you!