Flow run could not be submitted to infrastructure: TaskFailedToStart - 429 Too Many Requests - Server message: toomanyrequests:
First check
- [X] I added a descriptive title to this issue.
- [X] I used the GitHub search to find a similar issue and didn't find it.
- [X] I searched the Prefect documentation for this issue.
- [X] I checked that this issue is related to Prefect and not one of its dependencies.
Bug summary
Hi, we just had many flow runs fail with the error:
Flow run could not be submitted to infrastructure: TaskFailedToStart - CannotPullContainerError: ref pull has been retried 1 time(s): failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/prefecthq/prefect/manifests/sha256:69998e9cf3744779b98d816351e6ac837dc0eea4f450e0ad4180b6aefe995d33: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
We are on the PRO tier:
Any idea why we are getting those errors?
Reproduction
Prefect cloud
Error
No response
Versions
Prefect cloud
Additional context
No response
Hi Yaron, are you using a Prefect Cloud push work pool?
If not, I don't think we can help you with the Docker rate limit. That's something you'd need to work out with Docker by signing up for a paid plan that gives you higher rate limits for pulling Docker images from the public registry.
Another possibility is that you could consider hosting a registry in your cloud provider, like GCP's Artifact Registry or AWS's Elastic Container Registry. They can usually be set up to act as pull-through caches for the public registries.
@chrisguidry We do use the new ECS push pools for this flow...
Ah I'm sorry, I meant a fully managed work pool (one where Prefect Cloud runs the actual workloads too). So in this case, Prefect Cloud is still sending the work to ECS in your AWS project, so it's not Prefect encountering the rate limit. I think you'll still need to pursue one of the options I mentioned: paying Docker for a higher rate limit or setting up a pull-through caching registry on ECR in your AWS project. Then you'd reference the image from your caching registry in your work pool rather than the default.
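To make the last step concrete, here is a minimal sketch of how a Docker Hub image reference could be rewritten to point at an ECR pull-through cache. It assumes a pull-through cache rule for Docker Hub already exists in the account. The account ID, region, and the `docker-hub` repository prefix are hypothetical placeholders, and `cached_image_uri` is an illustrative helper, not part of Prefect or AWS tooling.

```python
def cached_image_uri(
    image: str,
    account_id: str,
    region: str,
    prefix: str = "docker-hub",  # hypothetical prefix chosen when the cache rule was created
) -> str:
    """Map a Docker Hub image reference to its ECR pull-through cache URI."""
    # Drop an explicit Docker Hub hostname, if the reference carries one.
    for host in ("registry-1.docker.io/", "index.docker.io/", "docker.io/"):
        if image.startswith(host):
            image = image[len(host):]
            break

    # Docker Hub official images (e.g. "python:3.10") live under "library/".
    name = image.split("@")[0].split(":")[0]
    if "/" not in name:
        image = "library/" + image

    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{prefix}/{image}"


if __name__ == "__main__":
    # Placeholder account ID and region, for illustration only.
    print(
        cached_image_uri(
            "docker.io/prefecthq/prefect:2.18.0-python3.10",
            account_id="123456789012",
            region="us-east-1",
        )
    )
```

The resulting URI is what would go in the work pool's image setting in place of the default `prefecthq/prefect` reference, so that ECS pulls through ECR rather than directly from Docker Hub.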
This just happened again:
We just had another spike of:
Failed due to a TaskFailedToStart error: the flow run could not be submitted to infrastructure because of a CannotPullContainerError. The error is related to fetching an anonymous token from Docker Hub, resulting in a 503 Service Unavailable response.
And also:
Failed due to a(n) TaskFailedToStart error caused by a CannotPullContainerError while attempting to pull the image docker.io/prefecthq/prefect:2.18.0-python3.10. The retries to fetch the image manifest failed due to authorization issues, likely related to a 503 Service Unavailable status.
And this one as well:
Failed due to a(n) TaskFailedToStart error caused by failing to pull a container image because of unexpected status and service unavailability when fetching an anonymous token.
Link to a failed run: failed run
~~hey, copying this from an earlier comment on the issue but you're hitting a rate limit between your aws and docker. See below for a couple of solutions:~~
~~> So in this case, Prefect Cloud is still sending the work to ECS in your AWS project, so it's not Prefect encountering the rate limit. I think you'll still need to pursue one of the options I mentioned: paying Docker for a higher rate limit or setting up a pull-through caching registry on ECR in your AWS project. Then you'd reference the image from your caching registry in your work pool rather than the default.~~
EDIT: see below. I was referencing an old screenshot with 429s as opposed to the 5xx's
It's also possible that the 5xx errors you were seeing were related to Docker's outage yesterday.
> hey, copying this from an earlier comment on the issue but you're hitting a rate limit between your aws and docker. See below for a couple of solutions:
> > So in this case, Prefect Cloud is still sending the work to ECS in your AWS project, so it's not Prefect encountering the rate limit. I think you'll still need to pursue one of the options I mentioned: paying Docker for a higher rate limit or setting up a pull-through caching registry on ECR in your AWS project. Then you'd reference the image from your caching registry in your work pool rather than the default.
We use Prefect Cloud. The images are pulled from prefect's registry.
Sorry, I was looking at the wrong screenshot and saw 429s instead of 5xx errors. @zzstoatzz is correct that the crashed flows look like they are related to a Docker Hub outage. The errors occur between your AWS account and Docker Hub when pulling a public image. One way to mitigate this during a future Docker Hub outage is to use the pull-through caching method mentioned above.
> It's also possible that the 5xx errors you were seeing were related to Docker's outage yesterday.
Yes, the time frame aligns: