prefect Flow run could not be submitted to infrastructure: TaskFailedToStart - 429 Too Many Requests

First check

[X] I added a descriptive title to this issue.
[X] I used the GitHub search to find a similar issue and didn't find it.
[X] I searched the Prefect documentation for this issue.
[X] I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

Hi, we just had many flow runs fail with the error:

Flow run could not be submitted to infrastructure: TaskFailedToStart - CannotPullContainerError: ref pull has been retried 1 time(s): failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/prefecthq/prefect/manifests/sha256:69998e9cf3744779b98d816351e6ac837dc0eea4f450e0ad4180b6aefe995d33: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

CleanShot 2024-02-16 at 12 01 07@2x

We are on the PRO tier: CleanShot 2024-02-16 at 12 02 41@2x

Any idea why we are getting those errors?

Reproduction

Prefect cloud

Error

No response

Versions

Prefect cloud

Additional context

No response

Feb 16 '24 10:02 yaronlevi

Hi Yaron, are you using a Prefect Cloud push work pool?

If not, I don't think we can help you with the Docker rate limit. That's something you'd need to work out with Docker by signing up for a paid plan that gives you higher rate limits for pulling Docker images from the public registry.
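To see how close a given machine is to the anonymous limit, something like the sketch below may help. It assumes the `ratelimitpreview/test` repository and the `ratelimit-remaining` response header that Docker documents for this purpose, and it only reports the limit for the IP it runs from, so it would need to run from the same network egress as the ECS tasks. The function names are made up for illustration.

```python
import json
import urllib.request

# Endpoints Docker documents for inspecting the anonymous pull limit (assumed unchanged).
TOKEN_URL = (
    "https://auth.docker.io/token"
    "?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull"
)
MANIFEST_URL = "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest"


def parse_ratelimit(value: str) -> tuple[int, int]:
    """Parse a header value such as '100;w=21600' into (pulls, window_seconds)."""
    count, _, window = value.partition(";w=")
    return int(count), int(window)


def fetch_remaining_pulls() -> tuple[int, int]:
    """Ask Docker Hub how many anonymous pulls remain for this machine's IP."""
    # Fetch an anonymous bearer token scoped to the test repository.
    with urllib.request.urlopen(TOKEN_URL) as resp:
        token = json.load(resp)["token"]
    # A HEAD request returns the headers without downloading the manifest.
    request = urllib.request.Request(
        MANIFEST_URL,
        method="HEAD",
        headers={"Authorization": f"Bearer {token}"},
    )
    # urlopen raises HTTPError if the registry answers with 429 or 5xx.
    with urllib.request.urlopen(request) as resp:
        remaining = resp.headers.get("ratelimit-remaining")
    if remaining is None:
        raise RuntimeError("registry did not return a ratelimit-remaining header")
    return parse_ratelimit(remaining)
```

Calling `fetch_remaining_pulls()` needs outbound network access; `parse_ratelimit` can be used on its own against a header value copied from a `curl --head` response.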

Another possibility is that you could consider hosting a registry in your cloud provider, like GCP's Artifact Registry or AWS's Elastic Container Registry. They can usually be set up to act as pull-through caches for the public registries.

Feb 16 '24 13:02 chrisguidry

@chrisguidry We do use the new ECS push pools for this flow...

Feb 16 '24 18:02 yaronlevi

Ah I'm sorry, I meant a fully managed work pool (one where Prefect Cloud runs the actual workloads too). So in this case, Prefect Cloud is still sending the work to ECS in your AWS project, so it's not Prefect encountering the rate limit. I think you'll still need to pursue one of the options I mentioned: paying Docker for a higher rate limit or setting up a pull-through caching registry on ECR in your AWS project. Then you'd reference the image from your caching registry in your work pool rather than the default.
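As a rough sketch of that last step: once a pull-through cache rule exists in ECR, the image name in the work pool changes from the Docker Hub reference to the cached one. The helper below only builds that string. The account ID, region, and the `docker-hub` repository prefix are placeholders, and it assumes the input is a Docker Hub image, so adjust it to match your own rule.

```python
def to_ecr_cache_uri(
    image: str,
    account_id: str,
    region: str,
    prefix: str = "docker-hub",  # hypothetical prefix chosen when creating the cache rule
) -> str:
    """Map a Docker Hub image reference to its ECR pull-through cache URI."""
    # Strip an explicit Docker Hub hostname if one is present.
    for host in ("docker.io/", "index.docker.io/", "registry-1.docker.io/"):
        if image.startswith(host):
            image = image[len(host):]
            break
    # Official images (e.g. "python:3.10") live under the implicit "library/" namespace.
    repository = image.split("@")[0].split(":")[0]
    if "/" not in repository:
        image = "library/" + image
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{prefix}/{image}"


if __name__ == "__main__":
    # Placeholder account ID and region, for illustration only.
    print(
        to_ecr_cache_uri(
            "docker.io/prefecthq/prefect:2.18.0-python3.10",
            account_id="123456789012",
            region="us-east-1",
        )
    )
```

The resulting URI is what you would set as the image on the work pool or deployment instead of `prefecthq/prefect:...`, so ECS pulls from ECR and only ECR talks to Docker Hub.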

Feb 16 '24 21:02 chrisguidry

This just happened again:

CleanShot 2024-05-19 at 10 36 54@2x

CleanShot 2024-05-19 at 10 37 11@2x

https://app.prefect.cloud/account/8eed9803-456a-4126-a7f7-074aa44aa1b2/workspace/8ff919f1-a2c3-4660-9ab5-66b57758d46d/runs/flow-run/83610ac5-695e-4db7-ba5e-65f682c788bd?tab=Logs

May 19 '24 07:05 yaronlevi

We just had another spike of:

Failed due to a TaskFailedToStart error: the flow run could not be submitted to infrastructure because of a CannotPullContainerError. The error is related to fetching an anonymous token from Docker Hub, resulting in a 503 Service Unavailable response.

And also:

Failed due to a(n) TaskFailedToStart error caused by a CannotPullContainerError while attempting to pull the image docker.io/prefecthq/prefect:2.18.0-python3.10. The retries to fetch the image manifest failed due to authorization issues, likely related to a 503 Service Unavailable status.

And this one as well:

Failed due to a(n) TaskFailedToStart error caused by failing to pull a container image because of unexpected status and service unavailability when fetching an anonymous token.

Link to a failed run: failed run

CleanShot 2024-06-21 at 18 25 33@2x

CleanShot 2024-06-21 at 18 28 50@2x

Jun 21 '24 15:06 yaronlevi

CleanShot 2024-06-21 at 18 34 44@2x

Jun 21 '24 15:06 yaronlevi

~~hey, copying this from an earlier comment on the issue but you're hitting a rate limit between your aws and docker. See below for a couple of solutions:~~

~~> So in this case, Prefect Cloud is still sending the work to ECS in your AWS project, so it's not Prefect encountering the rate limit. I think you'll still need to pursue one of the options I mentioned: paying Docker for a higher rate limit or setting up a pull-through caching registry on ECR in your AWS project. Then you'd reference the image from your caching registry in your work pool rather than the default.~~

EDIT: see below. I was referencing an old screenshot with 429s as opposed to the 5xx's

Jun 21 '24 15:06 jakekaplan

It's also possible that the 5xx errors you were seeing were related to Docker's outage yesterday.

Jun 21 '24 15:06 zzstoatzz

> hey, copying this from an earlier comment on the issue but you're hitting a rate limit between your aws and docker. See below for a couple of solutions:
>
> > So in this case, Prefect Cloud is still sending the work to ECS in your AWS project, so it's not Prefect encountering the rate limit. I think you'll still need to pursue one of the options I mentioned: paying Docker for a higher rate limit or setting up a pull-through caching registry on ECR in your AWS project. Then you'd reference the image from your caching registry in your work pool rather than the default.

We use Prefect Cloud. The images are pulled from prefect's registry.

Jun 21 '24 15:06 yaronlevi

Sorry, I was looking at the wrong screenshot and saw 429s instead of 5xx errors. @zzstoatzz is correct that the crashed flows look related to a Docker Hub outage. The errors occur between your AWS account and Docker Hub when ECS tries to pull a public image. One way to mitigate a future Docker Hub outage is the pull-through caching approach mentioned above.
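Since the two failure modes are easy to mix up in the run logs, here is a minimal sketch of telling them apart from the error text. The function name and the category labels are made up for illustration, and the matching is a plain heuristic on the status codes that appear in the messages quoted in this thread, not anything Prefect or ECS provides.

```python
import re


def classify_pull_error(message: str) -> str:
    """Return 'rate_limited', 'registry_unavailable', or 'unknown' for a pull error."""
    # Docker Hub rate limiting shows up as HTTP 429 with a "toomanyrequests" code.
    if "toomanyrequests" in message or re.search(r"\b429\b", message):
        return "rate_limited"
    # Any standalone 5xx status (e.g. 503 Service Unavailable) suggests a registry outage.
    if re.search(r"\b5\d\d\b", message):
        return "registry_unavailable"
    return "unknown"


if __name__ == "__main__":
    print(classify_pull_error(
        "429 Too Many Requests - Server message: toomanyrequests: "
        "You have reached your pull rate limit."
    ))
    print(classify_pull_error(
        "CannotPullContainerError: failed to fetch anonymous token: "
        "503 Service Unavailable"
    ))
```

A rate-limit result points at authenticating or caching pulls; an unavailable result points at checking Docker's status page for the same time window.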

Jun 21 '24 15:06 jakekaplan

> It's also possible that the 5xx errors you were seeing were related to Docker's outage yesterday.

Yes, the time frame aligns:

CleanShot 2024-06-21 at 18 46 53@2x

Jun 21 '24 15:06 yaronlevi
