FastAPI deployed with hypercorn in GCP Cloud Run returning 503 sporadically

Open bgregoinductiva opened this issue 1 year ago • 8 comments

I have a FastAPI project deployed in Cloud Run using the hypercorn server. I'm using uvloop as the event loop and leaving all other settings at their default values:

hypercorn app.main:app --bind 0.0.0.0:80 --worker-class uvloop

Here are the Cloud Run configurations:

  • Memory: 1 GiB
  • CPU: 1
  • Maximum concurrent requests per instance: 80
  • CPU is only allocated during request processing
  • Minimum number of instances: 1
  • Maximum number of instances: 30
  • Startup CPU boost
  • Use HTTP/2 end-to-end

During integration testing, when concurrent requests peak at about 30, I usually get a 503, and then a new instance is started.
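For anyone trying to reproduce the burst, here is a minimal sketch using only the Python standard library. The function names and the target URL are hypothetical, the default of 30 simply mirrors the peak described above, and the client speaks HTTP/1.1, so treat it as a rough way to tally status codes rather than a faithful copy of the integration tests.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for one GET request (0 on network errors)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses (including 503) arrive as HTTPError.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection-level failure: no HTTP status is available.
        return 0


def burst(url: str, concurrency: int = 30) -> Counter:
    """Send `concurrency` simultaneous GET requests and tally the status codes."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(fetch_status, [url] * concurrency))


if __name__ == "__main__":
    # Hypothetical URL: replace with your own Cloud Run service endpoint.
    print(burst("https://example-service.a.run.app/health"))
```

A tally containing 503 entries right before a new instance starts would match the behaviour described here.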

Has anyone faced a similar problem before?

Thanks in advance.

bgregoinductiva avatar May 06 '24 16:05 bgregoinductiva

Yes, based on what I have learnt so far, your instance was terminated because it used more memory than its defined limit.

Even though this says that Cloud Run will return a 500, in my testing I was able to show that it actually returns a 503. Their documentation leaves a lot to be desired.

Hope this helps.

nabheet avatar May 13 '24 02:05 nabheet

We have the same issue, but at only 40% memory usage at the 99th percentile.

nielsbox avatar May 21 '24 15:05 nielsbox

Update: we isolated the issue to only HTTP/2. HTTP/1 seems to be fine.

nielsbox avatar May 24 '24 14:05 nielsbox

> Update: we isolated the issue to only HTTP/2. HTTP/1 seems to be fine.

Is the HTTP/1 traffic encrypted? There seems to be an asyncio memory leak with SSL.

pgjones avatar May 26 '24 10:05 pgjones

>> Update: we isolated the issue to only HTTP/2. HTTP/1 seems to be fine.
>
> Is the HTTP/1 traffic encrypted? There seems to be an asyncio memory leak with SSL.

Cloud Run terminates TLS. https://cloud.google.com/run/docs/container-contract#tls

nielsbox avatar May 27 '24 11:05 nielsbox

Also, I hate to admit this in public, but I wasn't closing SQL connections in the health check endpoint, so it was leaking file descriptors. This caused our Cloud Run containers to crash without any log events, and the Cloud Run LB returned a 503.
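To illustrate the kind of fix involved, here is a minimal sketch of a health check that always releases its connection. It uses the standard-library `sqlite3` module purely as a stand-in, since the actual database driver isn't named here; `database_is_healthy` and its `dsn` parameter are hypothetical names.

```python
import sqlite3
from contextlib import closing


def database_is_healthy(dsn: str = ":memory:") -> bool:
    """Run a trivial query and close the connection afterwards, even on error."""
    try:
        # closing() guarantees conn.close() runs when the block exits,
        # so the underlying file descriptor is released on every call.
        with closing(sqlite3.connect(dsn)) as conn:
            with closing(conn.cursor()) as cur:
                cur.execute("SELECT 1")
                return cur.fetchone() == (1,)
    except sqlite3.Error:
        return False
```

The same pattern should carry over to other drivers or a connection pool: acquire the connection inside a context manager so that it is closed or returned on every code path.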

So another thing to check would be your file descriptor count.
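One rough way to check this from inside the container, assuming a Linux environment where `/proc` is available, is to count the entries in `/proc/self/fd` and compare them with the process limit. The helper names below are hypothetical.

```python
import os
import resource


def open_fd_count() -> int:
    """Count file descriptors currently open in this process (Linux only).

    The listing itself briefly opens one descriptor for the directory,
    so the result is one higher than the steady-state count.
    """
    return len(os.listdir("/proc/self/fd"))


def fd_soft_limit() -> int:
    """Return the soft limit on open file descriptors for this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    print(f"open fds: {open_fd_count()} / soft limit: {fd_soft_limit()}")
```

Logging this value periodically, or exposing it from a debug endpoint, should reveal a leak as a count that climbs steadily under load and never comes back down.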

nabheet avatar May 27 '24 17:05 nabheet