
Graceful shutdowns considered… almost impossible?

Open • hynek opened this issue 6 months ago • 8 comments

Hi,

so I've spent the better part of yesterday trying to understand why all my WSGI apps running in Granian running in Docker running in Nomad behind HAProxy always get SIGKILLed after a timeout, although they work just fine locally, both on macOS and within a local Docker container.

It was very frustrating, because it turned out that I ran into two independent problems. I wanted to share my findings so that my time wasn't entirely wasted, and maybe suggest at least some documentation fixes.

Problem 1: Health checks & HTTP/2

The symptom here is:

[INFO] Shutting down granian
[INFO] Stopping worker-1

And then the worker just hangs forever, not accepting new requests, unless I set GRANIAN_WORKERS_KILL_TIMEOUT.

This was my original suspicion based on #568 and #548, and I solved it by setting GRANIAN_HTTP = "1" and GRANIAN_HTTP1_KEEP_ALIVE = "no".
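For reference, the workaround above expressed as an environment for launching Granian, e.g. from a container entrypoint. This is a sketch assuming a Flask app served through Granian's WSGI interface; `app:app` is a placeholder target, `granian_env` is a hypothetical helper, and the kill timeout is an optional last-resort safety net whose value is an arbitrary example.

```python
import os

# Settings reported above as fixing Problem 1: pin HTTP/1.1, disable keep-alive.
WORKAROUND_ENV = {
    "GRANIAN_HTTP": "1",
    "GRANIAN_HTTP1_KEEP_ALIVE": "no",
}


def granian_env(base=None, kill_timeout=None):
    """Return a copy of `base` (default: os.environ) with the workaround applied.

    `kill_timeout`, if given, sets GRANIAN_WORKERS_KILL_TIMEOUT so hanging
    workers are killed after the timeout instead of blocking shutdown forever.
    """
    env = dict(os.environ if base is None else base)
    env.update(WORKAROUND_ENV)
    if kill_timeout is not None:
        env["GRANIAN_WORKERS_KILL_TIMEOUT"] = str(kill_timeout)
    return env


# Hypothetical launch command; `app:app` stands in for your module:callable.
COMMAND = ["granian", "--interface", "wsgi", "app:app"]
```

Usage would be along the lines of `subprocess.run(COMMAND, env=granian_env(kill_timeout=30))`. Keep in mind that the kill timeout only bounds the hang; it does not make the shutdown graceful.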

What's puzzling here is: is it possible at all to combine HTTP/2, a load balancer with health checks, and graceful shutdowns/restarts?

If not, it should be clearly stated in the docs; if yes, please tell me how and put it into the docs. 😅

Problem 2: Weird interaction with a database driver

This one is weirder and caused by the bane of my existence: sqlanydb.

The symptom is similar but slightly different; I get only one line:

[INFO] Shutting down granian

and Granian happily keeps serving requests as if nothing happened.


I know it's the driver's fault, because if I run a local Docker container with the same app and only hit my heartbeat endpoint, I can kill it just fine.

But if I hit an endpoint that tries to connect to a database and therefore initializes the driver, I get the same behavior as in prod: it keeps serving until SIGKILLed.

And I have no solution to this except using Gunicorn, where it works just fine.

I'm not sure what to do about this, and I'm not sure if you're interested at all in figuring out why a driver with fewer than 10k downloads per month is wreaking havoc like this. But I would suspect that other ctypes-based packages might have the same problem? Most people don't care about the shutdowns of their apps, but I do, and I also run cleanups in atexit handlers, so I'm like the 1% who would notice that.
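To make the atexit point concrete: Python runs atexit handlers on a normal interpreter shutdown, but a process terminated by SIGKILL gets no chance to run any of them, so every forced kill after a timeout silently skips that cleanup. A minimal POSIX-only sketch; the child program and the `run_child` helper are illustrative and unrelated to Granian's or sqlanydb's internals.

```python
import signal
import subprocess
import sys

# Child program: registers an atexit cleanup, announces readiness, then
# either exits normally or idles until it gets killed.
CHILD = r"""
import atexit, sys, time
atexit.register(lambda: print("cleanup ran", flush=True))
print("ready", flush=True)
if sys.argv[1] == "exit":
    sys.exit(0)    # normal interpreter shutdown: atexit handlers run
time.sleep(60)     # wait here until the parent kills us
"""


def run_child(mode: str) -> str:
    """Run the child in `mode` ("exit" or "kill") and return all its stdout."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, mode],
        stdout=subprocess.PIPE,
        text=True,
    )
    assert proc.stdout is not None
    first = proc.stdout.readline()  # block until the child printed "ready"
    if mode == "kill":
        proc.send_signal(signal.SIGKILL)  # cannot be caught: no cleanup at all
    rest, _ = proc.communicate(timeout=10)
    return first + rest
```

`run_child("exit")` includes `cleanup ran` in its output, while `run_child("kill")` only ever shows `ready`.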

Let me know either way if this is something you'd be interested in pursuing. As I said: you don't need a SQL Anywhere server, just the client driver, which is also available here: https://help.sap.com/docs/SUPPORT_CONTENT/sqlany/3362971128.html. I could not reproduce the problem on macOS, tho. Just Linux in Docker (I didn't try Linux without Docker).


granian 2.3.4, Python 3.13.5, Ubuntu Noble, mostly Flask, but also tried one Litestar

hynek • Jun 19 '25 07:06

hey @hynek 👋

On the keepalive thing: I'm currently working on some code changes that should allow Granian to actually close keep-alive connections on shutdown. So that should fix your case, as well as #548 and #568. I just haven't decided (yet) if those changes will land in an additional 2.3 patch release or in the next 2.4 feature release. I can keep you updated on that matter.

But also, if you run Granian behind a proxy, I would definitely stick with a specific protocol, so in general I would suggest deciding whether you want to use HTTP/1.1 or HTTP/2 between the proxy and Granian and forcing that with the relevant --http {version} option. That might also be useful for tuning some parameters: e.g., if you stick with HTTP/2, you can tune --http2-keep-alive-interval and --http2-keep-alive-timeout, so that once Granian is able to close keep-alive connections, you can ideally configure those values depending on how you configured HAProxy. Right now, I don't think there's a way to prevent HTTP/2 connections from being kept alive, as HTTP/2 uses streams, so the whole protocol is generally structured to work that way.
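A sketch of pinning the protocol on the Granian side as suggested above, building the argv in Python (e.g. for a container entrypoint). The flag names come from the comment above; `granian_http2_command` is a hypothetical helper, `app:app` and any numeric values are placeholders, and the units and defaults of the keep-alive options should be checked against `granian --help` for your version.

```python
def granian_http2_command(target, interval=None, timeout=None):
    """Build a Granian argv pinned to HTTP/2, with optional keep-alive tuning.

    `interval` and `timeout` are passed through unchanged; pick them to
    match the timeouts configured on the proxy (e.g. HAProxy) side.
    """
    cmd = ["granian", "--interface", "wsgi", "--http", "2"]
    if interval is not None:
        cmd += ["--http2-keep-alive-interval", str(interval)]
    if timeout is not None:
        cmd += ["--http2-keep-alive-timeout", str(timeout)]
    cmd.append(target)  # module:callable of the WSGI app
    return cmd
```

Passing the options explicitly, rather than relying on `--http auto`, keeps the backend protocol fixed regardless of what the proxy negotiates, which is the point of the advice above.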

As for the database driver: I don't really look at downloads/popularity of 3rd-party packages to decide whether to provide a good experience. I think trying to figure out what's going on can lead to wider improvements, not necessarily just involving that driver. Thus: if you can provide an MRE with a few lines of code that use that driver, and save me the time of also learning how to use it, I can definitely take a look and try to debug the whole thing.

PS/OT: love your YT videos.

gi0baro • Jun 19 '25 08:06

> I'm currently working on some code changes that should allow Granian to actually close keep-alive connections on shutdown. So that should fix your case, as well as https://github.com/emmett-framework/granian/issues/548 and https://github.com/emmett-framework/granian/issues/568. I just haven't decided (yet) if those changes will land in an additional 2.3 patch release or in the next 2.4 feature release

In the meantime, FYI, if you want to try the new implementation, you can build Granian from #612.

gi0baro • Jun 19 '25 08:06

> On the keepalive thing: I'm currently working on some code changes that should allow Granian to actually close keep-alive connections on shutdown. So that should fix your case, as well as #548 and #568. I just haven't decided (yet) if those changes will land in an additional 2.3 patch release or in the next 2.4 feature release. I can keep you updated on that matter.

To be clear, I'm in no rush. I honestly also don't have any service where HTTP/1 vs HTTP/2 in the backend would make any palpable difference. I just like having nice things. :) And I'm mostly trying to wrap my head around the whole thing, because my understanding is that it's currently impossible to have HTTP/2 + load balancer + health checks + graceful shutdowns, and it looks like I'm not crazy and you agree. :)

> But also, if you run Granian behind a proxy, I would definitely stick with a specific protocol, so in general I would suggest deciding whether you want to use HTTP/1.1 or HTTP/2 between the proxy and Granian and forcing that with the relevant --http {version} option. That might also be useful for tuning some parameters: e.g., if you stick with HTTP/2, you can tune --http2-keep-alive-interval and --http2-keep-alive-timeout, so that once Granian is able to close keep-alive connections, you can ideally configure those values depending on how you configured HAProxy.

FWIW, I did force it on the HAProxy side but that doesn't solve my problem from above. :)

> Right now, I don't think there's a way to prevent HTTP/2 connections from being kept alive, as HTTP/2 uses streams, so the whole protocol is generally structured to work that way.

At least none that is documented. 😇

> As for the database driver: I don't really look at downloads/popularity of 3rd-party packages to decide whether to provide a good experience. I think trying to figure out what's going on can lead to wider improvements, not necessarily just involving that driver. Thus: if you can provide an MRE with a few lines of code that use that driver, and save me the time of also learning how to use it, I can definitely take a look and try to debug the whole thing.

It was less painful than I feared: https://github.com/hynek/granian-sqlanydb-repro

You'll just have to download + unpack the driver into the directory as explained in the README.

LMK if I can help any further. If it turns out they're doing something bad, I can ask someone to open an official ticket since we're paying for that nonsense.

> PS/OT: love your YT videos.

Thank you! 💛 I should be writing right now. 🙈

hynek • Jun 19 '25 13:06

JFTR, Problem 1 seems to be solved by the 2.4 release! I've deployed a few apps and am watching them for now. Basic experimentation was promising.

hynek • Jun 29 '25 10:06

> JFTR, Problem 1 seems to be solved by the 2.4 release! I've deployed a few apps and am watching them for now. Basic experimentation was promising.

Nice! As for the database driver: I haven't had a chance to play with it yet, but I should have some time to look at it next week.

gi0baro • Jun 29 '25 10:06

> Nice! As for the database driver: I haven't had a chance to play with it yet, but I should have some time to look at it next week.

Aaaaaand? 😇

I'm kinda anxious it's gonna be some super weird internal C API state nonsense. :-/ My hope is that it's general to ctypes and not just my crap. 😇

hynek • Jul 08 '25 05:07

> Aaaaaand? 😇
>
> I'm kinda anxious it's gonna be some super weird internal C API state nonsense. :-/ My hope is that it's general to ctypes and not just my crap. 😇

@hynek I'm truly sorry for the (very, very long) time this is taking; last month was quite messy, and I still wanted 2.5 to go out, so all of my efforts went in that direction. Now, with that out of the way, I should be able to take a proper look at this. I'll update you in the next few days.

gi0baro • Jul 30 '25 20:07

heh no worries, I've just apologized the same way on issues that are older than granian ;)

I'm just really curious about what the hell is going on there.

hynek • Jul 31 '25 10:07