
Prioritised request queue

ShadowJonathan opened this issue 1 year ago • 4 comments

Is your feature request related to a problem? Please describe.

Currently, an application I'm maintaining (Mastodon) uses Puma to dispatch work to Rack (and its controllers).

However, this application serves roughly 3 types of requests:

  1. User API requests
  2. Browser HTML requests
  3. Server-to-server API-like requests

This latter request class is especially susceptible to sudden spikes in request volume at certain times, but with base puma, and without any other setup, these spikes would "clog" the request queue and make the server unresponsive to users.

In other words: request type 1 is most sensitive to latency and retries, and request type 3 is least sensitive and will retry resiliently.

For this problem, we are currently using a combination of nginx (to tag requests with a priority) and haproxy (to queue requests by that priority) to prioritise user requests over server requests and others.

Haproxy tracks the total number of outstanding requests to workers, and makes sure that only requests at the "bottom" of the weighted/segmented queue are immediately dispatched when a slot opens.

Describe the solution you'd like.

The above solution works for our use-case, but since this application is deployed widely and has the same problem everywhere, I'd like a feature to exist within puma/rails to allow for queue priority.

I'm currently looking at the Reactor class for this, which takes in new requests and selects ready ones to be dispatched to Rack in the thread pool.

For this feature, Puma would have to introspect each request when it comes in, and assign it a priority number according to some rules.

For the purposes of the application I'm talking about, these rules can be as simple as:

  • URL path prefix/match (regex)
  • Header presence / content match (regex)

It could most likely be implemented by accepting a &block with an input of the request + headers, and an output of an integer, which would then determine the priority of the request in the todo queue.

These rules could probably be given in puma.rb, and be shipped with the application.

This would effectively only "weigh" the todo queue, as described in the architecture documentation.
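To make the shape of the proposal concrete, here is a rough sketch of what such a hook could look like in puma.rb. `prioritize_request` is a hypothetical name (no such DSL method exists in Puma today), and the paths and headers are just example rules:

```ruby
# Hypothetical puma.rb configuration -- `prioritize_request` does not exist
# in Puma; it only illustrates the proposed block-based API. Lower numbers
# mean higher priority here.
prioritize_request do |env|
  if env["PATH_INFO"].start_with?("/api/")
    0 # user API traffic: most latency-sensitive
  elsif env["HTTP_SIGNATURE"] # e.g. signed server-to-server federation traffic
    2 # least latency-sensitive, retries resiliently
  else
    1 # browser HTML traffic
  end
end
```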

Describe alternatives you've considered.

The current setup (with nginx and haproxy) can be considered "fine", and while I have documented how to implement it, in practice only very savvy administrators are willing to set it up (and only after "thundering herds" become an actual problem), so adoption will be limited.

ShadowJonathan avatar Jun 21 '24 11:06 ShadowJonathan

The general concept here has been brought up before.

I don't think implementing it should include any 'rule based' logic. Maybe the app provides a block to Puma, Puma passes env, and the block returns the priority as an integer or something?

MSP-Greg avatar Jun 21 '24 13:06 MSP-Greg

Ah, apologies. Yes, I meant to generalise it to a block; the "rule based" logic I brought up was me thinking about the way I went about implementing this in nginx, which is relatively "rule by rule".

A block from the app could check for a few if-else rules and return integer values based on that, but that's an application detail at that point.


I've tried searching for keywords for previous issues, but I haven't found any. What was the consensus on this? I believe it could be a non-invasive change, and a fruitful one.


Looking at the code, this could easily be done as a custom class that replaces the default array assigned to @todo in ThreadPool; request priority would then effectively order that array/queue internally, as far as .unshift is concerned.
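As a minimal sketch of that idea (this is not Puma code; it assumes the pool only appends work and takes it from the front, and that ThreadPool's existing mutex guards access, so no internal locking is done here):

```ruby
# Sketch of an array-compatible class that could stand in for ThreadPool's
# @todo. Work items are grouped into FIFO buckets keyed by priority;
# `priority_for` is a hypothetical block mapping a work item to an Integer.
class PrioritizedTodo
  def initialize(&priority_for)
    @priority_for = priority_for
    @buckets = Hash.new { |h, k| h[k] = [] } # priority => FIFO bucket
  end

  def <<(work)
    @buckets[@priority_for.call(work)] << work
    self
  end

  # Lowest priority number wins; FIFO within a bucket.
  def shift
    _prio, bucket = @buckets.sort.find { |_, queue| !queue.empty? }
    bucket&.shift
  end

  def size
    @buckets.each_value.sum(&:size)
  end
  alias length size

  def empty?
    size.zero?
  end
end
```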

ShadowJonathan avatar Jun 21 '24 13:06 ShadowJonathan

With the last comment, I'm currently thinking of writing a PR for this, to at least move the conversation along a little on implementation details, and to bring it a step closer to reality.

ShadowJonathan avatar Jun 23 '24 09:06 ShadowJonathan

Created this PR at https://github.com/puma/puma/pull/3415, reviews welcome atm, since I want to know if this is the direction that Puma wants to go in.

ShadowJonathan avatar Jun 23 '24 13:06 ShadowJonathan

I'm not keen on this complexity living at this level. Others might disagree, but I think we have more than enough complexity in the current implementation to keep our hands full. I also think the ACCEPT behavior of puma is already sufficiently confusing that the average developer would have a difficult time navigating it. See https://github.com/puma/puma/issues/3487 for some explanations and edge cases people might not be aware of yet. Specifically: In current Puma, the server only accepts a number of requests it can handle (based on thread availability) so if prioritization was being done IN this logic, it would most commonly prioritize 1 request, which is not what you're looking for.

Annnnd....looking at your PR it seems you've discovered this.

I might suggest some kind of alternative, such as a proxy in front of puma. It sounds like you've got this setup already. If that's the case, I'm curious if there's a problem with it or if you're looking for something more "out of the box" with puma? If it works fine, then I might suggest documenting how to set this up. Which you've also already done.

It could possibly live in puma/puma; however, it seems a bit odd to have documents about tools we don't maintain.

Alternatively, push the logic into the application. I.e., if you can detect that the system is overloaded somehow, then have a Rack middleware that checks the URL and returns a 429, and make sure those endpoints are using some sort of client rate throttling to slow the retries: https://schneems.com/2020/07/08/a-fast-car-needs-good-brakes-how-we-added-client-rate-throttling-to-the-platform-api-gem/
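A minimal sketch of that middleware, where the `/inbox` path and the `overloaded` callable are illustrative assumptions rather than any real Mastodon or Puma API:

```ruby
# Sheds low-priority traffic with a 429 when the app decides it is
# overloaded. `/inbox` stands in for a federation endpoint; `overloaded`
# is a hypothetical predicate the application would supply.
class ShedFederationTraffic
  def initialize(app, overloaded: -> { false })
    @app = app
    @overloaded = overloaded
  end

  def call(env)
    if env["PATH_INFO"].start_with?("/inbox") && @overloaded.call
      # Lowercase header name per Rack 3 conventions.
      [429, { "retry-after" => "30" }, ["Too Many Requests"]]
    else
      @app.call(env)
    end
  end
end
```

It could then be wired up in config.ru with something like `use ShedFederationTraffic, overloaded: -> { ... }`.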

Anywhoo. I'm inclined to close this. It doesn't seem like core functionality we want to maintain.

schneems avatar Oct 15 '24 13:10 schneems

If that's the case, I'm curious if there's a problem with it or if you're looking for something more "out of the box" with puma?

I'm mainly looking for application integration: an application (Mastodon, in this case) being able to ship with a "batteries included" prioritisation system that allows the application to feel fast and smooth, no matter how many requests are thrown at it.

Currently, it has problems doing this, as there are essentially 2 (3, if we're being specific) kinds of users:

  • Regular (authenticated) users, these have the highest sensitivity to latency
  • Browser or anonymous requests and users, these can be de-prioritised a little bit
  • Other applications, 'federated' services, which have the lowest sensitivity to latency, and "can wait a little bit"

Returning 429 to the latter is not really an option; some federated implementations deal with timeouts better than they deal with 429s, some don't have back-off or anything of the sort and just do their 5 retries all within a second, so to speak. So the better option would be to have these wait until capacity arrives (either by the influx lessening, or by auto-scaling adding more capacity), when they can then be properly processed.


The role of Puma in all of this would be to provide a way where you "just need mastodon" for this, with no extra in-between layer (which it looks like it increasingly needs); that layer is an extra complexity step to implement, which means most sysadmins would not do it.

ShadowJonathan avatar Oct 16 '24 11:10 ShadowJonathan

Returning 429 to the latter is not really an option; some federated implementations deal with timeouts better than they deal with 429s,

If the requests are coming from inside of mastodon to itself, presumably we could modify the code to introduce appropriate 429 behavior. If the goal is to decrease the load from these "internal" requests, it seems acceptable to introduce this rate limiting and throttling behavior to only requests it knows are from itself.

"just need mastodon" for this,

My suggestion would be to get Mastodon to move away from relying on the network for communication from itself through a public-facing API. Things like zeromq as an alternative come to mind. Do you have more context for why they do this? What's the use case?

In the process of researching for this change https://github.com/puma/puma/pull/3524 I found this issue https://github.com/puma/puma/issues/612, and it's due to the behavior you're talking about from Mastodon. It was surprising to me that someone would want to do that. It's also worth mentioning that if those requests are blocking, there's a very real race condition that could block the entire server. It's less likely if there's more than 1 thread, but it's still possible. The only way to really remove it would be to not make network requests to the same puma application.

In short: It feels like this is a useful pain-point in that it's hinting to mastodon that perhaps they've made a technical decision that needs to be revisited. I'm also interested in having a "mastodon works out of the box" experience, but not at the cost of encouraging behavior that I don't understand or condone.

Which brings me back to: Do you know why Mastodon uses this pattern? Maybe if I better understood the use-case I could suggest something more appropriate.

schneems avatar Oct 16 '24 17:10 schneems

My suggestion would be to get Mastodon to move away from relying on the network for communication from itself through a public-facing API. Things like zeromq as an alternative come to mind. Do you have more context for why they do this? What's the use case?

This is HTTP-based federation, ActivityPub, and it's communication with entirely separate domains and servers.

This is not mastodon self-feeding requests; it's mostly an issue of needing to prioritise certain incoming requests based on relative merit, not inherent dependency.

The majority use-case works fine, but this is an extra request: mastodon being able to provide request prioritisation as a "batteries included" component would make it more useful.


(Also, apologies I did not reply to this sooner. Did you close the issue because this entire feature isn't desirable in puma, or due to the miscommunication? (Again, this is about traffic-class request prioritisation, not request-dependency prioritisation))

ShadowJonathan avatar Oct 18 '24 07:10 ShadowJonathan

Short: I was going to close the issue before I replied. I should have spelled out why, but I tend to write too much or too little. I think the request is fundamentally asking Puma to go outside of the HTTP spec and implement something that's not proven and sparsely defined. We don't have the maintenance capacity to take it on, and I'm not sure if we did we would choose to implement priority queues. I suggest focusing efforts on ActivityPub and related specifications such that servers can better communicate with each other for load-shed purposes in a way that can be implemented in a rack middleware.

Long:

did you close the issue based off of that this entire feature isn't desirable in puma, or due to the miscommunication

I was already going to close it before asking you. I was also interested in getting more info on the behavior of mastodon for my own understanding, and in turn would try to use it to also help you build your case (if at all possible).

To this specific issue: I think that at a high level Puma is not really in the "business" of providing prioritized traffic control. It seems to fall a little outside of our current scope.

Accepting any change is somewhat a balance of costs and benefits. If the change is small and it's a general purpose feature that is easily used by many cases, then it's easy to merge. The flip is also true. I closed because I don't see a light-weight way towards implementing this without it being very invasive and/or difficult to manage.

I'm open to new evidence in the future or possibly a new framing. Generally it's much easier to support something framed as fixing a bug than as adding a net-new feature. Even if features fall out of that bug. However I don't "want" a priority queue implementation at this time. The existence of the open issue might mistakenly convey that to someone.

This is HTTP-based federation, ActivityPub, and it's communication with entirely separate domains and servers.

This is not mastodon self-feeding requests; it's mostly an issue of needing to prioritise certain incoming requests based on relative merit, not inherent dependency.

That somewhat makes sense. Thanks for that extra context. I'm imagining that when mastodon is doing its thing, there are cases where it knows content exists at a URL, but might not know that URL is served by the currently running instance. Another use-case of an app pulling from itself could be an RSS reader that also has an RSS feed as part of a blog served as a sidecar on a rails app (say, via a /blog route in Rails). When you boot a running app, most of them don't know their own domain, so the app couldn't distinguish "this is me".

That really helps give me some specific examples of why an app would have that type of behavior.

I can't pre-clear a change or pre-clear an issue, but I think a stronger argument for you would be using the general case "apps need to make requests and cannot distinguish when they're requesting from the same URL or not" with the intersection of "a puma app can deadlock if it requests from itself too much." However, this is a REALLY hard problem. Even #640 doesn't eliminate the possibility, it just makes it less likely. It might be a famously unsolvable problem in some textbook somewhere. Also, as I'm talking aloud, it's almost the opposite of what you're suggesting: I'm describing the case where we want to process our own requests FIRST.

AND as I go back to your original post...I think I've completely misrepresented your original case.

Server-to-server API-like requests

You're not talking about requests from the SAME server (which was my mistaken impression). It could be, but doesn't have to be. You're talking about other servers effectively executing background jobs that are not as time-sensitive. Almost like some sort of distributed sidekiq-puma hybrid (wildly over-simplified).

So while I understand the problem space better, and the request, I'm still left with the same constraints as before. I think we're fundamentally dealing with a one-sided communication problem here: https://en.wikipedia.org/wiki/Two_Generals%27_Problem. The current server is under heavy load, but has no mechanism to communicate that to requesting servers.

Introducing new queues also introduces new DDoS vectors. If a server-to-server request is ALWAYS behind a user-facing request, then an attacker can exploit that information to queue-jump, effectively disabling server-to-server requests (for example). Queues tend to be easy on the surface, and quite hard once you reach terminal conditions or edge cases.

To the future

Where I was suggesting a code change in mastodon previously, I might suggest looking into an ActivityPub spec change now, in service of shaping the logic such that a request could make its way into rack middleware, and then letting the app code determine whether it should tell the requesting server to "try again later" or not.

You can get information about the currently executing puma server via Puma.stats, and we're trying to add some new stats that should help, like https://github.com/puma/puma/pull/3517. If you're behind a proxy that adds a request start time (Heroku adds HTTP_X_REQUEST_START), that will hint at how long the request stayed in the TCP queue before it made its way to being processed. That's commonly used to monitor for variance. Though even with all that information, there would still need to be some sort of metric that could be used as a threshold.
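For illustration, a sketch of how an app might derive an overload signal from Puma.stats (only available inside a running Puma server; the exact keys vary by Puma version and single vs. cluster mode, and the threshold here is arbitrary):

```ruby
require "json"

# Single-mode Puma exposes "pool_capacity" (free threads) and "max_threads"
# at the top level; cluster mode nests per-worker stats under "worker_status".
# Treating < 25% free capacity as "overloaded" is an example, not a
# recommendation.
def overloaded?
  stats    = JSON.parse(Puma.stats)
  capacity = stats.fetch("pool_capacity", 0)
  max      = stats.fetch("max_threads", 1)
  capacity.to_f / max < 0.25
end
```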

Regarding the spec, I'm not sure if this doc is canonical but I found https://www.w3.org/TR/activitypub/#security-federation-dos

Servers SHOULD also take care not to overload servers with submissions, for example by using an exponential backoff strategy.

I'm a bit surprised it doesn't spell out in detail what that means. I would expect a standardized status code or some other mechanism to indicate "we are overloaded." Possibly going back to the standards committee and raising: "let's really spell out how one server tells another server to back the F off, and what that other server MUST do to comply." Alternatively, if the protocol allows for a delayed response: instead of "here's the data you asked for", what if the requester got back a payload with a token and was told to check back later to see if the request was fulfilled, possibly with a suggested delay time?
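As a sketch of what that client-side guidance could amount to (nothing here is prescribed by the spec; `deliver` is a hypothetical method performing the actual HTTP POST and returning a Net::HTTP response):

```ruby
# Exponential backoff with jitter for a federation client, honoring a
# Retry-After header when the overloaded server supplies one.
def deliver_with_backoff(activity, max_attempts: 5)
  max_attempts.times do |attempt|
    response = deliver(activity)
    return response unless response.code.to_i == 429

    # Fall back to exponential backoff when no Retry-After is given;
    # rand adds jitter so retries don't synchronize across clients.
    wait = (response["Retry-After"] || 2**attempt).to_f
    sleep(wait + rand)
  end
  raise "delivery failed after #{max_attempts} attempts"
end
```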

I'm rambling, and this is quite long already. I think if I had to sum up my position it would be this: If you're building a specification or communication protocol on HTTP, you (they, the authors of ActivityPub) should include strong guidance on how to be a responsible server and a respectful client using HTTP mechanisms. Asking puma to implement priority queues on top of HTTP would turn it into something other than an HTTP server. We already don't implement HTTP/2, and that's got a whole IETF spec with thousands of lines and multiple real-world implementations. If I had the maintenance time free, I would probably invest there instead of here.

I've laid out some interests and seeds here that could possibly be revisited in the future or in other issues. I want to see mastodon succeed, but I also need to be protective of scope and goals of this project.

schneems avatar Oct 18 '24 19:10 schneems

Late to the party but I just wanted to say I very much liked your write-up here @schneems – thanks for taking the time for that ❤️

To this specific issue: I think that at a high level Puma is not really in the "business" of providing prioritized traffic control. It seems to fall a little outside of our current scope.

Accepting any change is somewhat a balance of costs and benefits. If the change is small and it's a general purpose feature that is easily used by many cases, then it's easy to merge. The flip is also true. I closed because I don't see a light-weight way towards implementing this without it being very invasive and/or difficult to manage.

I agree 👍

dentarg avatar Nov 07 '24 20:11 dentarg