mold_worker alternative to fork_worker
Description
See discussion at https://github.com/puma/puma/issues/3596 (closes #3596)
We implemented a version of the "promote to mold" solution discussed there and have been dogfooding it at Instacart for a couple of weeks. The lingering bugs have been shaken out and the results have been as desired, so I'm putting up a draft PR while we work through the upstreaming requirements (coverage, documentation, etc.).
mold_worker is similar in concept to fork_worker, except that it solves some stability issues, reaches improved performance significantly faster, and uses less memory. In your config file, add a call like mold_worker 400, 800, 1600 (it takes 0..n integer arguments; the default with no arguments is 1000). Each time a worker passes one of those request-count thresholds, it is promoted to a mold and triggers a phased restart of all the other workers. Losing a worker will also trigger a mold promotion if no mold exists, or just a refork from the active mold if one does.
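For reference, a minimal config sketch; the worker count and thresholds here are purely illustrative:

```ruby
# config/puma.rb -- illustrative values, tune for your app
workers 8

# Promote a worker to a mold (and phased-restart the others) as request
# counts pass 400, 800, then 1600; with no arguments the single threshold is 1000.
mold_worker 400, 800, 1600
```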
~It is worth pointing out that this comes with a significant risk of orphan processes, since your workers end up n generations removed from the original cluster parent. This can be resolved with PR_SET_CHILD_SUBREAPER as discussed in the linked issue. We have a small wrapper library around this functionality available here (coming soon to rubygems) [Realized while setting this up that this gem in fact already exists, at https://rubygems.org/gems/child_subreaper (a name so good that, like crabs or trains, it is an inevitable outcome), so h/t to @byroot for that]; you can just install child_subreaper and somewhere in your app initialization run ChildSubreaper.enable to mark the original cluster parent process as a child subreaper. That way, when mold promotion or termination takes place, any workers of the terminated mold end up as child processes of the original cluster parent rather than becoming orphaned.~ [EDIT: This has been inlined, so you should not have to worry about it, assuming you are on a Linux system where it is available. If you are not, Puma does use a set of pipes to communicate pids back and forth, so the cluster parent can generally still track where workers are even if they get orphaned, but orphaning does make some things a lot harder and opens you up to edge cases, so consider carefully whether it's worth it for your use.]
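For completeness, the (now superseded) gem approach struck through above amounted to roughly the following; the initializer path is just an example, and with the inlined support you shouldn't need any of it:

```ruby
# e.g. config/initializers/child_subreaper.rb (path is just an example)
require "child_subreaper"

# Mark the original cluster parent as a child subreaper (Linux
# PR_SET_CHILD_SUBREAPER) so workers re-parent to it instead of being
# orphaned when a mold is promoted or terminated.
ChildSubreaper.enable
```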
Your checklist for this pull request
- [x] I have reviewed the guidelines for contributing to this repository.
- [x] I have added (or updated) appropriate tests if this PR fixes a bug or adds a feature.
- [ ] My pull request is 100 lines added/removed or less so that it can be easily reviewed.
- [x] If this PR doesn't need tests (docs change), I added `[ci skip]` to the title of the PR.
- [x] If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
- [x] I have updated the documentation accordingly.
- [x] All new and existing tests passed, including Rubocop.
you can just install it and somewhere in your app initialization run `ChildSubreaper.enable`
I haven't looked at this PR diff, but for the record that gem was initially intended as a Pitchfork dependency, but ultimately I figured it was silly to have such a small dependency. Given Puma already ships with a C extension, I'd inline that code in Puma.
So to test it I just change fork_worker to mold_worker?
Other callbacks like on_refork are all the same? (No change needed?)
on_refork doesn't apply to mold_worker (at least right now) because it doesn't make sense in this context; the mold process isn't doing anything between reforks that would require a callback to prepare. There is an on_mold_promotion hook instead, for the point when a worker changes to a mold, that can handle similar kinds of cleanup if needed. (Open to discussion on these points; it just didn't make sense to me to overload existing hooks when some of the behavior stays the same but some is fundamentally different.)
You can test it out just by switching the DSL commands, yes, but mold_worker will do best with 1) a relatively high worker count, 8+ I would say, and 2) multiple mold promotion intervals, e.g. mold_worker 400, 800, 1600, 3200 and possibly further (or earlier) depending on your app. I would put this in an even-more-experimental status than fork_worker: we are using it widely in production, but the process of getting it there involved a lot of incremental rollout and close observation. That said, I'm happy to answer any questions if you want to take the leap and dogfood it; the more people try it out, the more stable the feature can be.
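As a rough sketch of what that config can look like (the hook body here is only an illustration, assuming an ActiveRecord app; do whatever cleanup your app actually needs):

```ruby
# config/puma.rb -- sketch only
workers 8
mold_worker 400, 800, 1600, 3200

# Runs when a worker is promoted to a mold, roughly where you might have
# reached for on_refork in a fork_worker setup.
on_mold_promotion do
  # illustrative cleanup before the promoted process starts forking
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord::Base)
end
```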
@MSP-Greg adding CI coverage and docs is just making the PR bigger; should I be following @joshuay03's lead and splitting out as much as I can, or will larger PRs still be considered for review? There are a couple of parts that could slightly benefit fork_worker on their own and might shave ~50-100 lines off, and other parts that could ship as dead code if maintainers would prefer several smaller changesets. My preference is to keep it together where possible, since the separable changes are all no- to low-impact on their own, but I'm not the one who has to review it.
@toregeschliman
Thanks for working on this, and thanks to Instacart for supporting the work.
splitting out as much as I can, or will larger PRs still be considered for review?
I've looked at the code, and with PRs pertaining to new options/features (or large refactors), splitting up the code can be a PITA without a clear benefit...
From your comments, it sounds like the code has been well tested, so I'm okay with the current PR structure.
Off topic: with PRs related to bugs, it's very helpful to have the test commits before the 'lib' commits. It makes it easy to check out the test commit(s) and see what tests are failing...
Greg
it's very helpful to have the test commits before the 'lib' commits
fully agree, but for the next week or so I'm mostly just celebrating surviving the integration specs... I feel https://github.com/puma/puma/discussions/3034 in my bones after working through those.
We have been running it for about 3 weeks in prod in largely the same form (2 bugs fixed in that time: a minor memory bug and a race condition), hitting tens of thousands of processes at peak, so it's been through its paces. We've also run it on normal Puma setups (no fork_worker or mold_worker) and haven't seen any issues with them. One case we haven't really had real-world coverage of is existing fork_worker setups; after the investigation leading to this, we don't really trust it enough to run in prod. I don't think it overlaps with any of the changes for mold_worker (maybe in one case they || to the same path), but it should otherwise be unaffected.
@toregeschliman
I wasn't suggesting rearranging the test commits, as this is an 'options/features' PR.
I'm mostly just celebrating surviving the integration specs
I constantly have the tests pass locally (I can test on all three OS's), but once the changes hit GHA, intermittent failures occur. Can be really frustrating.
To their credit, locally they actually did uncover issues (I didn't really understand USR1's purpose at all until I dug in and noticed it wasn't defined behavior), but they are not the easiest to unwind if you're starting out with a naive cargo cult. In GHA they always resolved after at most a retry, which is pretty par for the course for complex tests.
Running the following on a small app (too lazy to set up decent memory monitoring) for a week:
```ruby
mold_worker(400, 800, 1600, 3200)

on_mold_promotion do
  # Run GC before forking
  3.times { GC.start }
end
```
Would it be possible to explain in the docs exactly what these numbers are? For mold_worker(400, 800, 1600, 3200), does that mean that after the cluster receives each of 400, 800, 1600, and 3200 requests since cluster start, the mold is promoted, and after that it never gets promoted again?
mold_worker is capable of reforking at multiple intervals; by default it will trigger one mold promotion and a phased refork at 1000 requests, but you can pass 1..n intervals instead and it will trigger a refork as it passes each one.
To my understanding, for fork_worker it's boot -> 1000 requests (or whatever you set) -> refork -> nothing else.
For mold_worker it's boot -> 1000 -> mold and refork from the mold -> nothing else.
And additionally, if you specify multiple intervals like 400, 800, 1600, then:
boot -> 400 -> mold and refork -> 800 -> mold and refork -> 1600 -> mold and refork -> nothing else.
There is no option for infinite reforking (boot -> 1000 -> refork -> repeat forever) AFAIK (I read the code).
@wlipa @PikachuEXE's explanation here is correct, and it's what I tried to explain in the documentation here; please let me know if the explanation is too dense or poorly worded and I can flesh it out a bit more.
I was just not sure if it was requests per cluster or per worker process. If you added "Request counts are per cluster since boot." it would be totally clear to the likes of me.
AFAIK Puma doesn't track a request count per cluster anywhere; they are all per-worker. Regardless, I have updated the doc in my branch of final changes; just waiting for any additional feedback.
should I be following @joshuay03's lead and splitting out as much as I can, or will larger PRs still be considered for review?
I think having everything in this one PR is completely fine. The thing I split out is generic enough to warrant its own PR, and doing so gives us one less (refactor) change to grok.
I'm very keen for this! I'll try my best to review over the next week or so.
For every worker_check_interval (default 10s) there is a ping, and every worker will be pinged (in sequence, I guess).
If a worker's requests_count is not past the next_request_interval (when present; e.g. for 400, 800, 1600 the thresholds are 400/800/1600), then nothing happens.
The first pinged worker whose requests_count exceeds the next threshold is passed into mold_and_refork!.
And if there is a SIGURG signal (sent to the manager?), mold_and_refork! is called without an argument and will use the worker with the largest requests_count.
(From reading the source code, might be wrong.)
(This is a reply to "I was just not sure if it was requests per cluster or per worker process." in https://github.com/puma/puma/pull/3643#issuecomment-2787237320.)
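In other words, my reading boils down to roughly this; a simplified sketch with approximate names, not the actual Puma source:

```ruby
# Simplified sketch of the promotion trigger as read above; not Puma's code.
def pick_worker_to_promote(workers, next_request_interval, sigurg: false)
  if sigurg
    # SIGURG: promote whichever worker has served the most requests
    workers.max_by(&:requests_count)
  elsif next_request_interval
    # periodic worker check: first worker past the next threshold, if any
    workers.find { |w| w.requests_count > next_request_interval }
  end
end
```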
That is correct, yes; the only other nuance is that when you refork from a mold, all the new workers inherit the mold's request count, so after the first refork every worker starts at 400 requests and only has to do 400 more to trigger the next mold and refork (at which point all the new workers will start at a count of 800, etc.).
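To make the inherited-count behavior concrete, here is a tiny standalone illustration (not Puma code):

```ruby
# Standalone illustration of the per-worker counts described above.
intervals = [400, 800, 1600]

inherited = 0
intervals.each do |threshold|
  remaining = threshold - inherited
  puts "workers start at #{inherited} requests; promote + refork after #{remaining} more (at #{threshold})"
  inherited = threshold # new workers inherit the mold's request count
end
# workers start at 0 requests; promote + refork after 400 more (at 400)
# workers start at 400 requests; promote + refork after 400 more (at 800)
# workers start at 800 requests; promote + refork after 800 more (at 1600)
```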
I am running a branch at https://github.com/toregeschliman/puma/pull/13/files for any final feedback; IMO it's easier to review when you don't have a moving target, and hopefully there's nothing substantive left.
after the first refork every worker starts at 400 requests and only has to do 400 more to trigger the next mold and refork
I think this is not quite clear in the docs (even with that PR doc update).
You should probably describe clearly that the values passed into mold_worker are the number of requests since the last boot/restart, not since the last mold promotion, ideally with the example you just provided.
Update: still not clear.
Anyone had a chance to dogfood this and have feedback they could share?
Not even deployed to staging yet. Production next week at most. Wait 2 weeks at least (I gotta monitor the RAM usage).
Update 1: Deployed to production, but since RAM usage only rises to its highest during the weekend (we deploy daily), we will wait 1 week to collect enough data.
Before enabling mold_worker
Tested with 400,800,1600,3200,6400 and too much memory was used.
Will disable it next week and try with fewer refork intervals (probably the default value) the week after.
@PikachuEXE how many workers are you running? There should be no additional cost to extra refork cycles (new molds replace old ones, so they don't accrue over time), but the extra mold process does take up some additional private memory, so for a net memory saving you need a fairly high (10+) worker count to balance that out. You can try running Process.warmup in your on_mold_promotion hook to help, though depending on your setup that might do more harm than good; we found it increased private memory of the cluster parent (likely because the GC.compact triggered some COW faults?) but didn't reduce private memory of the mold process, possibly because we use jemalloc, which IIUC no-ops the malloc_trim component of warmup.
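If you want to experiment with that, it would look something like the following; whether it helps is very setup-dependent, as noted, so measure before keeping it:

```ruby
# config/puma.rb -- sketch only
on_mold_promotion do
  # Ruby 3.3+: full GC, heap compaction and (with glibc) malloc_trim before
  # the mold starts forking; results vary, especially under jemalloc.
  Process.warmup if Process.respond_to?(:warmup)
end
```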
As mentioned above and in the issue this resolves, the arguably bigger net benefit of this mode is reduced warmup cost: a worker restarted after the final refork generation should come up significantly faster than a brand new one, which typically results in lower p99+ latencies. This is even more significant if you are running YJIT, since a mostly warmed cache there is a world of difference from a cold start.
@MSP-Greg sorry for the nudge, but is there anything I can do to move this forward? It's getting harder to justify running off a personal branch for as long as we have. I've got two minor changes (docs and eliminating a log line on shutdown) ready to go here; just waiting to hear if there's anything else that needs addressing.
@toregeschliman
I apologize. I'm involved with the setup-ruby code for GHA, and Windows builds recently 'exploded'. I hope to be finished with that very soon; Puma (and this) will follow...
EDIT: I doubt Windows Ruby is used in the cloud that much, but there are people... Regardless, testing with Windows has come a long way, and a lot of repos test against it. I hope that continues, but when CI starts failing...
It's getting harder to justify running off a personal branch for as long as we have.
What's the alternative?
The alternative I'd hope for is that the feature gets merged, so we could run off master and then eventually a normal versioned release; that makes normal maintenance easier and reduces the perceived risk of maintaining a permanent fork into the future, which is what we were trying to avoid by upstreaming the feature. I was just checking to see if there was anything I could do to get us to that lower-risk state, but it sounds like it's just a capacity issue, which I completely understand.
Not seeing too much memory saving before/after enabling mold_worker...
Will run for a few more weeks
Update 1: Maybe I should raise the thresholds to later points...
Update 2: Tried 10000 but got inconsistent results out of 3 VMs; maybe someone manually restarted or a deployment went wrong. Reviewing it again next week...
Update 3: Not much changed after another week
Update 4: Using 80000,160000,320000,640000
Update 5: 80000,160000,240000,320000,480000,640000
👋 Hi, are there any updates on this? Another issue comment mentioned that this feature could be used to support gRPC clients in Puma. That would be extremely useful for our team 🙏
@abbas-square can your team review this? Can you try out the changes? Can you run this in production for some time?
The community needs to come together and own this and push Puma forward, there's no magic silver bullet here :)