perf: collect production profiles to enable PGO
Profile Guided Optimization can significantly improve performance of Go programs.
To enable this optimization, we need to continuously collect profiles from a production environment.
https://go.dev/doc/pgo
Just randomly saw this and thought I'd chip in. I've done this (on libphp, not the Go code) with a sample Symfony app served in worker mode and got around 15% better throughput. The issue is that a CLI benchmark performed ~7% worse afterwards, and I don't know if we really want to offer two, three or more different configurations for users to pick the right one for their scenario.
When it comes to optimising the Go code, I'm afraid there will be very, very little improvement possible or necessary, because most of the time is spent in PHP in real-world examples. Directly serving a hello world via Caddy achieves ~70x more throughput than a hello world in a one-line PHP script.
Thanks for the hint!
Did you use GCC?
What could be nice is to provide a script to easily compile a PGO version of FrankenPHP, so advanced users can do it easily.
Yes, using gcc. Because of how much performance gcc global registers provide, compiling with clang is sadly not a realistic option. PGO and LTO together make up less ground than global registers do. Makes me wonder... do you know if the macOS Homebrew version is compiled with real gcc?
What could be nice is to provide a script to easily compile a PGO version of FrankenPHP, so advanced users can do it easily.
I will integrate it into static-php-cli to make it easily accessible to users. For advanced users it's already very simple: first compile with -fprofile-generate, run load testing, then recompile with -fprofile-use. I don't think this optimises the Go code, but rather the cgo and (more importantly) the libphp code.
Edit: this makes it seem like it's clang's fault that gcc is so much faster. It's not, it's just that php-src heavily uses the global register variable gcc extension to optimise code. Compiling with --disable-gcc-global-regs actually produces slightly slower code with gcc than with clang.
I didn't think of optimizing the C code, but that makes sense for sure. I'd love to see more details on how it's done and some crude benchmarks.
And if you're going down that road, it seems like a small lift to do Go PGO as well: you just build, create a CPU profile and rebuild. I've seen 5% improvements in the past. I'd also be curious to see the effect of Go 1.25's Green Tea GC and json/v2.
Though it'll be negligible in the context of a PHP app, as you said. Would it help more in worker mode? And perhaps it may help particularly when making Go-based PHP extensions?
It would help more in worker mode, but even there the potential gains would be negligible.
Caddy's side of a PHP script call takes around a tenth of a millisecond for me. The slow context switches across the Go-to-C bridge can't be optimised. In a real application, the PHP script will most likely take around 10-60ms (at least that's what my rather large Symfony project takes in production).
In the best case it would net a 15% gain in the part that takes up less than 1% of the total time.
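To put rough numbers on that, here is a small back-of-the-envelope sketch using Amdahl's law with the figures from this thread (0.1ms on the Go side, 10-60ms in PHP). It assumes that a "15% gain" means the Go part runs 1.15x faster; `overallSpeedup` is just an illustrative helper name.

```go
package main

import "fmt"

// overallSpeedup applies Amdahl's law: if `fraction` of the total time is
// sped up by `factor`, the whole request gets faster by the returned ratio.
func overallSpeedup(fraction, factor float64) float64 {
	return 1 / ((1 - fraction) + fraction/factor)
}

func main() {
	const goMs = 0.1      // Caddy/Go side of a request, from the thread
	const goFactor = 1.15 // assumption: 15% gain == 1.15x faster Go code

	for _, phpMs := range []float64{10, 60} {
		fraction := goMs / (goMs + phpMs)
		s := overallSpeedup(fraction, goFactor)
		fmt.Printf("php=%.0fms  go share=%.2f%%  overall gain=%.3f%%\n",
			phpMs, fraction*100, (s-1)*100)
	}
}
```

Under these assumptions the end-to-end gain stays well below 1%, which is why the PHP/C side is the more interesting target.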
Yeah, I agree. I even shared similar feedback long ago when I saw folks doing a lot of work trying to optimize the Go code for irrelevant (e.g. TechEmpower) benchmarks. Better to focus energy on the PHP/C side, ergonomics, amazing features like creating PHP extensions in Go, documentation, etc.