wptop
wptop copied to clipboard
Honour disabling yourself when disabled in the Wordpress dashboard
I recently had to fix a customer machine where they'd been running all their web requests with full profiling enabled (wptop_enable()
) . That's really more their problem from oversight in development/ops than it is a problem of the plugins.
Perhaps you could raise a fatal error if profiling is started, but the plugin is disabled, I'm not sure.
I'm actually not a Wordpress or PHP developer, just a guy who got lucky solving someone's server hosting hell for them.
So they had deactivated the plugin but left wptop_enable() on? I think adding the http://codex.wordpress.org/Function_Reference/is_plugin_active check and suppress enable/disable would work in honour of disabling the plugin.
That's exactly what they'd done, so it was causing pretty hellish problems. They were surprised that it was still running "but we disabled that weeks ago!?", and told me categorically that they had no profiling plugins running.
The smoking gun was that I couldn't actually strace
the processes, at first because of the dynamic pooling … then with a fixed php-fpm pool it was easy to see they were spending the majority of their time in getrusage
, I couldn't see the sample frequency in the code but more than 99% of the samples of the process according to strace
were in getrusage
. I spent some time looking around for why, and not being a PHP guy, concluded that it must be something about PHP-FPM master/slave resource management… since… as they'd said, they weren't profiling anything (also google searches for php getrusage
lead to nothing but something about tuning a PHP system that talked to an Oracale database, which was spitting out too much getrusage).
Long story short, no data was being logged, and we found wptop_enable()
in wp-config.php
. I'm mostly writing the issue so that the next poor soul who has to debug it might stumble upon this.
Thanks. Acknowledged and will fix later today. How detrimental would you say the plugin was to application performance?
Setup was 7 servers, 5 responsible for web. 8 cores (3.+ghz 64bit intel), 12Gb of RAM and all SSDs… the following script was enoguh to cripple the site with the plugin enabled:
for i in $(seq 1 15); do
curl --with-fair --options-for-keep-alive -H 'cookie: wp-login-cookie' http://their.site/wp-admin/ &
done
The result was a near instant failure of PHP-FPM, since when being traced the worker processes don't seem to respond to signals.
A single request would respond fairly quickly, 45-50sec, with the plugin disabled, we're closer to 3-4 sec response times now which is as good as we can get (they have ~40-50 WP plugins, including some large ones line buddy press and LMS, which we are still trying to tune).
With wptop
really disabled (no call to wptop_enable()
) I basically can't kill the workers anymore, although they do seem to leak, that's solved with pm.max_requests
, and works relatively well.
I'll point out whilst I'm here too that although they were using SSDs, they had a bind-mount of a remote NFS over the webroot, which causes things that should be fast like lstat(...)
to be way slower than they should be, which also compounded the problem, since with remote locks on the NFS drive, and all syscalls taking fractionally longer because of the trace, the lock contention was much more of a problem than it ought to have been,
Interesting. Was /tmp/ mounted over NFS as well?
/tmp
wasn't being used, in ~~the nginx config~~ edit: it was in the php-fpm pool config they were exporting some variables to PHP-FPM which I think effectively had the effect of using a different tmp
, which was set to <webroot>/tmp
(so, yes.)
Oh, just realized, that wouldn't matter; all profiles are stored in the database, not the flat filesystem.
Also looking at the code I realize that since wptop_enable
is designed to work without a WordPress context at all, loaded as early as possible, there's no way to identify whether the WordPress plugin is active (this check needs a database connection).
The result was a near instant failure of PHP-FPM, since when being traced the worker processes don't seem to respond to signals.
So what sort of failure? Did they freeze? Segfault? And what sort of signals were being sent.
It's quite surprising to see XHProf, a production-target profiler, fail so miserably.
So what sort of failure? Did they freeze? Segfault? And what sort of signals were being sent.
Hard to say, they locked up, essentially, but without ever dying
as far as the system is concerned. If they had entered uninterruptible IO (process state "D"), and stayed there, the system would have signalled them eventually, or PHP-FPMs timeouts would have done... (SIGTERM
, or something I'd have guessed)
It seems however that because they never enter state D
because the profiler keeps waking them up (also via signals, I think), then they are always state "R" (running). Because they were still "Running" things like iotop
and the munin monitoring them shows them essentially as doing 0% IO, since the processes are never truly asleep waiting for IO to finish.
The interesting thing is that it was like a live-lock, the CPU and RAM were essentially idle and empty, and it was impossible to strace or measure the processes because of the xhprof
getrusage
calls spamming all the reports.
Also processes that never sleep don't always respond well to signals, as signals are buffered in the kernel, so if our processes were slowing down because of too many profiling signals the might be failing to drain the queue fast enough to respond to signal handlers for hard-timeouts, and other PHP-FPM pool management signals
Also looking at the code I realize that since wptop_enable is designed to work without a WordPress context at all, loaded as early as possible, there's no way to identify whether the WordPress plugin is active (this check needs a database connection).
Indeed, I noticed this too, and of course that makes sense, since loading Wordpress itself has a cost, and that cost is an important part of deciding how to configure something like an object cache or APC.
In your shoes I might be tempted to set a variable (global? eek.) somewhere in wptop_enable()
, and raise a fatal warning if then it turns out that when the wp context is loaded, that the plugin is disabled. At least to warn people that they need to do more to disable it.
Or, perhaps it's enoguh to fix the documentation in the wordpres admin UI that the plugin is only the report viewer, but that profiling will happen regardless as long as the "# Installation" guidelines were followed, as that exists before the WP context is up.
I'll think about adding a message upon plugin deactivation to remind people to remove the enable scaffolding. wptop_enable
can't hook into the WordPress context at all, and since the plugin code is disabled and not running there's no way to run custom code in WordPress related to wptop
, maybe during a shutdown
function; like if during shutdown it's found that wptop
as a plugin was never run emit a warning into logs.
I wonder whether the problem lies in not running xhprof_disable. Or were the same issues observed with the plugin active?
We didn't enable the plugin again, but that's a valid point, if it's never stopped, where does the trace log go, if anywhere?
I'm guessing traces should be kept in memory of XHProf internally (I'd have to look at their source code later today), but who knows what happens to them if they're not stopped/dumped, especially in a static worker setup, maybe it keeps accumulating allocating heap for each request and never freeing it up. Will know more in a couple of hours after having a chance to look at XHProf closely.
Cool: note the static change I made was mostly so that I could strace
a known PID with some degree of certainty that it'd stick around. They were using a dynamic pool, with timeouts of 360s
(:tears:)
Perhaps a saner installation option might be to recommend that profiling runs on ~1% of requets…
<?php if (rand(0, 100) == 42) { wptop_enable(); } ?>