
Ubiquity php-pm High concurrency level benchmark

Open jcheron opened this issue 5 years ago • 45 comments

Hi, I implemented a PHP-PM adapter for the Ubiquity framework, whose performance I am testing on the TechEmpower benchmarks. For the moment, the results are not conclusive (see Ubiquity and Ubiquity-react).

For the Multiple Queries test, which requires a database connection and so could have benefited php-pm, I have the following results:

Responses per second for multiple queries test (1, 5, 10, 15, 20), concurrency level of 512: image

The source code and configuration are available on TechEmpower repository. The PHP-PM test (currently called Ubiquity-react) uses php-pm as a web server and load balancer. Dockerfile to build & run is ubiquity-react.dockerfile.

I have the same symptoms as @oozone with Laravel (see https://github.com/php-pm/php-pm/issues/453): poor performance, and warning on some requests:

Script did not return a valid HTTP response. Maybe it has called exit() prematurely?

I also tried testing the script manually with ab. At concurrency level 5, php-pm is more efficient than php-fpm:

PHP-PM image

Apache image

If I increase the concurrency level, performance degrades for php-pm:

PHP-PM image

Apache image

The TechEmpower multiple query tests are run at a concurrency of 512... At that level, my php-pm setup generates warnings. The results are the same for the other tests (Plaintext or JSON).

I wonder if I'm doing something wrong somewhere. Adjusting the number of workers does not change much in this context.

jcheron avatar Jun 06 '19 01:06 jcheron

Ping @marcj. I could imagine that the degrading performance at higher concurrency levels might be due to only having a single incoming process for all connections? What really bothers me again are the unclear "slave not responding" errors. Do you have any idea how to improve logging to catch those?

andig avatar Jun 06 '19 07:06 andig

@jcheron regarding your config, you should adjust the number of workers to match the level of parallelism. Each worker handles only a single request at a time. Once there are more requests than the workers can handle, requests start queuing up at the master and, in the worst case, time out before being handed to a slave. The ppm log output for your test cases would be helpful, too.

Also, as far as Ubiquity goes, are you really sure that you're not starting an entire application instead of just processing a request in https://github.com/phpMv/ubiquity-php-pm/blob/master/src/PHPPM/Ubiquity.php#L55?

andig avatar Jun 06 '19 08:06 andig

@andig Thank you for your answer

Do you have any idea how to improve logging to catch those?

When I enable logging when starting the php-pm server, all requests return a 200 status until the warnings start appearing. But I have no additional information.

image

with -vvv:

image

If I increase the total number of requests to be made, I notice that the problems always occur on the last 10 percent. Increasing the number of workers (128 for concurrency 256) doesn't change anything.

are you really sure that you‘re not starting an entire application instead of just processing a request

Yes, the Startup::run method just handles the request. The framework is started by including the file services.php, which uses a config.php file adapted to php-pm.
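To illustrate that split, here is a simplified sketch of the pattern (only the handle() signature matches the real bridge; the bootstrap part, the paths and the assumed Ubiquity\controllers\Startup namespace are illustrative, and the sketch assumes the framework echoes its output):

<?php
// Simplified bootstrap-once / handle-per-request sketch, not the actual bridge code.

use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\ServerRequestInterface;
use Ubiquity\controllers\Startup; // assumed namespace

class UbiquityBridgeSketch
{
    private $config;

    public function bootstrap(): void
    {
        // Runs once per worker: load the php-pm-adapted config and the framework
        // services, so no per-request startup cost remains.
        $this->config = require __DIR__ . '/app/config/config.php'; // illustrative path
        require __DIR__ . '/app/config/services.php';               // illustrative path
    }

    public function handle(ServerRequestInterface $request): ResponseInterface
    {
        // Runs for every request: dispatch through the already-booted framework
        // and wrap whatever it echoes in a PSR-7 response.
        ob_start();
        Startup::run($this->config); // the entry point mentioned above
        $body = ob_get_clean();

        return new \React\Http\Response(200, ['Content-Type' => 'text/html'], $body);
    }
}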

jcheron avatar Jun 06 '19 10:06 jcheron

I didn't notice anything wrong on the framework side after enabling all available logging and error-display options.

By removing any reference to the framework in the handle method of the Ubiquity bridge:

public function handle(ServerRequestInterface $request): ResponseInterface {
	return new \React\Http\Response(200, ['Content-Type'=>'text/plain'], 'Hello World!');
}

I still have the same warnings at concurrency 512:

vendor/bin/ppm --bridge='\PHPPM\Ubiquity' --bootstrap='\PHPPM\Ubiquity' start --debug 0 --logging 0 --workers 32 --max-requests 1024 --port=8090

ab -n 1000 -c 512 http://127.0.0.1:8090/foo

image

jcheron avatar Jun 06 '19 18:06 jcheron

Do you see any chance of adding logging in the bridge? Does the bridge even receive those stray requests, or does the response somehow get stuck?

andig avatar Jun 06 '19 19:06 andig

I added logging in the bridge. But stray requests do not go through there.

jcheron avatar Jun 06 '19 22:06 jcheron

Then the next step would be to find out if the slave even receives the request or if it gets stuck in the master. Could you also find out if the slave ever sees an incoming connection in those cases?

andig avatar Jun 08 '19 07:06 andig

Until we find a solution to the high concurrency level problem, I will remove the Ubiquity-react (php-pm) benchmark from the TechEmpower results.

For now, it's bad publicity for php-pm as much as it is for Ubiquity.

jcheron avatar Jul 08 '19 17:07 jcheron

Hi @jcheron, were the problems the same when running 512 php-pm slaves? As far as I can see in your repo, php-pm is only given 64 slaves, no?

acasademont avatar Jul 11 '19 15:07 acasademont

@acasademont I gradually increased the number of workers: 16, 32 and 64, with no impact on performance so far. I concluded, perhaps hastily, that the problem was elsewhere. Initially, I understood that the number of workers had to be correlated with the number of processor cores, not with a concurrency level that we don't control in production.

Some of the TechEmpower tests go up to a concurrency level of 16,384 (Plaintext benchmark)... But I'd be willing to make an attempt at 512.

jcheron avatar Jul 11 '19 23:07 jcheron

@jcheron php-pm behaves exactly like php-fpm: it can only process one request at a time per worker. Therefore, if you use a concurrency level of 512 but limit the number of workers to 64, you're artificially limiting the number of requests that can be handled concurrently; at any given moment, up to 448 connections are simply waiting at the master.

In an ideal world you'd be right: the number of workers should equal the number of processors (that's what you do in nginx, for example), but PHP is not an async language, so any I/O call blocks the whole process. That's why php-pm went for the safer approach of limiting each worker to one concurrent request.

If you have a min_children directive set in the php-fpm config, the same number should be given to php-pm. We might then encounter other problems (for sure!), but at least we're giving php-pm the minimum number of resources it needs :p

acasademont avatar Jul 12 '19 09:07 acasademont

@acasademont we still have the Script did not return a response problem that ~~we've also seen in another, very simple Symfony app benchmark~~ we're seeing even without Ubiquity above. I couldn't find what causes those, but I suspect it might also have an impact on average response times. Any idea what might be causing them?

andig avatar Jul 12 '19 09:07 andig

Yes, definitely, we should address that problem too, but I wanted to start with a fair comparison first :p I'll try to replicate the issue and see what I can find 👌

acasademont avatar Jul 12 '19 09:07 acasademont

@acasademont I tried to increase the number of workers to 512, but it didn't work. Travis-ci tests failed.

image

So for now I'm removing Ubiquity with PHP-PM from the TechEmpower benchmarks, until we find a solution.

In the end I tried it with 16, 32, 64 and 128 workers, without seeing any real impact on performance.

jcheron avatar Jul 31 '19 22:07 jcheron

I've looked some more into this. It is definitely a problem in the core:

bin/ppm start --workers=8 --bridge=StaticBridge --static-directory=web --max-requests=32 -v &
ab -n 1000 -c 38 http://localhost:8080/

ab will fail sooner the higher the concurrency level. It will also fail sooner the larger the number of already-executed requests. With 38 concurrent connections I do sometimes get a full run on the first try, but often not:

❯ ab -n 1000 -c 38 http://localhost:8080/
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
apr_socket_recv: Connection reset by peer (54)
Total of 16 requests completed

With 32 concurrent connections it will almost always work. If it breaks, we get these messages in the log:

Script did not return a valid HTTP response. Maybe it has called exit() prematurely?

andig avatar Oct 08 '19 19:10 andig

More experiments. I've started giving the slaves IDs so requests can be traced across log messages per slave. Running this with only a single worker:

bin/ppm start --workers=1 --bridge=StaticBridge --static-directory=web --max-requests=1 -v #
Starting PHP-PM with 1 workers, using StreamSelectLoop ...
1 workers (starting at 5501) up and ready. Application is ready at http://127.0.0.1:8080/
> slave 1
Restarting worker #5501 because it reached max requests of 1

ab -n 1000 -c 100 http://localhost:8080/
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
apr_socket_recv: Connection reset by peer (54)
Total of 1 requests completed

The test dies after only a single request, but only if we choose a high concurrency level!

andig avatar Oct 08 '19 19:10 andig

/cc @WyriHaximus may I ask whether you've seen anything like the comment above before, where reactphp might get stuck at high concurrency levels?

andig avatar Oct 08 '19 19:10 andig

@andig I've only seen this happen with ab, never with wrk/wrk2 or other tools, when reaching the limits of the setup.

WyriHaximus avatar Oct 09 '19 00:10 WyriHaximus

Taking a look at this today, let's see what we can find!

acasademont avatar Dec 27 '19 18:12 acasademont

@acasademont you know where to find me if you do ;)

WyriHaximus avatar Dec 27 '19 18:12 WyriHaximus

Thanks @acasademont. I'm concentrating on programming an EV charge controller; so many projects, so little time...

andig avatar Dec 27 '19 18:12 andig

I'm starting from the basics, just a simple "ab -n 2 -c 2 http://127.0.0.1:8080/"

First weird thing I saw: ab seems to open one more connection than needed when concurrency > 1.

albert@AlbertWork:~/Projects/php-pm$[master] bin/ppm start --workers=1 --bridge=StaticBridge --static-directory=web --max-requests=10 -vvv
/Users/albert/Projects/php-pm
+---------------------+----------------------------------------------+
| bridge              | StaticBridge                                 |
| host                | 127.0.0.1                                    |
| port                | 8080                                         |
| workers             | 1                                            |
| app-env             | dev                                          |
| debug               | 0                                            |
| logging             | 1                                            |
| static-directory    | web                                          |
| bootstrap           | PHPPM\Bootstraps\Symfony                     |
| max-requests        | 10                                           |
| max-execution-time  | 30                                           |
| memory-limit        | -1                                           |
| ttl                 | 0                                            |
| populate-server-var | 1                                            |
| socket-path         | .ppm/run/                                    |
| pidfile             | .ppm/ppm.pid                                 |
| reload-timeout      | 30                                           |
| cgi-path           | /usr/local/Cellar/php@7.3/7.3.13/bin/php-cgi |
+---------------------+----------------------------------------------+
Starting PHP-PM with 1 workers, using StreamSelectLoop ...
Start new worker #5501
Worker #5501 registered. Waiting for application bootstrap ...
Worker #5501 ready.
1 workers (starting at 5501) up and ready. Application is ready at http://127.0.0.1:8080/



request incoming 1 tcp://127.0.0.1:63928
1 slaves for request1
took abnormal 0.000 seconds for choosing next free worker
Took abnormal 0.001 seconds for connecting to worker 5501
handle data  for request1
occupied slave 5501 for request 1
handle data GET / HTTP/1.0
Host: 127.0.0.1:8080
User-Agent: ApacheBench/2.3
Accept: */*

 for request1
all data buffered for 1
Current memory usage for worker 5501: 2.00 MB
[27/Dec/2019:18:58:21 +0000] 127.0.0.1 - - "GET / HTTP/1.0" 404 61 "-"
Worker 5501 took abnormal 0.007 seconds for handling a connection
request incoming 2 tcp://127.0.0.1:63929
1 slaves for request2
took abnormal 0.000 seconds for choosing next free worker
Took abnormal 0.000 seconds for connecting to worker 5501
handle data  for request2
occupied slave 5501 for request 2
**request incoming 3 tcp://127.0.0.1:63930**
0 slaves for request3
no slaves for request 3
handle data GET / HTTP/1.0
Host: 127.0.0.1:8080
User-Agent: ApacheBench/2.3
Accept: */*

 for request2
all data buffered for 2
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
0 slaves for request3
no slaves for request 3
Current memory usage for worker 5501: 2.00 MB
[27/Dec/2019:18:58:21 +0000] 127.0.0.1 - - "GET / HTTP/1.0" 404 61 "-"
0 slaves for request3
no slaves for request 3
Worker 5501 took abnormal 0.022 seconds for handling a connection
1 slaves for request3
took abnormal 0.023 seconds for choosing next free worker
Took abnormal 0.000 seconds for connecting to worker 5501
handle data  for request3
occupied slave 5501 for request 3
Worker 5501 took abnormal 0.000 seconds for handling a connection
**Script did not return a valid HTTP response. Maybe it has called exit() prematurely?**

Things that we could definitely improve:

  • Do not search for a slave until all incoming data has been buffered. It makes no sense to waste resources on the search if this turns out not to be an HTTP request after all. This would also avoid the misleading Script did not return a valid HTTP response. Maybe it has called exit() prematurely? warning, which has nothing to do with the script's response but rather comes from the fact that the connection never piped in any data.
  • Leverage the reactphp http server on the master instead of the raw socket server (see the sketch at the end of this comment). This would free us from dealing with this weird stuff: we would get an already buffered HTTP request and then choose a slave to pipe it to. Right now the master tries to be HTTP-agnostic, but it really isn't. I believe we should fully embrace HTTP right on the master and simplify our code.

Thoughts? @andig @WyriHaximus

On the slaves we could also start a simpler http server without the default reactphp middlewares, knowing that the master has already done the hard work for us.
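Roughly what I have in mind for the second point (just a sketch, assuming the plain react/http Server API; forwardToSlave() is a hypothetical placeholder for the worker selection and piping php-pm would have to implement):

<?php
// Sketch of the "HTTP-aware master" idea; forwardToSlave() is hypothetical.

require __DIR__ . '/vendor/autoload.php';

use Psr\Http\Message\ServerRequestInterface;

$loop = React\EventLoop\Factory::create();

// React\Http\Server applies its default middlewares, so the request body is
// fully buffered before our handler runs; no slave gets occupied for a
// connection that never sends a complete HTTP request.
$server = new React\Http\Server(function (ServerRequestInterface $request) {
    // hypothetical: returns a promise that resolves with the slave's ResponseInterface
    return forwardToSlave($request);
});

$socket = new React\Socket\Server('127.0.0.1:8080', $loop);
$server->listen($socket);

$loop->run();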

acasademont avatar Dec 27 '19 19:12 acasademont

In any case, to continue with this, we should not emit that Script did not return a valid HTTP response. Maybe it has called exit() prematurely? error when it's the client that has prematurely disconnected.

acasademont avatar Dec 27 '19 20:12 acasademont

I ended up doing some tcpdumps to see why ab was complaining about the connection being reset, and the problem seems to be some TCP RST packets emitted by the server that ab isn't able to deal with. This is out of scope for php-pm; maybe it's related to reactphp, maybe to PHP streams, I can't really tell :(

A simple reactphp server:

<?php

include __DIR__.'/vendor/autoload.php';

$loop = React\EventLoop\Factory::create();

$server = new React\Http\Server(function (Psr\Http\Message\ServerRequestInterface $request) {
    return new React\Http\Response(
        200,
        array('Content-Type' => 'text/plain'),
        "Hello World!\n"
    );
});

$socket = new React\Socket\Server(8080, $loop);
$server->listen($socket);

echo "Server running at http://127.0.0.1:8080\n";

$loop->run();

and running ab -n 100 -c 100 http://127.0.0.1:8080/

will yield errors most of the time.

For comparison, I created a Node.js server:

const http = require('http')

const requestHandler = (request, response) => {
  response.end('Hello World!')
}

const server = http.createServer(requestHandler)

server.listen(8080, (err) => {
  if (err) {
    return console.log('something bad happened', err)
  }

  console.log('Server running at http://127.0.0.1:8080')
})

and it dealt with that concurrency level just fine.

acasademont avatar Dec 27 '19 21:12 acasademont

Attaching the tcpdump captures for both reactphp and node, in case that helps.

tcpdump.zip

acasademont avatar Dec 27 '19 21:12 acasademont

I also tried another benchmark tool written in Go, called hey (https://github.com/rakyll/hey). It doesn't fail miserably on the first connection reset like ab does, but it still fails quite often compared to the nodejs server (it's also much slower, but that's another issue I guess :p)

acasademont avatar Dec 27 '19 22:12 acasademont

@acasademont I've also been playing around with https://k6.io/ lately. It's a different approach, but it might be interesting to check out as well, since you can use it as a unit test to make sure everything works as intended under different levels of load.

WyriHaximus avatar Dec 28 '19 10:12 WyriHaximus

@WyriHaximus it seems the problem is a very low default TCP backlog value on PHP streams (32). php-pm should probably set a specific (and configurable) parameter for that... After increasing it to 128, everything works fine!
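For reference, this is roughly how the backlog can be raised on the plain ReactPHP repro from above (just a sketch; the exact option php-pm ends up exposing for this may look different):

<?php
// Same minimal ReactPHP server as in the earlier repro, but with a larger
// listen backlog passed via the TCP socket context options; the 'backlog'
// option is handed to stream_socket_server(), whose PHP default is 32.

include __DIR__ . '/vendor/autoload.php';

$loop = React\EventLoop\Factory::create();

$server = new React\Http\Server(function (Psr\Http\Message\ServerRequestInterface $request) {
    return new React\Http\Response(
        200,
        array('Content-Type' => 'text/plain'),
        "Hello World!\n"
    );
});

$socket = new React\Socket\Server('127.0.0.1:8080', $loop, array(
    'tcp' => array(
        'backlog' => 128, // up from the PHP streams default of 32
    ),
));
$server->listen($socket);

echo "Server running at http://127.0.0.1:8080\n";

$loop->run();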

acasademont avatar Dec 28 '19 10:12 acasademont

@acasademont awesome!

WyriHaximus avatar Dec 28 '19 10:12 WyriHaximus

See #488 for the fix. Confirmation that this is solved in HEAD would be great.

andig avatar Jan 02 '20 21:01 andig