volley
Benchmarking with a large number of clients
I wonder how the goroutine/thread benchmarks compare when there are a very large number of clients. It's reasonable for a production server to handle anywhere from 2,000 to 20,000 clients at once. When the number of clients is extremely large, especially in the 20,000 range, how do the various benchmarks compare?
I'd love to benchmark this, but the problem is that I don't have thousands of cores lying around. At some point, all the cores dedicated to the clients will be saturated, at which point adding more clients won't actually make a difference (if anything, performance might decrease due to context switching). I also can't move to using multiple machines, because these rates often cannot be sustained over anything but loopback (mostly because the kernel can't saturate the NICs).
I suppose one option is that the client becomes non-blocking and each client sends on multiple open connections. My goal would be to see if there is a use case where lighter-weight goroutines match up better against threads (maybe there isn't).
I'm not sure making the client non-blocking would actually make a difference. The packets would still need to be enqueued, so you'll still incur the performance overhead of a syscall per send and per receive, and those syscalls are where the clients are spending a majority of their time anyway.
I completely agree that this would be interesting for many reasons -- benchmarking thread-per-client vs. a worker pool when there are many clients, for example -- but I don't know how to do that in a way that will yield relevant results. My research group might be getting a 144-core machine, and with that I should be able to push the number of clients slightly higher, but unless you've got a spare 2000-core server lying around, getting higher numbers will be challenging.
This would be a really nice test too: the first tests you built measure raw network speed, while tests with a massive number of clients can exercise the non-blocking I/O features.
You don't need one thread per connection; this is why Rust has mio, C has epoll, and Go has this feature natively.
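For the Go case specifically, the runtime's network poller is built on epoll on Linux, so even a plain goroutine-per-connection server ends up multiplexing every connection over a small set of OS threads. A minimal sketch to illustrate the idea (a toy echo server, not the actual volley protocol or server):

```go
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", ":8080") // port chosen arbitrarily for the sketch
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Print(err)
			continue
		}
		// One goroutine per connection. The Go runtime parks blocked goroutines
		// on its epoll-based poller instead of dedicating an OS thread to each,
		// so this stays cheap even with tens of thousands of connections.
		go func(c net.Conn) {
			defer c.Close()
			io.Copy(c, c) // echo bytes back until the client closes
		}(conn)
	}
}
```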
Again, I'm not sure non-blocking I/O will actually make a difference here. At some point, all the client cores will be saturated, and the performance cannot increase regardless of how many clients you stripe across those cores.
My perspective would not be about gaining a performance increase, but about measuring which language suffers the smallest performance decrease.
Ah, I see. Yes, that could work. I suspect what we'll see is that any solution using epoll will have a fairly negligible performance decrease, whereas the ones using a thread-per-client will drop drastically due to scheduling overhead. Worker pools will likely be somewhere in between.
I'll try to run experiments with 2000 and 20000 clients tomorrow.
@cep21 so how about I instead run a benchmark with log(number of clients) on the x-axis and performance on the y-axis, and then plot a separate graph for, say, 1, 10, 20, and 40 cores? That way, you'll see the trend clearly, and also how much the number of cores affects it. I could run that with 40 clients (one per client CPU), then 80, 160, 320, 640, 1280, 2560, 5120, 10240, and 20480, for example?
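As a rough sketch of how that sweep could be driven (the runBenchmark call is hypothetical, not part of the existing harness):

```go
package main

import "fmt"

func main() {
	serverCores := []int{1, 10, 20, 40}
	for _, cores := range serverCores {
		// Client counts double from 40 (one per client CPU) up to 20480,
		// which gives evenly spaced points on a log-scale x-axis.
		for clients := 40; clients <= 20480; clients *= 2 {
			fmt.Printf("would run: %d server cores, %d clients\n", cores, clients)
			// runBenchmark(cores, clients) // hypothetical harness call
		}
	}
}
```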
That seems reasonable.
:+1: I'll run the benchmark as soon as the 80-core machine is available, and then post the results here.
Hmm, that's strange. With a large number of clients, I'm seeing many ETIMEDOUT and ECONNRESET errors from the clients, so the test never actually completes correctly. It seems like the server is overwhelmed trying to accept connections. The way to fix this is probably to have multiple threads accepting connections, but that requires non-trivial changes to every server. Alternatively, I could have the clients retry connecting to the server indefinitely, though that becomes problematic if the server crashes... I'm open to suggestions.
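For the retry option, I'm picturing something like this on the client side (a sketch only; the address, timeout, and backoff values are placeholders):

```go
package main

import (
	"log"
	"net"
	"time"
)

// dialWithRetry retries the connection with a growing backoff instead of
// giving up on the first ETIMEDOUT/ECONNRESET. Capping the number of
// attempts avoids spinning forever if the server has actually crashed.
func dialWithRetry(addr string, attempts int) (net.Conn, error) {
	var err error
	for i := 0; i < attempts; i++ {
		var c net.Conn
		c, err = net.DialTimeout("tcp", addr, 5*time.Second)
		if err == nil {
			return c, nil
		}
		time.Sleep(time.Duration(i+1) * 100 * time.Millisecond)
	}
	return nil, err
}

func main() {
	conn, err := dialWithRetry("127.0.0.1:8080", 20)
	if err != nil {
		log.Fatal(err)
	}
	conn.Close()
}
```

Capping the number of attempts would at least bound how long the clients hang around if the server really has died.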
Do all servers equally misbehave? I'm curious where that breaking point is.
Another option is to increase the timeout on the client side?
Interestingly, no, it only seems to be happening for c-threaded. I wonder why -- it doesn't do anything special compared to, say, the Rust implementation, as far as I'm aware.
Well, it's kind of weird, because the default timeout in Linux for connect is 20 seconds, and I'm seeing connections time out long before 20 seconds have passed. This suggests that something else is going on, but I'm not quite sure what...
Ugh, for many of the tests, the majority of the testing time is now spent simply waiting for all the client threads to connect :confused: Since accept is only called from a single thread, this takes forever. I think we'll have to implement multi-threaded accept for all the servers before it is even feasible to run this benchmark with > 10000 clients.
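For the Go server this should be straightforward, since a net.Listener is safe to use from multiple goroutines concurrently; something along these lines (a sketch, not the actual volley server):

```go
package main

import (
	"io"
	"log"
	"net"
	"runtime"
	"sync"
)

func main() {
	ln, err := net.Listen("tcp", ":8080") // port chosen arbitrarily for the sketch
	if err != nil {
		log.Fatal(err)
	}
	var wg sync.WaitGroup
	// One accept loop per core, so the listen backlog is drained in parallel
	// rather than bottlenecking on a single accepting thread.
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				conn, err := ln.Accept()
				if err != nil {
					return
				}
				go func(c net.Conn) {
					defer c.Close()
					io.Copy(c, c) // placeholder for the real per-connection work
				}(conn)
			}
		}()
	}
	wg.Wait()
}
```

The C and Rust servers would need the equivalent change by hand (multiple threads calling accept on the same listening socket), which is the non-trivial part.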
Results for up to 5k clients with all implementations except c-threaded:
(box titles denote number of cores)