protoactor-dotnet
Performance doesn't scale with more cores
We've performed several benchmarks on AWS instances with 8, 16, 32, and 48 cores and discovered these results:
- 8 -> 16 cores: almost +100% RPS
- 16 -> 32 cores: +20% RPS
- 32 -> 48 cores: +1% RPS
Take a look at these results. (Screenshots of the 16-, 32-, and 48-core runs were attached here.)
The benchmark runs up to 256 parallel requests.
The profiler shows that most of the work is done by the thread pool's WorkerThreadStart method, inside its loop where it waits for tasks and calls Semaphore.Wait.
We tried running different configurations with 1-2 clients and 1-2 servers, varying the parallel request count and the dispatcher throughput, but nothing showed any significant improvement.
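For reference, the dispatcher tuning looked roughly like this; a minimal sketch assuming Proto.Actor's ThreadPoolDispatcher and Props.WithDispatcher APIs, with a placeholder actor instead of our real ones:

```csharp
using System.Threading.Tasks;
using Proto;
using Proto.Mailbox;

// Placeholder actor standing in for the actors in the repro.
class BenchActor : IActor
{
    public Task ReceiveAsync(IContext context) => Task.CompletedTask;
}

class Program
{
    static void Main()
    {
        var system = new ActorSystem();

        // Throughput controls how many messages a mailbox run processes
        // before the dispatcher yields the thread back to the pool.
        // We varied this value without seeing a significant difference.
        var props = Props
            .FromProducer(() => new BenchActor())
            .WithDispatcher(new ThreadPoolDispatcher { Throughput = 300 });

        var pid = system.Root.Spawn(props);
    }
}
```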
What can cause this?
We'll try to prepare a reproducible example if you are willing to investigate this.
Yes, we need some code examples for what you are doing here. I see in the benchmark that you are using Proto.Remote. It could very well be that you are maxing out your network: if the network can only push x messages per second, you are not going to benefit from more cores.
But please do post some examples of what you are actually doing here, as it is only guesswork otherwise.
We have a similar issue even without Remote.
OK, then we need a code example to reproduce this.
Sorry, we still intend to provide an example; it just takes some time to get approval from the company.
The source code to reproduce the problem is in the attached performance-repro.zip.
To run from IDE:
- Open solution BEP.sln
- Run docker-start-dev.cmd
- Run project benchmarks\PrototypeBenchmark
To run remotely:
- Install Docker on the remote machine and Docker Desktop on the local one
- Forward the remote Docker daemon port: ssh -L 2378:127.0.0.1:2375 [email protected]
- In another session (Git Bash):
  export DOCKER_HOST=tcp://127.0.0.1:2378
  ./docker-start-staging.cmd
@rogeralsing please reopen
I'm running the example right now, and the first thing that comes to mind is that you are probably queueing up a lot of fire-and-forget tasks on the thread pool:
.5987 RPS, 99% latency 17,61 ms, 95% latency 9,39 ms, max latency 167,61 ms
...60692 RPS, 99% latency 15,4 ms, 95% latency 6,51 ms, max latency 610,77 ms
...44698 RPS, 99% latency 20,9 ms, 95% latency 9,33 ms, max latency 745,69 ms
..35911 RPS, 99% latency 28,62 ms, 95% latency 11,54 ms, max latency 725,73 ms
.27488 RPS, 99% latency 33,26 ms, 95% latency 15,39 ms, max latency 999,47 ms
..31520 RPS, 99% latency 22,41 ms, 95% latency 11,55 ms, max latency 975,2 ms
.19651 RPS, 99% latency 39,24 ms, 95% latency 20,35 ms, max latency 1050,25 ms
.19856 RPS, 99% latency 39,76 ms, 95% latency 17,88 ms, max latency 1366,85 ms
The increasing latency might be because the thread pool is busy with other tasks, e.g.
omsGrain.ProccedExecutionReport(omsRequest, CancellationToken.None).AndForget(TaskOption.Safe);
Eventually, the entire thread pool queue might be filled with tasks of this kind.
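To illustrate the mechanism (a self-contained sketch, not code from the repro): every un-awaited call like that becomes a work item on the shared thread pool, and once enough of them are queued, even a trivial awaited operation has to wait its turn:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class ThreadPoolFloodDemo
{
    static async Task Main()
    {
        // Simulate many fire-and-forget work items, the way un-awaited
        // grain calls would queue continuations on the ThreadPool.
        for (var i = 0; i < 100_000; i++)
        {
            _ = Task.Run(() => Thread.SpinWait(10_000)); // fire and forget
        }

        // Measure how long a single "real" awaited work item waits in the queue.
        var sw = Stopwatch.StartNew();
        await Task.Run(() => { });
        Console.WriteLine($"Queue delay for one awaited work item: {sw.ElapsedMilliseconds} ms");
    }
}
```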
I'll dig deeper later today, but the increasing latency is very suspicious.
That's a pretty big latency. Are you running the Debug configuration, or with a debugger attached? I wouldn't rely on these latency numbers. As you can see in the screenshots above, with all optimizations and 16+ cores we don't see latency change that much.
In this repro no additional executions are added to the list in ObActor.ExecuteOrder, so OmsActor shouldn't do any fire-and-forget calls, because the single returned ExecutionReport belongs to this OmsActor instance. So I'm surprised that you see such calls. In our real app, though, those calls are present.
There seems to be a lot of locking going on in this example. I saw some use of SemaphoreSlim and .Wait(), but I haven't analyzed the impact of that specifically. Looking at the profiler results, though, something in this example is explicitly blocking threads in the thread pool.
@rogeralsing, we use the Semaphore to limit the number of concurrent requests. The waiting time should be as long as it takes for the system to process a request and free its "slot". It's not a problem at all: it's only one thread, and it doesn't even belong to the thread pool.
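To be concrete, the limiter has roughly this shape; names and numbers are placeholders for illustration, not the actual benchmark code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class RequestThrottler
{
    // Limits the number of in-flight requests, matching the benchmark's
    // "up to 256 parallel requests". Placeholder values, not the real code.
    private static readonly SemaphoreSlim InFlight = new SemaphoreSlim(256);

    static void Main()
    {
        // A single dedicated (non-threadpool) submitter thread, as described above:
        // it blocks on the semaphore, so no ThreadPool worker is ever parked here.
        var submitter = new Thread(() =>
        {
            while (true)
            {
                InFlight.Wait();        // blocks only this dedicated thread
                _ = SendRequestAsync(); // the request itself runs asynchronously
            }
        })
        { IsBackground = true };
        submitter.Start();

        Console.ReadLine();
    }

    private static async Task SendRequestAsync()
    {
        try
        {
            await Task.Delay(5); // placeholder for the actual request
        }
        finally
        {
            InFlight.Release();  // frees a slot when the response arrives
        }
    }
}
```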
We have already profiled this, and I have already seen what's in the screenshots. For example, WorkerThreadStart is not a new thread starting up but a loop that picks up tasks from the thread pool queue. By the way, this method also uses a Semaphore.
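One way to separate "workers idling in that loop" from "work piling up in the queue" is to sample the built-in thread pool counters during a run (available since .NET Core 3.0); a sketch, not part of the repro:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ThreadPoolMonitor
{
    // Periodically prints ThreadPool statistics so queue growth (a sign of
    // starvation) can be told apart from workers simply waiting for work.
    static async Task Main()
    {
        while (true)
        {
            Console.WriteLine(
                $"threads={ThreadPool.ThreadCount,4} " +
                $"queued={ThreadPool.PendingWorkItemCount,8} " +
                $"completed={ThreadPool.CompletedWorkItemCount,12}");
            await Task.Delay(1000);
        }
    }
}
```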
@rogeralsing my guess is that too much GC is going on in generations 0 and 1. The garbage is produced by tasks and async state machines. Unlike gen 2 collections, these are always stop-the-world. It looks like at some point of vertical scaling the GC time grows faster than the extra load that can be processed, so we see no improvement from adding cores.
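A quick way to test this hypothesis would be to snapshot allocation and collection counts around a run and compare across core counts; if gen 0/1 collections and the allocation rate keep climbing while RPS flattens, GC pressure is a plausible culprit. A sketch using standard .NET APIs, not code from the repro:

```csharp
using System;
using System.Runtime;

class GcProbe
{
    // Call Snapshot() before and after a benchmark run and diff the results.
    public record GcStats(long AllocatedBytes, int Gen0, int Gen1, int Gen2);

    public static GcStats Snapshot() => new GcStats(
        GC.GetTotalAllocatedBytes(precise: false),
        GC.CollectionCount(0),
        GC.CollectionCount(1),
        GC.CollectionCount(2));

    static void Main()
    {
        Console.WriteLine($"Server GC: {GCSettings.IsServerGC}, latency mode: {GCSettings.LatencyMode}");

        var before = Snapshot();
        // ... run the benchmark here ...
        var after = Snapshot();

        Console.WriteLine(
            $"allocated={(after.AllocatedBytes - before.AllocatedBytes) / (1024 * 1024)} MB, " +
            $"gen0={after.Gen0 - before.Gen0}, " +
            $"gen1={after.Gen1 - before.Gen1}, " +
            $"gen2={after.Gen2 - before.Gen2}");
    }
}
```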