protoactor-dotnet
Performance doesn't scale with more cores
We've performed several benchmarks on AWS instances with 8, 16, 32, and 48 cores and discovered these results:
- 8 -> 16 cores: almost +100% RPS
- 16 -> 32 cores: +20% RPS
- 32 -> 48 cores: +1% RPS
Take a look at these results. (Screenshots of the 16-, 32-, and 48-core runs were attached here.)
The benchmark runs up to 256 parallel requests.
The profiler shows that most of the work is done by the thread pool's WorkerThreadStart method, inside its loop where it waits for tasks and calls Semaphore.Wait.
We tried running different configurations with 1-2 clients and 1-2 servers, varying the parallel request count and the dispatcher throughput, but nothing showed any significant improvement.
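For reference, the dispatcher tuning looked roughly like this; a minimal sketch assuming Proto.Actor's ThreadPoolDispatcher and Props.WithDispatcher APIs, with a placeholder actor instead of our real ones:

```csharp
using System.Threading.Tasks;
using Proto;
using Proto.Mailbox;

// Placeholder actor standing in for the actors in the repro.
class BenchActor : IActor
{
    public Task ReceiveAsync(IContext context) => Task.CompletedTask;
}

class Program
{
    static void Main()
    {
        var system = new ActorSystem();

        // Throughput controls how many messages a mailbox run processes
        // before the dispatcher yields the thread back to the pool.
        // We varied this value without seeing a significant difference.
        var props = Props
            .FromProducer(() => new BenchActor())
            .WithDispatcher(new ThreadPoolDispatcher { Throughput = 300 });

        var pid = system.Root.Spawn(props);
    }
}
```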
What can cause this?
We'll try to prepare a reproducible example if you are willing to investigate this.
Yes, we need some code examples for what you are doing here. I see in the benchmark that you are using Proto.Remote. It could very well be that you are maxing out your network: if the network can only push x messages per second, you are not going to benefit from more cores.
But please do post some examples of what you are actually doing here, as it is only guesswork otherwise.
We have a similar issue even without Remote.
OK, then we need a code example to reproduce this.
Sorry, we still intend to provide an example; it just takes some time to get approval from the company.
The source code to reproduce the problem is in the attached performance-repro.zip.
To run from IDE:
- Open solution BEP.sln
- Run docker-start-dev.cmd
- Run project benchmarks\PrototypeBenchmark
To run remotely:
- Install Docker on the remote machine and Docker Desktop on the local one
- Forward the remote Docker daemon port: ssh -L 2378:127.0.0.1:2375 [email protected]
- In another session (Git Bash):
  export DOCKER_HOST=tcp://127.0.0.1:2378
  ./docker-start-staging.cmd
@rogeralsing please reopen
I'm running the example right now, and the first thing that comes to mind is that you are probably queueing up a lot of fire-and-forget tasks on the thread pool:
.5987 RPS, 99% latency 17,61 ms, 95% latency 9,39 ms, max latency 167,61 ms
...60692 RPS, 99% latency 15,4 ms, 95% latency 6,51 ms, max latency 610,77 ms
...44698 RPS, 99% latency 20,9 ms, 95% latency 9,33 ms, max latency 745,69 ms
..35911 RPS, 99% latency 28,62 ms, 95% latency 11,54 ms, max latency 725,73 ms
.27488 RPS, 99% latency 33,26 ms, 95% latency 15,39 ms, max latency 999,47 ms
..31520 RPS, 99% latency 22,41 ms, 95% latency 11,55 ms, max latency 975,2 ms
.19651 RPS, 99% latency 39,24 ms, 95% latency 20,35 ms, max latency 1050,25 ms
.19856 RPS, 99% latency 39,76 ms, 95% latency 17,88 ms, max latency 1366,85 ms
The increasing latency might be because the thread pool is busy with other tasks, e.g.
omsGrain.ProccedExecutionReport(omsRequest, CancellationToken.None).AndForget(TaskOption.Safe);
Eventually, the entire thread pool queue might be filled with tasks of this kind.
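To illustrate the mechanism (a self-contained sketch, not code from the repro): every un-awaited call like that becomes a work item on the shared thread pool, and once enough of them are queued, even a trivial awaited operation has to wait its turn:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class ThreadPoolFloodDemo
{
    static async Task Main()
    {
        // Simulate many fire-and-forget work items, the way un-awaited
        // grain calls would queue continuations on the ThreadPool.
        for (var i = 0; i < 100_000; i++)
        {
            _ = Task.Run(() => Thread.SpinWait(10_000)); // fire and forget
        }

        // Measure how long a single "real" awaited work item waits in the queue.
        var sw = Stopwatch.StartNew();
        await Task.Run(() => { });
        Console.WriteLine($"Queue delay for one awaited work item: {sw.ElapsedMilliseconds} ms");
    }
}
```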
I'll dig deeper later today, but the increasing latency is very suspicious.
That's a pretty big latency. Are you running the Debug configuration, or with a debugger attached? I wouldn't rely on these latency numbers. As you can see in the screenshots above, with all optimizations and 16+ cores we don't see latency change that much.
In this repro no additional executions are added to the list in ObActor.ExecuteOrder, so OmsActor shouldn't do any fire-and-forget calls, because the single returned ExecutionReport belongs to this OmsActor instance. So I'm surprised that you see such calls. In our real app, though, those calls are present.
There seems to be a lot of locking going on in this example. I saw some use of SemaphoreSlim and .Wait(), but I haven't analyzed the impact of that specifically. Looking at the profiler results, though, something in this example is explicitly blocking threads in the thread pool.
@rogeralsing, we use the Semaphore to limit the number of concurrent requests. The waiting time should be as long as it takes for the system to process a request and free its "slot". It's not a problem at all: it's only one thread, and it doesn't even belong to the thread pool.
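To be concrete, the limiter has roughly this shape; names and numbers are placeholders for illustration, not the actual benchmark code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class RequestThrottler
{
    // Limits the number of in-flight requests, matching the benchmark's
    // "up to 256 parallel requests". Placeholder values, not the real code.
    private static readonly SemaphoreSlim InFlight = new SemaphoreSlim(256);

    static void Main()
    {
        // A single dedicated (non-threadpool) submitter thread, as described above:
        // it blocks on the semaphore, so no ThreadPool worker is ever parked here.
        var submitter = new Thread(() =>
        {
            while (true)
            {
                InFlight.Wait();        // blocks only this dedicated thread
                _ = SendRequestAsync(); // the request itself runs asynchronously
            }
        })
        { IsBackground = true };
        submitter.Start();

        Console.ReadLine();
    }

    private static async Task SendRequestAsync()
    {
        try
        {
            await Task.Delay(5); // placeholder for the actual request
        }
        finally
        {
            InFlight.Release();  // frees a slot when the response arrives
        }
    }
}
```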
We have already profiled this, and I have already seen what's in the screenshots. For example, WorkerThreadStart is not a new thread starting up but a loop that picks up tasks from the thread pool queue. By the way, this method also uses a Semaphore.
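One way to separate "workers idling in that loop" from "work piling up in the queue" is to sample the built-in thread pool counters during a run (available since .NET Core 3.0); a sketch, not part of the repro:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ThreadPoolMonitor
{
    // Periodically prints ThreadPool statistics so queue growth (a sign of
    // starvation) can be told apart from workers simply waiting for work.
    static async Task Main()
    {
        while (true)
        {
            Console.WriteLine(
                $"threads={ThreadPool.ThreadCount,4} " +
                $"queued={ThreadPool.PendingWorkItemCount,8} " +
                $"completed={ThreadPool.CompletedWorkItemCount,12}");
            await Task.Delay(1000);
        }
    }
}
```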
@rogeralsing my guess is that too much GC is going on in generations 0 and 1. The garbage is produced by tasks and async state machines. Unlike gen 2 collections, these are always stop-the-world. It looks like at some point of vertical scaling the GC time grows faster than the extra load that can be processed, so we see no improvement from adding cores.
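A quick way to test this hypothesis would be to snapshot allocation and collection counts around a run and compare across core counts; if gen 0/1 collections and the allocation rate keep climbing while RPS flattens, GC pressure is a plausible culprit. A sketch using standard .NET APIs, not code from the repro:

```csharp
using System;
using System.Runtime;

class GcProbe
{
    // Call Snapshot() before and after a benchmark run and diff the results.
    public record GcStats(long AllocatedBytes, int Gen0, int Gen1, int Gen2);

    public static GcStats Snapshot() => new GcStats(
        GC.GetTotalAllocatedBytes(precise: false),
        GC.CollectionCount(0),
        GC.CollectionCount(1),
        GC.CollectionCount(2));

    static void Main()
    {
        Console.WriteLine($"Server GC: {GCSettings.IsServerGC}, latency mode: {GCSettings.LatencyMode}");

        var before = Snapshot();
        // ... run the benchmark here ...
        var after = Snapshot();

        Console.WriteLine(
            $"allocated={(after.AllocatedBytes - before.AllocatedBytes) / (1024 * 1024)} MB, " +
            $"gen0={after.Gen0 - before.Gen0}, " +
            $"gen1={after.Gen1 - before.Gen1}, " +
            $"gen2={after.Gen2 - before.Gen2}");
    }
}
```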