
Performance doesn't scale with more cores

Open AqlaSolutions opened this issue 1 year ago • 12 comments

We've run several benchmarks on AWS machines with 8, 16, 32, and 48 cores and got these results:

- 8 -> 16 cores: almost +100% RPS
- 16 -> 32 cores: +20% RPS
- 32 -> 48 cores: +1% RPS

Take a look at these results:

16 cores: [screenshots]

32 cores: [screenshots]

48 cores: [screenshots]

The benchmark runs up to 256 parallel requests.

The profiler shows that most of the work is done by the thread pool's WorkerThreadStart method, inside its loop where it waits for tasks and calls Semaphore.Wait. [screenshot]

We tried different configurations with 1-2 clients and 1-2 servers, changing the parallel request count and the dispatcher throughput, but nothing showed any significant improvement.
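For context, the dispatcher change was along these lines. This is a simplified sketch rather than our actual benchmark code, and it assumes Proto.Actor's ThreadPoolDispatcher exposes a settable Throughput; exact type and member names may differ between versions.

```csharp
// Simplified sketch (not the actual benchmark code) of tuning dispatcher
// throughput per actor. Assumes Proto.Actor's ThreadPoolDispatcher with a
// Throughput property; exact names may differ by Proto.Actor version.
using System.Threading.Tasks;
using Proto;
using Proto.Mailbox;

var system = new ActorSystem();

// Throughput controls how many messages a mailbox run processes before
// yielding the worker thread back to the scheduler.
var dispatcher = new ThreadPoolDispatcher { Throughput = 1000 };

var props = Props.FromFunc(ctx =>
    {
        if (ctx.Sender is not null)
            ctx.Respond(ctx.Message!); // trivial echo to keep the sketch self-contained
        return Task.CompletedTask;
    })
    .WithDispatcher(dispatcher);

var pid = system.Root.Spawn(props);
```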

What could be causing this?

We'll try to prepare a reproducible example if you are willing to investigate this.

AqlaSolutions avatar Apr 27 '23 13:04 AqlaSolutions

Yes, we need some code examples of what you are doing here. I see in the benchmark that you are using Proto.Remote. It could very well be that you are maxing out your network: if the network can only push X messages per second, you are not going to benefit from more cores.

But please do post an example of what you are actually doing here; otherwise it is only guesswork.

rogeralsing avatar Apr 28 '23 04:04 rogeralsing

We have a similar issue even without Remote.

AqlaSolutions avatar Apr 28 '23 08:04 AqlaSolutions

OK, then we need a code example to reproduce this.

rogeralsing avatar Apr 28 '23 08:04 rogeralsing

Sorry, we are still going to provide an example; it just takes some time to get approval from the company.

AqlaSolutions avatar May 07 '23 18:05 AqlaSolutions

The source code to reproduce the problem is in the attached performance-repro.zip.

Pushcin avatar May 16 '23 06:05 Pushcin

To run from IDE:

  1. Open solution BEP.sln
  2. Run docker-start-dev.cmd
  3. Run project benchmarks\PrototypeBenchmark

To run remotely:

  1. Install Docker on the remote machine and Docker Desktop on the local one
  2. Run: ssh -L 2378:127.0.0.1:2375 [email protected]
  3. In another session (Git Bash), run: export DOCKER_HOST=tcp://127.0.0.1:2378 and then ./docker-start-staging.cmd

@rogeralsing please reopen

AqlaSolutions avatar May 16 '23 06:05 AqlaSolutions

I'm running the example right now, and the first thing that comes to mind is that you are probably queueing up a lot of fire-and-forget tasks on the thread pool.

.5987 RPS, 99% latency 17,61 ms, 95% latency 9,39 ms, max latency 167,61 ms
...60692 RPS, 99% latency 15,4 ms, 95% latency 6,51 ms, max latency 610,77 ms
...44698 RPS, 99% latency 20,9 ms, 95% latency 9,33 ms, max latency 745,69 ms
..35911 RPS, 99% latency 28,62 ms, 95% latency 11,54 ms, max latency 725,73 ms
.27488 RPS, 99% latency 33,26 ms, 95% latency 15,39 ms, max latency 999,47 ms
..31520 RPS, 99% latency 22,41 ms, 95% latency 11,55 ms, max latency 975,2 ms
.19651 RPS, 99% latency 39,24 ms, 95% latency 20,35 ms, max latency 1050,25 ms
.19856 RPS, 99% latency 39,76 ms, 95% latency 17,88 ms, max latency 1366,85 ms

The increasing latency might be because the thread pool is busy with other tasks, e.g.

omsGrain.ProccedExecutionReport(omsRequest, CancellationToken.None).AndForget(TaskOption.Safe);

Eventually, the entire thread pool queue might be filled with tasks like this.
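To illustrate what I mean with a plain .NET sketch (this is not the repro code, just the general pattern): if each request schedules extra fire-and-forget work, the pool queue grows faster than it drains, and latency climbs even though the cores stay busy.

```csharp
// Plain .NET sketch (not the repro code): each "request" queues extra
// fire-and-forget work, so the thread pool backlog keeps growing and
// observed latency increases while CPU usage stays high.
using System;
using System.Threading;
using System.Threading.Tasks;

for (var i = 0; i < 100_000; i++)
{
    // the "real" request
    _ = Task.Run(() => Thread.SpinWait(10_000));

    // the fire-and-forget follow-up, similar in spirit to AndForget(...)
    _ = Task.Run(async () =>
    {
        await Task.Delay(1);        // simulate some async I/O
        Thread.SpinWait(10_000);    // plus some CPU work
    });
}

// PendingWorkItemCount (available since .NET Core 3.0) shows the backlog.
Console.WriteLine($"Queued work items: {ThreadPool.PendingWorkItemCount}");
```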

I'll dig deeper later today, but the increasing latency is very suspicious.

rogeralsing avatar May 18 '23 08:05 rogeralsing

That's pretty high latency. Are you running in the Debug configuration or with a debugger attached? I wouldn't rely on these latency numbers. As you can see in the screenshots above, with all optimizations and 16+ cores our latency doesn't change that much.

AqlaSolutions avatar May 18 '23 08:05 AqlaSolutions

In this repro, no additional executions are added to the list in ObActor.ExecuteOrder, so OmsActor shouldn't make any fire-and-forget calls, because the single returned ExecutionReport belongs to that OmsActor instance. So I'm surprised that you see such calls. Those calls are present in our real app, though.

AqlaSolutions avatar May 18 '23 08:05 AqlaSolutions

There seems to be a lot of locking going on in this example. I saw some use of SemaphoreSlim and .Wait(), but I haven't analyzed the impact of that specifically. Looking at the profiler results, though, something in this example is explicitly blocking threads in the thread pool.
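For comparison, here is a generic sketch (not code from the repro) of the difference I mean between a blocking and an asynchronous semaphore wait on thread pool threads:

```csharp
// Generic sketch (not code from the repro): blocking vs. async semaphore
// waits when the calling code runs on thread pool threads.
using System.Threading;
using System.Threading.Tasks;

var gate = new SemaphoreSlim(initialCount: 8);

// Blocking: the pool worker sits parked inside Wait() and cannot run other
// queued work. Enough of these and the pool has to inject new threads,
// which shows up in the profiler as WorkerThreadStart + semaphore waits.
void HandleBlocking()
{
    gate.Wait();
    try { /* do work */ }
    finally { gate.Release(); }
}

// Non-blocking: the worker thread is returned to the pool while waiting,
// so other mailbox runs and tasks can proceed in the meantime.
async Task HandleAsync()
{
    await gate.WaitAsync();
    try { /* do work */ }
    finally { gate.Release(); }
}

HandleBlocking();
await HandleAsync();
```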

[profiler screenshots]

rogeralsing avatar May 20 '23 08:05 rogeralsing

@rogeralsing, we use the Semaphore to limit the number of concurrent requests. The wait time is expected to be however long the system needs to process a request and free its "slot". It's not a problem at all: it's only one thread, and it doesn't even belong to the thread pool.
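Roughly, the throttling pattern looks like this. This is a simplified sketch, not the exact repro code, and SendRequestAsync is just a placeholder for the real call into the actor system.

```csharp
// Simplified sketch of the throttling pattern (not the exact repro code):
// a single dedicated, non-thread-pool thread issues requests, and the
// semaphore only caps how many requests are in flight at once.
using System.Threading;
using System.Threading.Tasks;

var inFlight = new SemaphoreSlim(256);   // max parallel requests

var driver = new Thread(() =>
{
    while (true)
    {
        inFlight.Wait();                 // blocks this dedicated thread only
        _ = SendRequestAsync()           // placeholder for the real request
            .ContinueWith(_ => inFlight.Release());
    }
})
{ IsBackground = true };

driver.Start();

// Placeholder for the actual request to the actor system.
static Task SendRequestAsync() => Task.Delay(1);
```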

We have already profiled this, and I've already seen what's in those screenshots. For example, WorkerThreadStart is not a new thread starting up but a loop that picks up tasks from the thread pool queue. By the way, this method also uses a Semaphore.

AqlaSolutions avatar May 21 '23 07:05 AqlaSolutions

@rogeralsing my guess is that too much GC is going on in generations 0 and 1. The garbage is produced by tasks and async state machines. Unlike gen 2 collections, these are always stop-the-world. It looks like at some point of vertical scaling the GC time grows faster than the extra load that can be processed, so we see no improvement from adding cores.
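A quick way to check that hypothesis would be something like this generic .NET sketch (RunBenchmarkIteration is a placeholder for one pass of the load test):

```csharp
// Generic sketch for checking the GC hypothesis: sample collection counts
// around a benchmark run and confirm Server GC is actually enabled.
using System;
using System.Runtime;

Console.WriteLine($"Server GC: {GCSettings.IsServerGC}");
Console.WriteLine($"Latency mode: {GCSettings.LatencyMode}");

var gen0Before = GC.CollectionCount(0);
var gen1Before = GC.CollectionCount(1);
var gen2Before = GC.CollectionCount(2);

RunBenchmarkIteration();   // placeholder: one pass of the load test

Console.WriteLine($"Gen0: {GC.CollectionCount(0) - gen0Before}, " +
                  $"Gen1: {GC.CollectionCount(1) - gen1Before}, " +
                  $"Gen2: {GC.CollectionCount(2) - gen2Before}");

static void RunBenchmarkIteration() { /* run the benchmark here */ }
```

Monitoring the process with dotnet-counters and the System.Runtime counters while the benchmark runs would give the same numbers plus the time spent in GC.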

AqlaSolutions avatar May 21 '23 07:05 AqlaSolutions