
Benchmark parameters for Citrine runs

antonfirsov opened this issue 4 years ago • 15 comments

Let's define the parameters we want to use for extensive Citrine runs (or at least for the first one).

The machines are mine for Friday (big thanks to Sebastien!), but my naive approach of trying all combinations of the major parameters defines too many jobs, even for a whole-day run. Are some of these combinations worthless to include even in a comprehensive analysis?

I'm using the syntax of my new tool for #74 to define the benchmarks.

DefaultTransport to get a baseline:

e=DefaultTransport
t=4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30

LinuxTransport for our information:

e=LinuxTransport
i=true
t=4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30

epoll combinations

Normally I would run epoll with all possible combinations of the important parameters, but the following definition would mean ~500 benchmark executions. It would be nice to reduce it.

e=epoll
s=false
r=false
w=false
c=true,false
i=false,true
a=false,true
o=inline,iothread,ioqueue,threadpool
t=4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30
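As a sanity check on the ~500 figure, the job count is just the product of the per-parameter value counts in the definition above. A quick sketch (the dictionary keys mirror the parameter names from the block above):

```python
# Count the jobs generated by the epoll parameter matrix above:
# each benchmark execution is one element of the Cartesian product.
from itertools import product

epoll_matrix = {
    "c": ["true", "false"],
    "i": ["false", "true"],
    "a": ["false", "true"],
    "o": ["inline", "iothread", "ioqueue", "threadpool"],
    "t": [4, 6, 8, 10, 12, 13, 14, 15, 16, 17, 18, 20, 22, 24, 26, 28, 30],
}
jobs = list(product(*epoll_matrix.values()))
print(len(jobs))  # 2 * 2 * 2 * 4 * 17 = 544
```

So the exact count is 544; dropping one boolean dimension halves it, and trimming the `t` list shrinks it proportionally.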

ThreadPool with COMPlus_ThreadPool_UnfairSemaphoreSpinLimit=0

I would set c=true here:

e=epoll
s=false
r=false
w=false
c=true
i=true
a=true
o=threadpool
t=1,2,4,6,8,10,11,12,13,14,15,16,17,18,19,20,22,24,26,28,30

io_uring combinations

e=iouring
s=false
r=false
w=false
c=true,false
i=false,true
o=inline,iothread,ioqueue,threadpool
t=1,2,4,6,8,10,11,12,13,14,15,16,17,18,19,20,22,24,26,28,30

@tmds @adamsitnik anything missing? Which combos should I cut off?

antonfirsov commented Mar 26 '20

I would cut a few of the thread counts to reduce the number of runs.

t=4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30

adamsitnik commented Mar 27 '20

Baseline

I consider this the baseline:

e=epoll
s=false
r=false
w=false
c=threadpool
i=threadpool
a=false
o=ioqueue

Benchmarks focused on batching

batch by deferring receives

e=epoll/iouring
r=true
a=true

This should perform worse than:

batch receives on poll thread

e=epoll/iouring
c=inline
a=true

continue inline without batching

It is also interesting to run the continuations inline without batching, to differentiate between the effect of batching and the effect of the continuation mode.

e=epoll
c=inline
a=false

Note: io_uring doesn't have a mode that disables batching.

Benchmarks focused on scheduling

e=epoll
c=inline/threadpool
i=inline/threadpool
o=inline,iothread,ioqueue,threadpool
a=true[,false]

I'm not sure if we should include a=false. I assume it can only be better, but better not to assume? Since this is all about scheduling, we should consider also running all of these with COMPlus_ThreadPool_UnfairSemaphoreSpinLimit=0.

For iouring similar results are expected, so I'd limit to:

e=iouring
c=inline
i=inline
o=inline,iothread,ioqueue,threadpool

tmds commented Mar 27 '20

I'm not sure if we should include a=false. I assume it can only be better, but better not to assume?

I think the point is to have a proper comparison between AIO / no AIO.

Since this is all scheduling, we should consider running all of these also with COMPlus_ThreadPool_UnfairSemaphoreSpinLimit=0

Doesn't this affect only ThreadPool schedulers? Isn't it a waste of time to run all the rest?

antonfirsov commented Mar 27 '20

DefaultTransport to get a baseline:

e=DefaultTransport

I consider this one interesting too. It's important to note that t controls the IOQueue count. If we are using a daily ASP.NET Core build, maybe we can also implement and set w=false?

LinuxTransport for our information:

e=LinuxTransport
i=true

For max performance, you should also set s=true.

tmds commented Mar 27 '20

Doesn't this affect only ThreadPool schedulers? Isn't it a waste of time to run all the rest?

IoQueue is also a ThreadPool scheduler. And I think Kestrel will always use the ThreadPool for dispatching the HTTP handling in its KestrelConnection class.

Maybe these could be left out for COMPlus_ThreadPool_UnfairSemaphoreSpinLimit=0:

c=inline
i=inline
o=inline,iothread
a=true[,false]

tmds commented Mar 27 '20

@adamsitnik @antonfirsov do we want to run middleware json/platform json (https://github.com/tmds/Tmds.LinuxAsync/issues/32)? I'm fine using middleware, but maybe you have a specific preference for platform?

Note that the pipelined plaintext benchmark will suffer from o=inline, since every response will be sent separately, instead of being batched together as the other output schedulers do.
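A toy model of this effect (hypothetical; not the actual transport code): with an inline output scheduler each pipelined response triggers its own send, while a deferred scheduler can coalesce the responses that are ready into a single send.

```python
# Toy model (hypothetical): how many sends a pipeline of N responses costs
# under different output schedulers.
def sends_per_pipeline(pipeline_depth: int, output_scheduler: str) -> int:
    if output_scheduler == "inline":
        return pipeline_depth  # each response is flushed on its own
    return 1  # iothread/ioqueue/threadpool can batch all ready responses

# TechEmpower plaintext pipelines 16 requests per batch.
print(sends_per_pipeline(16, "inline"))   # 16
print(sends_per_pipeline(16, "ioqueue"))  # 1
```

The model ignores partial writes and timing, but it captures why o=inline is expected to look worse specifically on pipelined plaintext.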

tmds commented Mar 27 '20

do we want to run middleware json/platform json

I think that we should use middleware and compare it with the current middleware implementation, which is around 750k RPS (link to PowerBI)

adamsitnik commented Mar 27 '20

@tmds @adamsitnik you can check the results here: https://microsoft-my.sharepoint.com/:x:/p/anfirszo/ETUPVQ8QN9BGmysfL5uDJswBpZsSrKZtuFaMtaoU7ifGUQ?e=s1H2gY

The grouping should be straightforward, but if it's not, I'm happy to answer questions. In several places there are multiple versions of the same diagram with different series enabled/disabled. Red lines in the table mark missing or outlier data.

@tmds does this help in getting insights? Is there anything unexpected to you? Anything else we should run?

antonfirsov commented Mar 31 '20

@tmds as we discussed, I extended the ThreadPool scheduling benchmarks with t=1,2,3, and also added graphs comparing the impact of COMPlus_ThreadPool_UnfairSemaphoreSpinLimit. It's only measurable for small t values.

[graph: impact of COMPlus_ThreadPool_UnfairSemaphoreSpinLimit at small t values]

antonfirsov commented Mar 31 '20

Thanks Anton! The effect being mostly at lower t is expected: at lower t, more work comes in batches from the epoll thread to the ThreadPool.

tmds commented Apr 01 '20

@antonfirsov this is the combination we discussed that would also be interesting to benchmark on Citrine:

e=epoll
c=inline
i=threadpool
o=inline,iothread,ioqueue,threadpool
a=true
t=1,2,3,4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30

tmds commented Apr 03 '20

Anton, can you also run these benchmarks?

e=epoll
c=threadpool
i=inline
o=inline,iothread,ioqueue,threadpool
a=true
t=1,2,3,4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30

tmds commented Apr 03 '20

Also sharing the doc here: https://microsoft.sharepoint.com/:x:/r/teams/SocketsPerfWG/Shared%20Documents/General/ContinuationsComparison-Full-0407.xlsx?d=w4e6c85d2c7c54431b8b77793d894e6d0&csf=1&web=1&e=hg3T2q

@tmds added the 2 graphs you requested

antonfirsov commented Apr 07 '20

Thank you Anton!

tmds commented Apr 08 '20

We're missing

t=1
c=threadpool
i=inline

It's an interesting point on the graph (it should be the best case for c=threadpool,i=inline). I'm going to assume the same value as for t=2.

tmds commented Apr 08 '20