Tmds.LinuxAsync
Benchmark parameters for Citrine runs
Let's define the parameters we want to use for extensive Citrine runs (or at least for the first one).
The machines are mine for Friday (big thanks to Sebastien!), but my naive approach of trying all combinations of the major parameters defines too many jobs, even for a whole-day run. Are some of these combinations worthless to include even in a comprehensive analysis?
I'm using the syntax of my new tool for #74 to define the benchmarks.
DefaultTransport to get a baseline:
e=DefaultTransport
t=4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30
LinuxTransport for our information:
e=LinuxTransport
i=true
t=4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30
epoll combinations
Normally I would run epoll with all possible combinations of the important parameters, but the following definition would mean ~500 benchmark executions (see the sketch after this definition). It would be nice to reduce that.
e=epoll
s=false
r=false
w=false
c=true,false
i=false,true
a=false,true
o=inline,iothread,ioqueue,threadpool
t=4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30
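For concreteness, here is how the ~500 figure falls out of the matrix above (a back-of-the-envelope C# sketch; only the axis sizes come from the definition, the code itself is just illustrative):

```csharp
using System;

// Cartesian product of the epoll matrix above:
// c (2) x i (2) x a (2) x o (4) x t (17) value counts.
int[] axisSizes = { 2, 2, 2, 4, 17 };
int jobs = 1;
foreach (int size in axisSizes)
    jobs *= size;
Console.WriteLine(jobs); // 544, i.e. the "~500 benchmark executions"
```

The io_uring matrix below works out the same way: 2 · 2 · 4 · 21 = 336 jobs.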
ThreadPool with COMPlus_ThreadPool_UnfairSemaphoreSpinLimit=0
I would set c=true here (a sketch of applying the environment variable follows the definition):
e=epoll
s=false
r=false
w=false
c=true
i=true
a=true
o=threadpool
t=1,2,4,6,8,10,11,12,13,14,15,16,17,18,19,20,22,24,26,28,30
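A note on how the spin limit gets applied: it's an environment variable read once at process start, so it has to be set before the benchmarked server launches. A minimal sketch, assuming the jobs are spawned as child processes (the launcher code is hypothetical; only the variable name comes from the runtime):

```csharp
using System.Diagnostics;

// Hypothetical launcher: COMPlus_* knobs are read at startup,
// so the variable must be in the child process environment.
var psi = new ProcessStartInfo("dotnet", "run -c Release");
psi.EnvironmentVariables["COMPlus_ThreadPool_UnfairSemaphoreSpinLimit"] = "0";
Process.Start(psi)!.WaitForExit();
```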
io_uring combinations
e=iouring
s=false
r=false
w=false
c=true,false
i=false,true
o=inline,iothread,ioqueue,threadpool
t=1,2,4,6,8,10,11,12,13,14,15,16,17,18,19,20,22,24,26,28,30
@tmds @adamsitnik anything missing? Which combos should I cut off?
I would cut a few of the thread counts to reduce the number of runs:
t=4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30
Baseline
I consider this the baseline:
e=epoll
s=false
r=false
w=false
c=threadpool
i=threadpool
a=false
o=ioqueue
Benchmarks focused on batching
batch by deferring receives
e=epoll/iouring
r=true
a=true
This should perform worse than:
batch receives on poll thread
e=epoll/iouring
c=inline
a=true
continue inline without batching
It is also interesting to run the continuations inline, without batching, to differentiate between the effects of batching and of inline continuations.
e=epoll
c=inline
a=false
note: iouring doesn't have a mode that disables batching.
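To make the c distinction concrete, here is a minimal sketch of the two continuation strategies being compared (illustrative names only, not the actual Tmds.LinuxAsync API):

```csharp
using System;
using System.Threading;

static class ContinuationSketch
{
    // Illustrative: how an event loop might run a completed operation's continuation.
    public static void Run(Action continuation, bool inline)
    {
        if (inline)
        {
            // c=inline: run on the epoll thread itself. Completions from the
            // same wakeup are processed back-to-back (natural batching), but a
            // slow continuation blocks the event loop.
            continuation();
        }
        else
        {
            // c=threadpool: hand off to the ThreadPool. The event loop stays
            // responsive, at the cost of a dispatch per completion.
            ThreadPool.UnsafeQueueUserWorkItem(static state => ((Action)state!)(), continuation);
        }
    }
}
```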
Benchmarks focused on scheduling
e=epoll
c=inline/threadpool
i=inline/threadpool
o=inline,iothread,ioqueue,threadpool
a=true[,false]
I'm not sure if we should include a=false. I assume it can only be better, but better not to assume?
Since this is all scheduling, we should consider running all of these also with COMPlus_ThreadPool_UnfairSemaphoreSpinLimit=0.
For iouring similar results are expected, so I'd limit to:
e=iouring
c=inline
i=inline
o=inline,iothread,ioqueue,threadpool
> I'm not sure if we should include a=false. I assume it can only be better, but better not to assume?
I think the point is to have a proper comparison between AIO / no AIO.
> Since this is all scheduling, we should consider running all of these also with COMPlus_ThreadPool_UnfairSemaphoreSpinLimit=0.
Doesn't this affect only ThreadPool schedulers? Isn't it a waste of time to run all the rest?
> DefaultTransport to get a baseline:
> e=DefaultTransport
I consider this one interesting too. It's important to note that t controls the IOQueue count.
If we are using a daily ASP.NET Core build, maybe we can also implement and set w=false?
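For reference, with the default sockets transport the knob that t maps to is SocketTransportOptions.IOQueueCount; a minimal sketch of the standard Kestrel configuration (not the benchmark tool itself):

```csharp
using Microsoft.AspNetCore.Hosting;

public static class TransportConfig
{
    // Default (Sockets) transport: t corresponds to the IOQueue count.
    public static IWebHostBuilder WithIoQueues(this IWebHostBuilder builder, int t) =>
        builder.UseSockets(options => options.IOQueueCount = t);
}
```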
> LinuxTransport for our information:
> e=LinuxTransport
> i=true
For max performance, you should also set s=true.
> Doesn't this affect only ThreadPool schedulers? Isn't it a waste of time to run all the rest?
IoQueue is also a ThreadPool scheduler. And I think Kestrel will always use the ThreadPool for dispatching the HTTP handling in its KestrelConnection class.
Maybe these could be left out for COMPlus_ThreadPool_UnfairSemaphoreSpinLimit=0:
c=inline
i=inline
o=inline,iothread
a=true[,false]
@adamsitnik @antonfirsov do we want to run middleware json/platform json (https://github.com/tmds/Tmds.LinuxAsync/issues/32)? I'm fine using middleware, but maybe you have a specific preference for platform?
Note that the pipelined plaintext will suffer from o=inline, since every response will be sent separately instead of being batched together by the other output schedulers.
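An illustrative sketch of why that is (not the actual transport code): with an inline output scheduler each pipelined response triggers its own send, while a queueing scheduler can coalesce them into one flush:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net.Sockets;

static class OutputBatchingSketch
{
    // o=inline: each response is written as soon as it is produced,
    // one send syscall per pipelined response.
    public static void SendInline(Socket socket, List<byte[]> responses)
    {
        foreach (byte[] response in responses)
            socket.Send(response);
    }

    // A queueing output scheduler can flush the pipelined responses
    // together with a single send.
    public static void SendBatched(Socket socket, List<byte[]> responses) =>
        socket.Send(responses.SelectMany(r => r).ToArray());
}
```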
> do we want to run middleware json/platform json
I think that we should use middleware and compare it with the current middleware implementation, which is around 750k RPS (link to PowerBI).
@tmds @adamsitnik you can check the results here: https://microsoft-my.sharepoint.com/:x:/p/anfirszo/ETUPVQ8QN9BGmysfL5uDJswBpZsSrKZtuFaMtaoU7ifGUQ?e=s1H2gY
The grouping should be straightforward, but if it's not, I'm happy to answer questions. In several places there are multiple versions of the same diagram with different series enabled/disabled. Red lines in the table mark missing or outlier data.
@tmds does this help with getting insights? Is anything unexpected to you? Anything else we should run?
@tmds as we discussed, I extended the ThreadPool scheduling benchmarks with t=1,2,3, and also added graphs comparing the impact of COMPlus_ThreadPool_UnfairSemaphoreSpinLimit. Its effect is only measurable for small t values.
Thanks Anton! The effect showing up mostly at lower t is expected: at lower t, more work comes in batches from the epoll thread to the ThreadPool.
@antonfirsov this is the combination we discussed that would be interesting also to benchmark on Citrine:
e=epoll
c=inline
i=threadpool
o=inline,iothread,ioqueue,threadpool
a=true
t=1,2,3,4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30
Anton, can you also run these benchmarks?
e=epoll
c=threadpool
i=inline
o=inline,iothread,ioqueue,threadpool
a=true
t=1,2,3,4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30
Also sharing the doc here: https://microsoft.sharepoint.com/:x:/r/teams/SocketsPerfWG/Shared%20Documents/General/ContinuationsComparison-Full-0407.xlsx?d=w4e6c85d2c7c54431b8b77793d894e6d0&csf=1&web=1&e=hg3T2q
@tmds I added the 2 graphs you requested.
Thank you Anton!
We're missing t=1 for:
c=threadpool
i=inline
It's an interesting point on the graph (it should be best for c=threadpool,i=inline). I'm going to assume the same value as t=2.