snabb
snabb copied to clipboard
The case for unrolling the main push/pull loops
The basic loop structure of Snabb is the following (simplified)
repeat
for i = 1, #breathe_pull_order do
breathe_pull_order[i]:pull()
end
for i = 1, #breathe_push_order do
breathe_push_order[i]:push()
end
until done and done()
The actual workload is processed in the pull()
and push()
methods, which, in general, contain loops of the kind
for _ = 1, link.nreadable(link) do
local p = link.receive(link)
-- Process packet
end
Those are the loops that we would like to be compiled most efficiently, i.e. containing an actual loop in the trace itself (a "looping" trace). Let's assume that the compiler actually does find those traces (but note that this is by no means guaranteed in a complex Snabb program). Due to the design of the tracing compiler, the outer loops cannot form looping traces themselves and end up being implemented by connecting distinct "straight" traces. Those traces are considerably less optimized and the number and inner structure of them can have a significant impact on the overall performance of the program.
I would like to argue that it is beneficial to replace the explicite second-level loops over the breathe_pull_order
and breathe_push_order
arrays by implicit loops using goto
:
repeat
local i = 1
::PULL_LOOP::
do
if i > #breathe_pull_order then goto PULL_EXIT
breathe_pull_order[i]:pull()
i = i + 1
goto PULL_LOOP
end
::PULL_EXIT::
i = 1
::PUSH_LOOP::
do
if i > #breathe_push_order then goto PUSH_EXIT
breathe_push_order[i]:push()
i = i + 1
goto PUSH_LOOP
end
::PUSH_EXIT::
until done and done()
To study the impact of this, consider the following simplified model of the breathe()
loop
local C = require("ffi").C
local napps = 5
local iterations = 1e8
local apps = {}
local mode = "std"
if #main.parameters >= 1 then
mode = main.parameters[1]
end
count = 0
threshold = 1000
if #main.parameters >= 2 then
threshold = tonumber(main.parameters[2])
end
local function new_App ()
local App = {}
App.pull = loadstring([[
return function (self)
if count > threshold then
for _ = 1, 100 do
self.i = self.i+1
end
end
end]])()
return App
end
for i = 1, napps do
apps[i] = setmetatable({ i = 0 }, { __index = new_App() })
end
local start = C.get_time_ns()
if mode == "std" then
repeat
for i = 1, napps do
apps[i]:pull()
end
count = count + 1
until count > iterations
else
repeat
local i = 1
::LOOP::
do
if i > napps then goto DONE end
apps[i]:pull()
i = i + 1
goto LOOP
end
::DONE::
count = count + 1
until count > iterations
end
local now = C.get_time_ns()
print("elapsed "..(tonumber(now-start)/1e9).." seconds")
The program instantiates 5 pseudo-apps, each of which has a trivial pull
method that simply increments a counter. Those methods are called from a nested loop in the style of a breathe
loop. For simplicity, we omit the push
loop, which will not change our findings in a fundamental way because the pull
and push
loops are sequential (i.e. the overall loop structure is the same in a "topological" sense).
The program takes two arguments. The first one is a string that selects which style of loop should be used. std
will use the traditional nested loops while any other value will use the "flattened" loop using goto
. The main repeat
loop keeps track of its iterations in the global variable counter
and the program terminates after 1e8
iterations, printing the time in seconds that has elapsed while the loop was being executed.The second parameter to the program is a number that determines after how many iterations of of the main loop the loops in the pull
methods should be executed. This allows us to simulate a Snabb program which is idle for a while right after starting. I believe that this is actually the case in most real-world applications.
To simulate diversity in the apps being run, the actual functions that implement the pull
methods are created via loadstring()
. This makes sure that they will be implemented by different prototypes.
Let's start by studying a completely idle system with the standard loops
$ sudo ./snabb snsh -jdump=+rs,/tmp/sim-empty-std.txt breathsim.lua std 1e20
elapsed 7.638000843 seconds
Using perf top
we see 7 active traces
27.66% perf-30093.map [.] TRACE_2::breathsim.lua:36
21.58% perf-30093.map [.] TRACE_4::breathsim.lua:37
13.38% perf-30093.map [.] TRACE_7::breathsim.lua:37
11.16% perf-30093.map [.] TRACE_3::breathsim.lua:37
10.11% perf-30093.map [.] TRACE_10::breathsim.lua:37
9.02% perf-30093.map [.] TRACE_9::breathsim.lua:37
5.63% perf-30093.map [.] TRACE_5::breathsim.lua:37
sim-empty-std.txt
The only loop that gets compiled is the for i = 1, napps do
loop. Let's look at trace #2
---- TRACE 2 start breathsim.lua:36
0057 TGETV 11 3 10
0058 MOV 12 11
0059 TGETS 11 11 15 ; "pull"
0060 CALL 11 1 2
0000 . FUNCF 6 ; 0x40ce9090:1
0001 . GGET 1 0 ; "count"
0002 . GGET 2 1 ; "threshold"
0003 . ISGE 2 1
0004 . JMP 1 => 0013
0013 . RET0 0 1
0061 FORL 7 => 0057
---- TRACE 2 IR
.... SNAP #0 [ ---- ]
0001 rax > int SLOAD #9 CRI
0002 > int LE 0001 +2147483646
0003 rbp int SLOAD #8 CI
0004 r13 > tab SLOAD #4 T
0005 int FLOAD 0004 tab.asize
0006 > p32 ABC 0005 0001
0007 rcx p32 FLOAD 0004 tab.array
0008 p32 AREF 0007 0003
0009 rdx > tab ALOAD 0008
0010 rdi p32 HREF 0009 "pull"
0011 > p32 EQ 0010 [0x40dbc458]
0012 r12 tab FLOAD 0009 tab.meta
0013 > tab NE 0012 NULL
0014 int FLOAD 0012 tab.hmask
0015 > int EQ 0014 +1
0016 r11 p32 FLOAD 0012 tab.node
0017 > p32 HREFK 0016 "__index" @1
0018 r10 > tab HLOAD 0017
0019 int FLOAD 0018 tab.hmask
0020 > int EQ 0019 +1
0021 r9 p32 FLOAD 0018 tab.node
0022 > p32 HREFK 0021 "pull" @1
0023 > fun HLOAD 0022
0024 > fun EQ 0023 0x40ce9090:1
0025 r8 tab FLOAD 0x40ce9090:1 func.env
0026 int FLOAD 0025 tab.hmask
0027 > int EQ 0026 +63
0028 rsi p32 FLOAD 0025 tab.node
0029 > p32 HREFK 0028 "count" @15
0030 xmm0 > num HLOAD 0029
0031 > p32 HREFK 0028 "threshold" @11
0032 > num HLOAD 0031
.... SNAP #1 [ ---- ---- ---- ---- ---- ---- ---- ---- 0003 0001 ---- 0003 0x40ce9090:1|0009 ]
0033 > num UGE 0032 0030
0034 rbp + int ADD 0003 +1
.... SNAP #2 [ ---- ---- ---- ---- ---- ---- ---- ---- ]
0035 > int LE 0034 0001
.... SNAP #3 [ ---- ---- ---- ---- ---- ---- ---- ---- 0034 0001 ---- 0034 ]
0036 ------------ LOOP ------------
0037 p32 AREF 0007 0034
0038 rbx > tab ALOAD 0037
0039 rdi p32 HREF 0038 "pull"
0040 > p32 EQ 0039 [0x40dbc458]
0041 rbx tab FLOAD 0038 tab.meta
0042 > tab NE 0041 NULL
0043 int FLOAD 0041 tab.hmask
0044 > int EQ 0043 +1
0045 rbx p32 FLOAD 0041 tab.node
0046 > p32 HREFK 0045 "__index" @1
0047 rbx > tab HLOAD 0046
0048 int FLOAD 0047 tab.hmask
0049 > int EQ 0048 +1
0050 rbx p32 FLOAD 0047 tab.node
0051 > p32 HREFK 0050 "pull" @1
0052 > fun HLOAD 0051
0053 > fun EQ 0052 0x40ce9090:1
0054 rbp + int ADD 0034 +1
.... SNAP #4 [ ---- ---- ---- ---- ---- ---- ---- ---- ]
0055 > int LE 0054 0001
0056 rbp int PHI 0034 0054
This looks like a good trace, but note the guard for the address of the pull
function
0019 int FLOAD 0018 tab.hmask
0020 > int EQ 0019 +1
0021 r9 p32 FLOAD 0018 tab.node
0022 > p32 HREFK 0021 "pull" @1
0023 > fun HLOAD 0022
0024 > fun EQ 0023 0x40ce9090:1
This address is different for each pseudo-app. What happens right after the compiler has finished recording this loop iteration? Since this was a successful trace, the resulting code is optimized and compiled to a proper looping root trace and the system continues by executing the next iteration of the loop from that freshly compiled trace. However, in that very next iteration the mentioned guard fails, because the pull
method of the next app is different from the previous one. As a consequence, the trace is left to the interpreter via exit #0 right away and before the actual loop has even be reached.
After this has happened 10 times (the default number of passes through a side exit after which it gets hot) the compiler starts recording a side trace
---- TRACE 3 start 2/0 breathsim.lua:37
0057 TGETV 11 3 10
0058 MOV 12 11
0059 TGETS 11 11 15 ; "pull"
0060 CALL 11 1 2
0000 . FUNCF 6 ; 0x40ce8c18:1
0001 . GGET 1 0 ; "count"
0002 . GGET 2 1 ; "threshold"
0003 . ISGE 2 1
0004 . JMP 1 => 0013
0013 . RET0 0 1
0061 JFORL 7 2
---- TRACE 3 IR
.... SNAP #0 [ ---- ]
0001 rax > int SLOAD #9 CRI
0002 > int LE 0001 +2147483646
0003 rbp int SLOAD #8 CI
0004 r13 > tab SLOAD #4 T
0005 int FLOAD 0004 tab.asize
0006 > p32 ABC 0005 0001
0007 rcx p32 FLOAD 0004 tab.array
0008 p32 AREF 0007 0003
0009 rdx > tab ALOAD 0008
0010 rdi p32 HREF 0009 "pull"
0011 > p32 EQ 0010 [0x40dbc458]
0012 r12 tab FLOAD 0009 tab.meta
0013 > tab NE 0012 NULL
0014 int FLOAD 0012 tab.hmask
0015 > int EQ 0014 +1
0016 r11 p32 FLOAD 0012 tab.node
0017 > p32 HREFK 0016 "__index" @1
0018 r10 > tab HLOAD 0017
0019 int FLOAD 0018 tab.hmask
0020 > int EQ 0019 +1
0021 r9 p32 FLOAD 0018 tab.node
0022 > p32 HREFK 0021 "pull" @1
0023 > fun HLOAD 0022
0024 > fun EQ 0023 0x40ce8c18:1
0025 r8 tab FLOAD 0x40ce8c18:1 func.env
0026 int FLOAD 0025 tab.hmask
0027 > int EQ 0026 +63
0028 rsi p32 FLOAD 0025 tab.node
0029 > p32 HREFK 0028 "count" @15
0030 xmm0 > num HLOAD 0029
0031 > p32 HREFK 0028 "threshold" @11
0032 > num HLOAD 0031
.... SNAP #1 [ ---- ---- ---- ---- ---- ---- ---- ---- 0003 0001 ---- 0003 0x40ce8c18:1|0009 ]
0033 > num UGE 0032 0030
0034 rbp + int ADD 0003 +1
.... SNAP #2 [ ---- ---- ---- ---- ---- ---- ---- ---- ]
0035 > int LE 0034 0001
.... SNAP #3 [ ---- ---- ---- ---- ---- ---- ---- ---- 0034 0001 ---- 0034 ]
0036 ------------ LOOP ------------
0037 p32 AREF 0007 0034
0038 rbx > tab ALOAD 0037
0039 rdi p32 HREF 0038 "pull"
0040 > p32 EQ 0039 [0x40dbc458]
0041 rbx tab FLOAD 0038 tab.meta
0042 > tab NE 0041 NULL
0043 int FLOAD 0041 tab.hmask
0044 > int EQ 0043 +1
0045 rbx p32 FLOAD 0041 tab.node
0046 > p32 HREFK 0045 "__index" @1
0047 rbx > tab HLOAD 0046
0048 int FLOAD 0047 tab.hmask
0049 > int EQ 0048 +1
0050 rbx p32 FLOAD 0047 tab.node
0051 > p32 HREFK 0050 "pull" @1
0052 > fun HLOAD 0051
0053 > fun EQ 0052 0x40ce8c18:1
0054 rbp + int ADD 0034 +1
.... SNAP #4 [ ---- ---- ---- ---- ---- ---- ---- ---- ]
0055 > int LE 0054 0001
0056 rbp int PHI 0034 0054
You can see that the address of the pull
method is now 0x40ce8c18
when it was 0x40ce9090
in trace #2. Normally, a side trace cannot form a loop but this case is an exception, because the exit happens so early that the code has not yet been specialized enough to prohibit the formation of a new loop. The loop that is being recorded in trace #3 looks exactly like the previous trace and, in fact, the exact same thing is happening with it, i.e. the trace is left right in the next iteration.
The same game is repeated in traces #4 and #5. Sometimes it will happen that one of those traces is entered with the expected pull
method and execution proceeds into the actual loop but since the next iteration will always cause the same failure of the guard, it will never reach the PHI
at the end of the loop. Eventually, such an exit will become hot as well, which is the case for trace #6, which is started from exit #3 of trace #3. It executes the pull
method and returns to the loop which has been compiled in trace #2 and therefore links directly to that trace.
Trace #7 is again like #2, #3, #4 and #5 and traces #8 #9 and #10 are of the same kind as #6.
The good news is that everything is running compiled. The bad thing is that none of the optimized loops are actually executed.
Let's compare this with the "flattened" loop
$ sudo ./snabb snsh -jdump=+rs,/tmp/sim-empty-flat.txt breathsim.lua flat 1e20
elapsed 0.281710828 seconds
We can already see that the elapsed time is smaller by a factor of almost 30. The program compiles to a single trace
---- TRACE 2 start breathsim.lua:42
0070 KSHORT 7 1
0071 ISGE 1 7
0072 JMP 8 => 0074
0074 TGETV 8 3 7
0075 MOV 9 8
0076 TGETS 8 8 15 ; "pull"
0077 CALL 8 1 2
0000 . FUNCF 6 ; 0x418d2eb0:1
0001 . GGET 1 0 ; "count"
0002 . GGET 2 1 ; "threshold"
0003 . ISGE 2 1
0004 . JMP 1 => 0013
0013 . RET0 0 1
0078 ADDVN 7 7 1 ; 1
0079 JMP 8 => 0071
0071 ISGE 1 7
0072 JMP 8 => 0074
0074 TGETV 8 3 7
0075 MOV 9 8
0076 TGETS 8 8 15 ; "pull"
0077 CALL 8 1 2
0000 . FUNCF 6 ; 0x418c68a0:1
0001 . GGET 1 0 ; "count"
0002 . GGET 2 1 ; "threshold"
0003 . ISGE 2 1
0004 . JMP 1 => 0013
0013 . RET0 0 1
0078 ADDVN 7 7 1 ; 1
0079 JMP 8 => 0071
0071 ISGE 1 7
0072 JMP 8 => 0074
0074 TGETV 8 3 7
0075 MOV 9 8
0076 TGETS 8 8 15 ; "pull"
0077 CALL 8 1 2
0000 . FUNCF 6 ; 0x412b9d78:1
0001 . GGET 1 0 ; "count"
0002 . GGET 2 1 ; "threshold"
0003 . ISGE 2 1
0004 . JMP 1 => 0013
0013 . RET0 0 1
0078 ADDVN 7 7 1 ; 1
0079 JMP 8 => 0071
0071 ISGE 1 7
0072 JMP 8 => 0074
0074 TGETV 8 3 7
0075 MOV 9 8
0076 TGETS 8 8 15 ; "pull"
0077 CALL 8 1 2
0000 . FUNCF 6 ; 0x418d7cb8:1
0001 . GGET 1 0 ; "count"
0002 . GGET 2 1 ; "threshold"
0003 . ISGE 2 1
0004 . JMP 1 => 0013
0013 . RET0 0 1
0078 ADDVN 7 7 1 ; 1
0079 JMP 8 => 0071
0071 ISGE 1 7
0072 JMP 8 => 0074
0074 TGETV 8 3 7
0075 MOV 9 8
0076 TGETS 8 8 15 ; "pull"
0077 CALL 8 1 2
0000 . FUNCF 6 ; 0x41fe4698:1
0001 . GGET 1 0 ; "count"
0002 . GGET 2 1 ; "threshold"
0003 . ISGE 2 1
0004 . JMP 1 => 0013
0013 . RET0 0 1
0078 ADDVN 7 7 1 ; 1
0079 JMP 8 => 0071
0071 ISGE 1 7
0072 JMP 8 => 0074
0073 JMP 8 => 0080
0080 GGET 8 6 ; "count"
0081 ADDVN 8 8 1 ; 1
0082 GSET 8 6 ; "count"
0083 GGET 8 6 ; "count"
0084 ISGE 2 8
0085 JMP 8 => 0069
0069 LOOP 7 => 0086
---- TRACE 2 IR
.... SNAP #0 [ ---- ---- ---- ---- ---- ---- ---- ---- ]
0001 xmm10 > num SLOAD #2 T
.... SNAP #1 [ ---- ---- ---- ---- ---- ---- ---- ---- +1 ]
0002 > num UGE 0001 +1
.... SNAP #2 [ ---- ---- ---- ---- ---- ---- ---- ---- +1 ]
0003 rax > tab SLOAD #4 T
0004 int FLOAD 0003 tab.asize
0005 > int ABC 0004 +5
0006 rax p32 FLOAD 0003 tab.array
0007 p32 AREF 0006 +1
0008 r10 > tab ALOAD 0007
0009 rdi p32 HREF 0008 "pull"
0010 > p32 EQ 0009 [0x41bb5458]
0011 rcx tab FLOAD 0008 tab.meta
0012 > tab NE 0011 NULL
0013 int FLOAD 0011 tab.hmask
0014 > int EQ 0013 +1
0015 rcx p32 FLOAD 0011 tab.node
0016 > p32 HREFK 0015 "__index" @1
0017 rcx > tab HLOAD 0016
0018 int FLOAD 0017 tab.hmask
0019 > int EQ 0018 +1
0020 rcx p32 FLOAD 0017 tab.node
0021 > p32 HREFK 0020 "pull" @1
0022 > fun HLOAD 0021
0023 > fun EQ 0022 0x418d2eb0:1
0024 rcx tab FLOAD 0x418d2eb0:1 func.env
0025 int FLOAD 0024 tab.hmask
0026 > int EQ 0025 +63
0027 rcx p32 FLOAD 0024 tab.node
0028 r11 > p32 HREFK 0027 "count" @15
0029 xmm3 > num HLOAD 0028
0030 > p32 HREFK 0027 "threshold" @11
0031 xmm8 > num HLOAD 0030
.... SNAP #3 [ ---- ---- ---- ---- ---- ---- ---- ---- +1 0x418d2eb0:1|0008 ]
0032 > num UGE 0031 0029
.... SNAP #4 [ ---- ---- ---- ---- ---- ---- ---- ---- +2 ]
0033 > num UGE 0001 +2
.... SNAP #5 [ ---- ---- ---- ---- ---- ---- ---- ---- +2 ]
0034 p32 AREF 0006 +2
0035 r8 > tab ALOAD 0034
0036 rdi p32 HREF 0035 "pull"
0037 > p32 EQ 0036 [0x41bb5458]
0038 rcx tab FLOAD 0035 tab.meta
0039 > tab NE 0038 NULL
0040 int FLOAD 0038 tab.hmask
0041 > int EQ 0040 +1
0042 rcx p32 FLOAD 0038 tab.node
0043 > p32 HREFK 0042 "__index" @1
0044 rcx > tab HLOAD 0043
0045 int FLOAD 0044 tab.hmask
0046 > int EQ 0045 +1
0047 rcx p32 FLOAD 0044 tab.node
0048 > p32 HREFK 0047 "pull" @1
0049 > fun HLOAD 0048
0050 > fun EQ 0049 0x418c68a0:1
0051 rcx tab FLOAD 0x418c68a0:1 func.env
0052 int FLOAD 0051 tab.hmask
0053 > int EQ 0052 +63
0054 rcx p32 FLOAD 0051 tab.node
0055 r9 > p32 HREFK 0054 "count" @15
0056 xmm2 > num HLOAD 0055
0057 > p32 HREFK 0054 "threshold" @11
0058 xmm5 > num HLOAD 0057
.... SNAP #6 [ ---- ---- ---- ---- ---- ---- ---- ---- +2 0x418c68a0:1|0035 ]
0059 > num UGE 0058 0056
.... SNAP #7 [ ---- ---- ---- ---- ---- ---- ---- ---- +3 ]
0060 > num UGE 0001 +3
.... SNAP #8 [ ---- ---- ---- ---- ---- ---- ---- ---- +3 ]
0061 p32 AREF 0006 +3
0062 rsi > tab ALOAD 0061
0063 rdi p32 HREF 0062 "pull"
0064 > p32 EQ 0063 [0x41bb5458]
0065 rcx tab FLOAD 0062 tab.meta
0066 > tab NE 0065 NULL
0067 int FLOAD 0065 tab.hmask
0068 > int EQ 0067 +1
0069 rcx p32 FLOAD 0065 tab.node
0070 > p32 HREFK 0069 "__index" @1
0071 rcx > tab HLOAD 0070
0072 int FLOAD 0071 tab.hmask
0073 > int EQ 0072 +1
0074 rcx p32 FLOAD 0071 tab.node
0075 > p32 HREFK 0074 "pull" @1
0076 > fun HLOAD 0075
0077 > fun EQ 0076 0x412b9d78:1
0078 rcx tab FLOAD 0x412b9d78:1 func.env
0079 int FLOAD 0078 tab.hmask
0080 > int EQ 0079 +63
0081 rcx p32 FLOAD 0078 tab.node
0082 rdi > p32 HREFK 0081 "count" @15
0083 xmm15 > num HLOAD 0082
0084 > p32 HREFK 0081 "threshold" @11
0085 xmm4 > num HLOAD 0084
.... SNAP #9 [ ---- ---- ---- ---- ---- ---- ---- ---- +3 0x412b9d78:1|0062 ]
0086 > num UGE 0085 0083
.... SNAP #10 [ ---- ---- ---- ---- ---- ---- ---- ---- +4 ]
0087 > num UGE 0001 +4
.... SNAP #11 [ ---- ---- ---- ---- ---- ---- ---- ---- +4 ]
0088 p32 AREF 0006 +4
0089 rbx > tab ALOAD 0088
0090 rbp p32 HREF 0089 "pull"
0091 > p32 EQ 0090 [0x41bb5458]
0092 rcx tab FLOAD 0089 tab.meta
0093 > tab NE 0092 NULL
0094 int FLOAD 0092 tab.hmask
0095 > int EQ 0094 +1
0096 rcx p32 FLOAD 0092 tab.node
0097 > p32 HREFK 0096 "__index" @1
0098 rcx > tab HLOAD 0097
0099 int FLOAD 0098 tab.hmask
0100 > int EQ 0099 +1
0101 rcx p32 FLOAD 0098 tab.node
0102 > p32 HREFK 0101 "pull" @1
0103 > fun HLOAD 0102
0104 > fun EQ 0103 0x418d7cb8:1
0105 rcx tab FLOAD 0x418d7cb8:1 func.env
0106 int FLOAD 0105 tab.hmask
0107 > int EQ 0106 +63
0108 rcx p32 FLOAD 0105 tab.node
0109 rbp > p32 HREFK 0108 "count" @15
0110 xmm13 > num HLOAD 0109
0111 > p32 HREFK 0108 "threshold" @11
0112 xmm3 > num HLOAD 0111
.... SNAP #12 [ ---- ---- ---- ---- ---- ---- ---- ---- +4 0x418d7cb8:1|0089 ]
0113 > num UGE 0112 0110
.... SNAP #13 [ ---- ---- ---- ---- ---- ---- ---- ---- +5 ]
0114 > num UGE 0001 +5
.... SNAP #14 [ ---- ---- ---- ---- ---- ---- ---- ---- +5 ]
0115 p32 AREF 0006 +5
0116 rcx > tab ALOAD 0115
0117 rdx p32 HREF 0116 "pull"
0118 > p32 EQ 0117 [0x41bb5458]
0119 rax tab FLOAD 0116 tab.meta
0120 > tab NE 0119 NULL
0121 int FLOAD 0119 tab.hmask
0122 > int EQ 0121 +1
0123 rax p32 FLOAD 0119 tab.node
0124 > p32 HREFK 0123 "__index" @1
0125 rax > tab HLOAD 0124
0126 int FLOAD 0125 tab.hmask
0127 > int EQ 0126 +1
0128 rax p32 FLOAD 0125 tab.node
0129 > p32 HREFK 0128 "pull" @1
0130 > fun HLOAD 0129
0131 > fun EQ 0130 0x41fe4698:1
0132 rax tab FLOAD 0x41fe4698:1 func.env
0133 int FLOAD 0132 tab.hmask
0134 > int EQ 0133 +63
0135 rax p32 FLOAD 0132 tab.node
0136 rdx > p32 HREFK 0135 "count" @15
0137 xmm11 > num HLOAD 0136
0138 > p32 HREFK 0135 "threshold" @11
0139 xmm2 > num HLOAD 0138
.... SNAP #15 [ ---- ---- ---- ---- ---- ---- ---- ---- +5 0x41fe4698:1|0116 ]
0140 > num UGE 0139 0137
.... SNAP #16 [ ---- ---- ---- ---- ---- ---- ---- ---- +6 ]
0141 > num LT 0001 +6
.... SNAP #17 [ ---- ---- ---- ---- ---- ---- ---- ---- +6 ]
0142 r15 fun SLOAD #0 R
0143 r14 tab FLOAD 0142 func.env
0144 int FLOAD 0143 tab.hmask
0145 > int EQ 0144 +63
0146 r13 p32 FLOAD 0143 tab.node
0147 rax > p32 HREFK 0146 "count" @15
0148 xmm7 > num HLOAD 0147
0149 xmm7 + num ADD 0148 +1
0150 num HSTORE 0147 0149
.... SNAP #18 [ ---- ---- ---- ---- ---- ---- ---- ---- +6 ---- ]
0151 xmm0 > num SLOAD #3 T
.... SNAP #19 [ ---- ---- ---- ---- ---- ---- ---- ---- ---- ]
0152 > num UGE 0151 0149
.... SNAP #20 [ ---- ---- ---- ---- ---- ---- ---- ---- ]
0153 ------------ LOOP ------------
.... SNAP #21 [ ---- ---- ---- ---- ---- ---- ---- ---- +1 ]
0154 xmm6 > num HLOAD 0028
.... SNAP #22 [ ---- ---- ---- ---- ---- ---- ---- ---- +1 0x418d2eb0:1|0008 ]
0155 > num ULE 0154 0031
.... SNAP #23 [ ---- ---- ---- ---- ---- ---- ---- ---- +2 ]
0156 xmm6 > num HLOAD 0055
.... SNAP #24 [ ---- ---- ---- ---- ---- ---- ---- ---- +2 0x418c68a0:1|0035 ]
0157 > num ULE 0156 0058
.... SNAP #25 [ ---- ---- ---- ---- ---- ---- ---- ---- +3 ]
0158 xmm6 > num HLOAD 0082
.... SNAP #26 [ ---- ---- ---- ---- ---- ---- ---- ---- +3 0x412b9d78:1|0062 ]
0159 > num ULE 0158 0085
.... SNAP #27 [ ---- ---- ---- ---- ---- ---- ---- ---- +4 ]
0160 xmm6 > num HLOAD 0109
.... SNAP #28 [ ---- ---- ---- ---- ---- ---- ---- ---- +4 0x418d7cb8:1|0089 ]
0161 > num ULE 0160 0112
.... SNAP #29 [ ---- ---- ---- ---- ---- ---- ---- ---- +5 ]
0162 xmm6 > num HLOAD 0136
.... SNAP #30 [ ---- ---- ---- ---- ---- ---- ---- ---- +5 0x41fe4698:1|0116 ]
0163 > num ULE 0162 0139
0164 xmm7 + num ADD 0149 +1
0165 num HSTORE 0147 0164
.... SNAP #31 [ ---- ---- ---- ---- ---- ---- ---- ---- ---- ]
0166 > num ULE 0164 0151
0167 xmm7 num PHI 0149 0164
Essentially, the flattened loop is like an explicit unrolling which removes the guards for the individual function prototypes. The loop is super compact and contains only the actual function calls, one after the other.
In this extreme case, where non of the apps share the same pull
functions, this is actually the optimum. But any real-world Snabb program will probably also have a significant number of distinct apps in their networks, which makes this example relevant.
Now let's look what happens if the apps get busy after a short while, e.g. 1000 iterations of the main loop. During the first 1000 iterations, the compiler will find the exact same traces as just discussed. But what happens once the pull
methods start executing their own loops?
$ sudo ./snabb snsh -jdump=+rs,/tmp/sim-std.txt breathsim.lua std 1000
elapsed 62.653934395 seconds
sim-std.txt There are 20 active traces
18.37% perf-30395.map [.] TRACE_14::(string):3
17.51% perf-30395.map [.] TRACE_11::(string):3
17.40% perf-30395.map [.] TRACE_15::(string):3
16.81% perf-30395.map [.] TRACE_12::(string):3
15.77% perf-30395.map [.] TRACE_13::(string):3
3.16% perf-30395.map [.] TRACE_2::breathsim.lua:36
2.30% perf-30395.map [.] TRACE_3::breathsim.lua:37
2.03% perf-30395.map [.] TRACE_5::breathsim.lua:37
1.79% perf-30395.map [.] TRACE_4::breathsim.lua:37
1.04% perf-30395.map [.] TRACE_7::breathsim.lua:37
0.50% perf-30395.map [.] TRACE_25::(string):7
0.33% perf-30395.map [.] TRACE_22::(string):3
0.30% perf-30395.map [.] TRACE_18::(string):3
0.27% perf-30395.map [.] TRACE_16::(string):3
0.24% perf-30395.map [.] TRACE_20::(string):3
0.23% perf-30395.map [.] TRACE_24::(string):3
0.18% perf-30395.map [.] TRACE_19::(string):7
0.17% perf-30395.map [.] TRACE_23::(string):7
0.16% perf-30395.map [.] TRACE_17::(string):7
0.16% perf-30395.map [.] TRACE_21::(string):7
As expected, we see the traces #2, #3, #4, #5 and #7 from the initial phase before the innermost loops become active. The five most active traces (#11-#15) all look like this
---- TRACE 11 start 0x41779eb0:3
0009 TGETS 5 0 2 ; "i"
0010 ADDVN 5 5 0 ; 1
0011 TSETS 5 0 2 ; "i"
0012 FORL 1 => 0009
---- TRACE 11 IR
.... SNAP #0 [ ---- ]
0001 rbp int SLOAD #2 CI
0002 rdx > tab SLOAD #1 T
0003 int FLOAD 0002 tab.hmask
0004 > int EQ 0003 +1
0005 rcx p32 FLOAD 0002 tab.node
0006 rax > p32 HREFK 0005 "i" @0
0007 xmm7 > num HLOAD 0006
0008 xmm7 + num ADD 0007 +1
0009 num HSTORE 0006 0008
0010 rbp + int ADD 0001 +1
.... SNAP #1 [ ---- ---- ]
0011 > int LE 0010 +100
.... SNAP #2 [ ---- ---- 0010 ---- ---- 0010 ]
0012 ------------ LOOP ------------
0013 xmm7 + num ADD 0008 +1
0014 num HSTORE 0006 0013
0015 rbp + int ADD 0010 +1
.... SNAP #3 [ ---- ---- ]
0016 > int LE 0015 +100
0017 rbp int PHI 0010 0015
0018 xmm7 num PHI 0008 0013
Clearly, those are the pull
loops for each of the 5 apps and they are all compiled nicely, which is good. All remaining traces are simple "straight" traces that connect these loops with each other and with the initial traces.
Compare this with the flattened loop
sudo ./snabb snsh -jdump=+rs,/tmp/sim-flat.txt breathsim.lua flat 1000
elapsed 52.961463354 seconds
19.42% perf-30471.map [.] TRACE_6::(string):3
18.73% perf-30471.map [.] TRACE_7::(string):3
18.47% perf-30471.map [.] TRACE_3::(string):3
18.20% perf-30471.map [.] TRACE_5::(string):3
18.02% perf-30471.map [.] TRACE_4::(string):3
1.27% perf-30471.map [.] TRACE_10::(string):7
1.15% perf-30471.map [.] TRACE_8::(string):7
1.15% perf-30471.map [.] TRACE_11::(string):7
1.12% perf-30471.map [.] TRACE_9::(string):7
0.79% perf-30471.map [.] TRACE_2::breathsim.lua:42
0.48% perf-30471.map [.] TRACE_12::(string):7
0.27% perf-30471.map [.] TRACE_13::(string):3
We can see trace #2 from the "emtpy" phase and traces #3, #4, #5, #6 and #7 that contain the exact same looping traces for the pull
methods. Also here, there are a handful of straight traces connecting all the loops.
The difference is that in the standard scenario, there is a significantly larger part of relatively inefficient code which translates to a running time which is 20% lower than in the standard case.
As a final test, consider the case when none of the apps is idle when the program starts.
$ sudo ./snabb snsh -jdump=+rs,/tmp/sim-full-std.txt breathsim.lua std 0
elapsed 52.692324165 seconds
18.77% perf-30544.map [.] TRACE_2::(string):3
18.70% perf-30544.map [.] TRACE_4::(string):3
18.50% perf-30544.map [.] TRACE_3::(string):3
18.36% perf-30544.map [.] TRACE_5::(string):3
18.29% perf-30544.map [.] TRACE_6::(string):3
1.47% perf-30544.map [.] TRACE_11::(string):7
1.19% perf-30544.map [.] TRACE_7::(string):7
1.18% perf-30544.map [.] TRACE_8::(string):7
1.15% perf-30544.map [.] TRACE_10::(string):7
1.14% perf-30544.map [.] TRACE_9::(string):7
Here, the innermost loops are naturally compiled first and none of the side-traces observed in the "empty" scenario occur. In fact, the trace structure is essentially the same as the one in the "flat + empty" case.
How does it compare to the modified loop case?
$ sudo ./snabb snsh -jdump=+rs,/tmp/sim-full-flat.txt breathsim.lua flat 0
elapsed 52.60125439 seconds
20.03% perf-30579.map [.] TRACE_2::(string):3
19.12% perf-30579.map [.] TRACE_5::(string):3
19.00% perf-30579.map [.] TRACE_3::(string):3
17.59% perf-30579.map [.] TRACE_4::(string):3
16.99% perf-30579.map [.] TRACE_6::(string):3
1.28% perf-30579.map [.] TRACE_7::(string):7
1.27% perf-30579.map [.] TRACE_10::(string):7
1.25% perf-30579.map [.] TRACE_11::(string):7
1.25% perf-30579.map [.] TRACE_8::(string):7
1.18% perf-30579.map [.] TRACE_9::(string):7
In the "full" scenario, both looping mechanisms are equivalent.
The conclusion is that we can significantly improve the performance in case the system is idle for a while when it starts up without any negative effects for the case when the system is loaded immediately. I believe the former case is actually more common in realistic scenarios.
Based on these findings I propose to apply the "flattening" of the breathe_pull_order
and breathe_push_order
loops in core.app
.
I have tested this modification with the l2vpn
program in a production-like environment. The engine contained about 30 apps (20 different app classes) and about 60 links. In any realistic scenario, changes are very high that the system is idle after a regular start. I confirmed that it exhibits exactly the behavior described in this issue.
The "flattening" of the loops has reduced the noise of the compiler significantly, in particular when confronted with changing workloads (e.g. direction of bulk traffic from encap-heavy to decap-heavy, from ipv4 only to ipv6 only, from unfragmented to fragmented traffic). The observed effect is that there are much fewer aborted traces and much fewer implicit blacklistings of side traces. Performance has become much more stable across all workloads and "catastrophic JIT failures" have practically vanished. Those failures would result in highly suboptimal traces or even substantial interpreter fallback (>50%) when the system was exposed to a wide range of different workloads.
I am confident enough to include this change in the next production release of l2vpn
. I would be very much interested to see the effect this will have on other Snabb programs.