snabb
Understanding "last iteration" trace aborts
I'm trying to find an explanation for a certain kind of trace abort that is among the main causes of loop blacklisting by the compiler in my use cases.
As an example, I'm using code from the apps.ipv6.fragmenter app, which is currently waiting to be upstreamed (https://github.com/snabbco/snabb/pull/1383 and https://github.com/snabbco/snabb/pull/1384). The actual code doesn't matter; I see the same effect all the time with various pieces of code. However, I have not been able to find a reasonably small program that exhibits the issue reproducibly.
Here is the relevant part of the push() method of this particular version of the fragmenter app:
for _ = 1, link.nreadable(input) do -- Code line 313
   local pkt = link.receive(input)
   local mtu = self.mtu
   if self.pmtud then
      local h = ffi.cast(ether_ipv6_header_ptr_t, pkt.data)
      local entry = self.dcache:lookup_ptr(h.ipv6.dst_ip)
      if entry then
         mtu = entry.value.mtu
      end
   end
   -- FIXME: assumes that there is always room to store the MTU at
   -- the end of the payload.
   ffi.cast("uint16_t *", pkt.data + pkt.length)[0] = mtu
   if pkt.length <= mtu + ether_header_len then
      -- No need to fragment; forward it on.
      counter.add(self.shm["out-ipv6-frag-not"])
      link.transmit(output, pkt)
   else
      -- Packet doesn't fit into MTU; need to fragment.
      link.transmit(input, pkt)
   end
end
for _ = 1, link.nreadable(input) do -- Code line 336
   local pkt = link.receive(input)
   local mtu = ffi.cast("uint16_t *", pkt.data + pkt.length)[0]
   local next_header =
      ffi.cast(ether_ipv6_header_ptr_t, pkt.data).ipv6.next_header
   pkt_box[0] = pkt
   self:fragment_and_transmit(next_header, pkt_box, mtu)
   packet.free(pkt_box[0])
end
The line numbers that are relevant in the following JIT dumps are 313 (the start of the first loop) and 336 (the start of the second loop).
In the JIT dump, there are multiple attempts to compile the loop starting at 313, but they all fail and the loop is eventually blacklisted. All of these attempts come in one of exactly two flavors. The first one looks like this:
---- TRACE 584 start fragment.lua:313
0124 KSHORT 5 1
---- TRACE 584 abort fragment.lua:336 -- leaving loop in root trace
What's happening here is that the interpreter is processing an iteration of the loop when the FORL byte code becomes hot (there must have been at least one iteration at this point, because otherwise the FORL byte code would not have been executed at all). The FORL instruction is not part of the trace, but it is evaluated: first the loop variable is incremented and checked against the upper limit (which was determined by link.nreadable(input) earlier, when the FORI byte code was executed). If the loop condition still held, the interpreter would jump to the loop body and recording would continue.
However, in this case it turned out that the loop condition no longer held (i.e. the iteration that ran just before recording started was the final one), so the interpreter records the first instruction that follows the loop. That happens to be the load of the lower bound of the loop starting at line 336, corresponding to the byte code KSHORT 5 1. The recorder notices that this instruction is not within the body of the loop being traced and aborts.
The second flavor of aborted trace related to this loop looks like this:
---- TRACE 567 start fragment.lua:313
0071 UGET 9 0 ; link
0072 TGETS 9 9 7 ; "receive"
0073 MOV 10 1
0074 CALL 9 2 2
0000 . JFUNCF 5 17 ; link.lua:57
0001 . TGETS 1 0 0 ; "packets"
0000 . . . FUNCC ; ffi.meta.__index
0002 . TGETS 2 0 1 ; "read"
0000 . . . FUNCC ; ffi.meta.__index
0003 . TGETV 1 1 2
0000 . . . FUNCC ; ffi.meta.__index
0004 . UGET 2 0 ; band
0005 . TGETS 3 0 1 ; "read"
0000 . . . FUNCC ; ffi.meta.__index
0006 . ADDVN 3 3 0 ; 1
0007 . UGET 4 1 ; size
0008 . SUBVN 4 4 0 ; 1
0009 . CALL 2 2 3
0000 . . FUNCC ; bit.band
0010 . TSETS 2 0 1 ; "read"
0000 . . . FUNCC ; ffi.meta.__newindex
0011 . UGET 2 2 ; counter
0012 . TGETS 2 2 2 ; "add"
0013 . TGETS 3 0 3 ; "stats"
0000 . . . FUNCC ; ffi.meta.__index
0014 . TGETS 3 3 4 ; "rxpackets"
0000 . . . FUNCC ; ffi.meta.__index
0015 . CALL 2 1 2
0000 . . JFUNCF 4 5 ; counter.lua:91
0001 . . TGETS 2 0 0 ; "c"
0000 . . . . FUNCC ; ffi.meta.__index
0002 . . ISTC 3 1
0003 . . JMP 3 => 0005
0004 . . KSHORT 3 1
0005 . . ADDVV 2 2 3
0000 . . . . FUNCC ; ffi.meta.__add
0006 . . TSETS 2 0 0 ; "c"
0000 . . . . FUNCC ; ffi.meta.__newindex
0007 . . RET0 0 1
0016 . UGET 2 2 ; counter
0017 . TGETS 2 2 2 ; "add"
0018 . TGETS 3 0 3 ; "stats"
0000 . . . FUNCC ; ffi.meta.__index
0019 . TGETS 3 3 5 ; "rxbytes"
0000 . . . FUNCC ; ffi.meta.__index
0020 . TGETS 4 1 6 ; "length"
0000 . . . FUNCC ; ffi.meta.__index
0021 . CALL 2 1 3
0000 . . JFUNCF 4 5 ; counter.lua:91
0001 . . TGETS 2 0 0 ; "c"
0000 . . . . FUNCC ; ffi.meta.__index
0002 . . ISTC 3 1
0003 . . JMP 3 => 0005
0005 . . ADDVV 2 2 3
0000 . . . . FUNCC ; ffi.meta.__add
0006 . . TSETS 2 0 0 ; "c"
0000 . . . . FUNCC ; ffi.meta.__newindex
0007 . . RET0 0 1
0022 . RET1 1 2
0075 TGETS 10 0 18 ; "mtu"
0076 TGETS 11 0 19 ; "pmtud"
0077 ISF 11
0078 JMP 12 => 0094
0079 UGET 11 1 ; ffi
0080 TGETS 11 11 8 ; "cast"
0081 UGET 12 2 ; ether_ipv6_header_ptr_t
0082 TGETS 13 9 9 ; "data"
0000 . . FUNCC ; ffi.meta.__index
0083 CALL 11 2 3
0000 . FUNCC ; ffi.cast
0084 TGETS 12 0 20 ; "dcache"
0085 MOV 13 12
0086 TGETS 12 12 21 ; "lookup_ptr"
0087 TGETS 14 11 22 ; "ipv6"
0000 . . FUNCC ; ffi.meta.__index
0088 TGETS 14 14 23 ; "dst_ip"
0000 . . FUNCC ; ffi.meta.__index
0089 CALL 12 2 3
0000 . FUNCF 7 ; ctable.lua:367
0001 . TGETS 2 0 0 ; "hash_fn"
0002 . MOV 3 1
0003 . CALL 2 2 2
0000 . . FUNCC ; ffi.meta.__call
0004 . TGETS 3 0 1 ; "entries"
0005 . UGET 4 0 ; hash_to_index
0006 . MOV 5 2
0007 . TGETS 6 0 2 ; "scale"
0008 . CALL 4 2 3
0000 . . FUNCF 4 ; ctable.lua:69
0001 . . UGET 2 0 ; floor
0002 . . MULVV 3 0 1
0003 . . CALL 2 2 2
0000 . . . FUNCC ; math.floor
0004 . . RET1 2 2
0009 . ADDVV 3 3 4
0000 . . . FUNCC ; ffi.meta.__add
0010 . TGETS 4 3 3 ; "hash"
0000 . . . FUNCC ; ffi.meta.__index
0011 . ISNEV 2 4
0012 . JMP 4 => 0020
0020 . TGETS 4 3 3 ; "hash"
0000 . . . FUNCC ; ffi.meta.__index
0021 . ISGE 4 2
0022 . JMP 4 => 0026
0026 . TGETS 4 3 3 ; "hash"
0000 . . . FUNCC ; ffi.meta.__index
0027 . ISNEV 4 2
0028 . JMP 4 => 0039
0039 . KPRI 4 0
0040 . RET1 4 2
0090 ISF 12
0091 JMP 13 => 0094
0094 UGET 11 1 ; ffi
0095 TGETS 11 11 8 ; "cast"
0096 KSTR 12 25 ; "uint16_t *"
0097 TGETS 13 9 9 ; "data"
0000 . . FUNCC ; ffi.meta.__index
0098 TGETS 14 9 16 ; "length"
0000 . . FUNCC ; ffi.meta.__index
0099 ADDVV 13 13 14
0000 . . FUNCC ; ffi.meta.__add
0100 CALL 11 2 3
0000 . FUNCC ; ffi.cast
0101 TSETB 10 11 0
0000 . . FUNCC ; ffi.meta.__newindex
0102 TGETS 11 9 16 ; "length"
0000 . . FUNCC ; ffi.meta.__index
0103 UGET 12 8 ; ether_header_len
0104 ADDVV 12 10 12
0105 ISGT 11 12
0106 JMP 11 => 0118
0107 UGET 11 5 ; counter
0108 TGETS 11 11 12 ; "add"
0109 TGETS 12 0 13 ; "shm"
0110 TGETS 12 12 14 ; "out-ipv6-frag-not"
0111 CALL 11 1 2
0000 . JFUNCF 4 5 ; counter.lua:91
0001 . TGETS 2 0 0 ; "c"
0000 . . . FUNCC ; ffi.meta.__index
0002 . ISTC 3 1
0003 . JMP 3 => 0005
0004 . KSHORT 3 1
0005 . ADDVV 2 2 3
0000 . . . FUNCC ; ffi.meta.__add
0006 . TSETS 2 0 0 ; "c"
0000 . . . FUNCC ; ffi.meta.__newindex
0007 . RET0 0 1
0112 UGET 11 0 ; link
0113 TGETS 11 11 15 ; "transmit"
0114 MOV 12 2
0115 MOV 13 9
0116 CALL 11 1 3
0000 . JFUNCF 5 6 ; link.lua:71
0001 . GGET 2 0 ; "full"
0002 . MOV 3 0
0003 . CALL 2 2 2
0000 . . FUNCF 4 ; link.lua:90
0001 . . UGET 1 0 ; band
0002 . . TGETS 2 0 0 ; "write"
0000 . . . . FUNCC ; ffi.meta.__index
0003 . . ADDVN 2 2 0 ; 1
0004 . . UGET 3 1 ; size
0005 . . SUBVN 3 3 0 ; 1
0006 . . CALL 1 2 3
0000 . . . FUNCC ; bit.band
0007 . . TGETS 2 0 1 ; "read"
0000 . . . . FUNCC ; ffi.meta.__index
0008 . . ISEQV 1 2
0009 . . JMP 1 => 0012
0010 . . KPRI 1 1
0011 . . JMP 2 => 0013
0013 . . RET1 1 2
0004 . ISF 2
0005 . JMP 3 => 0016
0016 . TGETS 2 0 5 ; "packets"
0000 . . . FUNCC ; ffi.meta.__index
0017 . TGETS 3 0 6 ; "write"
0000 . . . FUNCC ; ffi.meta.__index
0018 . TSETV 1 2 3
0000 . . . FUNCC ; ffi.meta.__newindex
0019 . UGET 2 2 ; band
0020 . TGETS 3 0 6 ; "write"
0000 . . . FUNCC ; ffi.meta.__index
0021 . ADDVN 3 3 0 ; 1
0022 . UGET 4 3 ; size
0023 . SUBVN 4 4 0 ; 1
0024 . CALL 2 2 3
0000 . . FUNCC ; bit.band
0025 . TSETS 2 0 6 ; "write"
0000 . . . FUNCC ; ffi.meta.__newindex
0026 . UGET 2 0 ; counter
0027 . TGETS 2 2 1 ; "add"
0028 . TGETS 3 0 2 ; "stats"
0000 . . . FUNCC ; ffi.meta.__index
0029 . TGETS 3 3 7 ; "txpackets"
0000 . . . FUNCC ; ffi.meta.__index
0030 . CALL 2 1 2
0000 . . JFUNCF 4 5 ; counter.lua:91
0001 . . TGETS 2 0 0 ; "c"
0000 . . . . FUNCC ; ffi.meta.__index
0002 . . ISTC 3 1
0003 . . JMP 3 => 0005
0004 . . KSHORT 3 1
0005 . . ADDVV 2 2 3
0000 . . . . FUNCC ; ffi.meta.__add
0006 . . TSETS 2 0 0 ; "c"
0000 . . . . FUNCC ; ffi.meta.__newindex
0007 . . RET0 0 1
0031 . UGET 2 0 ; counter
0032 . TGETS 2 2 1 ; "add"
0033 . TGETS 3 0 2 ; "stats"
0000 . . . FUNCC ; ffi.meta.__index
0034 . TGETS 3 3 8 ; "txbytes"
0000 . . . FUNCC ; ffi.meta.__index
0035 . TGETS 4 1 9 ; "length"
0000 . . . FUNCC ; ffi.meta.__index
0036 . CALL 2 1 3
0000 . . JFUNCF 4 5 ; counter.lua:91
0001 . . TGETS 2 0 0 ; "c"
0000 . . . . FUNCC ; ffi.meta.__index
0002 . . ISTC 3 1
0003 . . JMP 3 => 0005
0005 . . ADDVV 2 2 3
0000 . . . . FUNCC ; ffi.meta.__add
0006 . . TSETS 2 0 0 ; "c"
0000 . . . . FUNCC ; ffi.meta.__newindex
0007 . . RET0 0 1
0037 . RET0 0 1
0117 JMP 11 => 0123
0123 FORL 5 => 0071
---- TRACE 567 abort fragment.lua:336 -- leaving loop in root trace
In this case, execution of the FORL instruction results in the recording of an actual iteration of the loop. The trace passes through the then-branches of both if statements, executes link.transmit(output, pkt) and finally arrives at the FORL instruction of the first loop (which is located at the end of the loop body). Why does it abort there? The trace recorder recognizes that this iteration happened to be the last one for this execution of the loop. Therefore, it cannot close the loop and must abort.
Here is the complete sequence of aborted traces that eventually leads to blacklisting of the FORL byte code of the loop starting at line 313:
---- TRACE 567 start fragment.lua:313
0071 UGET 9 0 ; link
...
0123 FORL 5 => 0071
---- TRACE 567 abort fragment.lua:336 -- leaving loop in root trace
---- TRACE 574 start fragment.lua:313
0124 KSHORT 5 1
---- TRACE 574 abort fragment.lua:336 -- leaving loop in root trace
---- TRACE 584 start fragment.lua:313
0124 KSHORT 5 1
---- TRACE 584 abort fragment.lua:336 -- leaving loop in root trace
---- TRACE 585 start fragment.lua:313
0071 UGET 9 0 ; link
...
0123 FORL 5 => 0071
---- TRACE 585 abort fragment.lua:336 -- leaving loop in root trace
---- TRACE 597 start fragment.lua:313
0124 KSHORT 5 1
---- TRACE 597 abort fragment.lua:336 -- leaving loop in root trace
---- TRACE 601 start fragment.lua:313
0071 UGET 9 0 ; link
...
0123 FORL 5 => 0071
---- TRACE 601 abort fragment.lua:336 -- leaving loop in root trace
---- TRACE 601 start fragment.lua:313
0124 KSHORT 5 1
---- TRACE 601 abort fragment.lua:336 -- leaving loop in root trace
---- TRACE 603 start fragment.lua:313
0124 KSHORT 5 1
---- TRACE 603 abort fragment.lua:336 -- leaving loop in root trace
---- TRACE 608 start fragment.lua:313
0071 UGET 9 0 ; link
...
0123 FORL 5 => 0071
---- TRACE 608 abort fragment.lua:336 -- leaving loop in root trace
---- TRACE 611 start fragment.lua:313
0124 KSHORT 5 1
---- TRACE 611 abort fragment.lua:336 -- leaving loop in root trace
---- TRACE 613 start fragment.lua:313
0071 UGET 9 0 ; link
...
0123 FORL 5 => 0071
---- TRACE 613 abort fragment.lua:336 -- leaving loop in root trace
After the abort of trace 613, the FORL is replaced by an IFORL, and all traces passing through it from then on abort with something like
0123 IFORL 5 => 0071
---- TRACE 634 abort fragment.lua:313 -- blacklisted
What all of this means is, I guess, that this loop appears to have exactly one or two iterations left whenever it becomes hot, until it is blacklisted. But how can that be?
The loop starts out with the default hot counter (it is counted down, and reaching zero triggers the recorder). The first thing to note is that loop byte codes don't have individual hot counters; instead, they share a pretty small hash table of counters indexed (essentially) by the address of the byte code. That means a particular loop can be traced earlier than expected, but that does no harm and seems unrelated to the issue under discussion.
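The sharing can be sketched with a toy model (Python for brevity; the slot count, hash function, and threshold are my assumptions about what lj_dispatch.h does, not the exact LuaJIT values):

```python
# Toy model of LuaJIT's shared hot-counter table: a small fixed-size
# array indexed by a hash of the byte code's address, so two unrelated
# loops can alias the same slot and "inherit" each other's heat.
HOTCOUNT_SIZE = 64   # assumed slot count
HOTLOOP = 56         # assumed hot-loop threshold

hotcounts = [HOTLOOP] * HOTCOUNT_SIZE

def hot_slot(pc):
    # hypothetical hash: drop the low bits, wrap into the table
    return (pc >> 2) % HOTCOUNT_SIZE

def forl_executed(pc):
    """Decrement the counter for this FORL; True means 'start recording'."""
    slot = hot_slot(pc)
    hotcounts[slot] -= 1
    if hotcounts[slot] <= 0:
        hotcounts[slot] = HOTLOOP
        return True
    return False

# Two FORLs whose addresses differ by a multiple of the table size
# share a slot, so either loop can become hot earlier than expected.
assert hot_slot(0x1000) == hot_slot(0x1000 + HOTCOUNT_SIZE * 4)
```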
Each time recording is aborted, the hot counter is reset and a penalty value is added to it, so it takes more iterations until the next recording attempt. The penalty value is not constant but includes some pseudo-randomness.
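For what it's worth, the escalation towards blacklisting can be modeled roughly like this (a sketch only; PENALTY_MIN, PENALTY_MAX and the random mask are my assumptions about lj_trace.c, not verified constants):

```python
import random

# Toy model of the abort penalty: each abort roughly doubles the stored
# penalty plus a small pseudo-random component; once the penalty exceeds
# a cap, the FORL is patched to IFORL, i.e. the loop is blacklisted.
PENALTY_MIN = 72       # assumed initial penalty
PENALTY_MAX = 60000    # assumed blacklisting cap
PENALTY_RNDMASK = 15   # assumed mask for the pseudo-random part

def aborts_until_blacklist(seed=0):
    rng = random.Random(seed)
    val, attempts = 0, 0
    while True:
        attempts += 1  # one more recording attempt that aborts
        if val == 0:
            val = PENALTY_MIN
        else:
            val = (val << 1) + (rng.getrandbits(32) & PENALTY_RNDMASK)
        if val > PENALTY_MAX:
            return attempts

print(aborts_until_blacklist())  # 11 with these assumed constants
```

With these made-up constants the doubling always crosses the cap after eleven aborted attempts, which happens to line up with the number of aborted traces in the log above; the real LuaJIT constants may give a different count.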
Given all that, how can it be that recording always starts during the last one or two iterations? The only reasonable explanation seems to be that the loop actually never has more than two packets to process over an extended period of time. I guess this can in fact happen occasionally, but I see the same behavior over and over with all sorts of loops, which makes this explanation look pretty improbable to me.
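One way to see how short bursts interact badly with a counter that persists across invocations is a quick simulation (again a toy model, not LuaJIT's actual logic; threshold and penalty range are made-up values):

```python
import random

# The hot counter survives across calls to push().  If each call only
# processes `burst` packets, the counter can reach zero at an arbitrary
# position inside a burst -- and for burst <= 2, *every* trigger leaves
# at most one iteration, i.e. every recording attempt must abort.
def trigger_outcomes(burst, threshold=56, attempts=20, seed=1):
    rng = random.Random(seed)
    count, outcomes = threshold, []
    while len(outcomes) < attempts:
        for i in range(1, burst + 1):     # one FORL execution per iteration
            count -= 1
            if count == 0:
                left = burst - i          # iterations left in this burst
                outcomes.append('abort' if left <= 1 else 'ok')
                # reset with a pseudo-random penalty after each attempt
                count = threshold + rng.randrange(64)
    return outcomes

print(set(trigger_outcomes(2)))   # only 'abort': blacklisting is inevitable
print(trigger_outcomes(100))      # mostly 'ok': large bursts usually succeed
```

So under the "at most two packets per push()" hypothesis the blacklisting would indeed be guaranteed, independent of the pseudo-random penalty; the puzzle is why that hypothesis should hold for so many different loops.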
I'm looking for a deeper understanding of this phenomenon because it seems to be at the core of some of the performance issues I see when my programs are subjected to certain changes of the workload.
Great writeup!
I have to read the JIT code and think about this but please indulge me in a hot-take :-)
Is this just a bug in lj_record.c? If the JIT is waiting for the program counter to return to the start of the loop in order to complete the trace, and this is not happening because the loop is terminating, then perhaps the JIT should instead complete the trace when it reaches the FORL i.e. "pretend" for the purpose of code generation that the branch back into the loop is taken? This seems straightforward if the taken branch was about to immediately complete the loop without recording any further instructions.
Generally I am bugged whenever heuristics like "leaving loop in root trace" lead to blacklistings. Just shouldn't happen IMHO.
I have been able to reproduce this with a minimal example over at raptorjit/raptorjit#203.