snabb
ctable performance issue related to pointer arithmetic
I started using ctable indirectly through the IPv6 fragmentation code from apps.lwaftr and was surprised to see much lower performance than expected on the reassembly side. The profiler showed substantial amounts of interpreted code and GC occurring in the fast path.
A trace dump revealed many instances of trace aborts like this one:
...
0014 . . . . TGETS 4 0 3 ; "remove_ptr"
0015 . . . . MOV 6 3
0016 . . . . CALL 4 1 3
0000 . . . . . FUNCF 7 ; ctable.lua:392
0001 . . . . . TGETS 2 0 0 ; "scale"
0002 . . . . . TGETS 3 0 1 ; "entries"
0003 . . . . . SUBVV 3 1 3
0000 . . . . . . . FUNCC ; ffi.meta.__sub
---- TRACE 125 abort ctable.lua:394 -- bad argument type
This caused the compiler to blacklist crucial parts of the reassembly code.
After a bit of googling, I found a report that describes this symptom (https://experilous.com/1/blog/post/lua-optimizations-first-pass):
"This one was nasty, and the trace abort message looked nasty too: “bad argument type”. It turns out that LuaJIT really strongly dislikes taking the difference between two pointers that point to some type whose size is not a power of 2."
The remove_ptr method of ctable uses precisely this kind of pointer arithmetic:
function CTable:remove_ptr(entry)
local scale = self.scale
local index = entry - self.entries
...
Here, entry is an element of the hash table constructed by the make_entry_type() function in lib.ctable.
In the case of the fragmenter, the size of such an element is not a power of 2. I then applied a simple patch that pads the element to the next power of two:
local function make_entry_type(key_type, value_type)
local cache = entry_types[key_type]
if cache then
cache = cache[value_type]
if cache then return cache end
else
entry_types[key_type] = {}
end
local raw_size = ffi.sizeof(key_type) + ffi.sizeof(value_type) + 4
local padding = 2^math.ceil(math.log(raw_size)/math.log(2)) - raw_size
local ret = ffi.typeof([[struct {
uint32_t hash;
$ key;
$ value;
uint8_t padding[$];
} __attribute__((packed))]],
key_type,
value_type,
padding)
entry_types[key_type][value_type] = ret
return ret
end
This immediately got rid of the trace aborts and blacklisting, with a huge performance boost. So it seems to be true that LuaJIT behaves as described in the article, though I wasn't able to find the code in LuaJIT that actually does this.
I guess this is something that should go into our "performance hacks" notebook.
Instead of padding the data structure, one could also use a performance-safe replacement of the built-in subtraction metamethod (as described in the article).
While the fragmentation code runs much better now, it still suffers quite a bit from GC in my use case. Still work to do :)
Wow, that's nuts, good catch! I guess our main uses have been power-of-2 sized. Another possibility would be to hack LuaJIT to turn the division into a multiplication, perhaps: http://www.hackersdelight.org/divcMore.pdf.
Regarding reassembly, incidentally I had a properly extracted version of the IPv4 reassembler that I made on Monday, moments before my laptop was stolen :P Oh well.
I created raptorjit/raptorjit#85 to track the JIT issue; I think I have found the responsible line of code, at least so far.
Any plans to fix this in LuaJIT or are we waiting for the transition to RaptorJIT?