snabb
ctable performance issue related to pointer arithmetic
I started using ctable indirectly through the IPv6 fragmentation code from apps.lwaftr and was surprised to see much lower performance than expected on the reassembly side. The profiler showed substantial amounts of interpreted code and GC occurring in the fast path.
A trace dump revealed many instances of trace aborts like this one:
...
0014 . . . . TGETS 4 0 3 ; "remove_ptr"
0015 . . . . MOV 6 3
0016 . . . . CALL 4 1 3
0000 . . . . . FUNCF 7 ; ctable.lua:392
0001 . . . . . TGETS 2 0 0 ; "scale"
0002 . . . . . TGETS 3 0 1 ; "entries"
0003 . . . . . SUBVV 3 1 3
0000 . . . . . . . FUNCC ; ffi.meta.__sub
---- TRACE 125 abort ctable.lua:394 -- bad argument type
This caused the compiler to blacklist crucial parts of the reassembly code.
After a bit of googling, I found a report that describes this symptom (https://experilous.com/1/blog/post/lua-optimizations-first-pass):
"This one was nasty, and the trace abort message looked nasty too: “bad argument type”. It turns out that LuaJIT really strongly dislikes taking the difference between two pointers that point to some type whose size is not a power of 2."
The remove_ptr method of ctable uses precisely this kind of pointer arithmetic:
function CTable:remove_ptr(entry)
local scale = self.scale
local index = entry - self.entries
...
Here, entry is an element of the hash table constructed by the make_entry_type() function in lib.ctable.
In the case of the fragmenter, the size of such an element is not a power of 2. I then applied a simple patch that pads the element to the next power of two:
local function make_entry_type(key_type, value_type)
local cache = entry_types[key_type]
if cache then
cache = cache[value_type]
if cache then return cache end
else
entry_types[key_type] = {}
end
local raw_size = ffi.sizeof(key_type) + ffi.sizeof(value_type) + 4
local padding = 2^math.ceil(math.log(raw_size)/math.log(2)) - raw_size
local ret = ffi.typeof([[struct {
uint32_t hash;
$ key;
$ value;
uint8_t padding[$];
} __attribute__((packed))]],
key_type,
value_type,
padding)
entry_types[key_type][value_type] = ret
return ret
end
This immediately got rid of the trace aborts and blacklisting, with a huge performance boost. So it seems to be true that LuaJIT behaves as described in the article, though I wasn't able to find the code in LuaJIT that actually does this.
I guess this is something that should go into our "performance hacks" notebook.
Instead of padding the data structure, one could also use a performance-safe replacement of the built-in subtraction metamethod (as described in the article).
While the fragmentation code runs much better now, it still suffers quite a bit from GC in my use case. Still work to do :)
Wow, that's nuts, good catch! I guess our main uses have been power-of-2 sized. Another possibility would be to hack LuaJIT to turn the division into a multiplication, perhaps: http://www.hackersdelight.org/divcMore.pdf.
Regarding reassembly, incidentally I had a properly extracted version of the IPv4 reassembler that I made on Monday, moments before my laptop was stolen :P Oh well.
I created raptorjit/raptorjit#85 to track the JIT issue; I think I have found the responsible line of code, at least so far.
Any plans to fix this in LuaJIT or are we waiting for the transition to RaptorJIT?