
[raptorjit] Poor performance on fourth packet-filtering selftest

Open wingo opened this issue 7 years ago • 7 comments

On upstream Snabb:

$  make && sudo ./snabb snsh -t apps.packet_filter.pcap_filter
make: 'snabb' is up to date.
selftest: pcap_filter
Run for 1 second (stateful = false)...
link report:
             641,333 sent on pcap_filter.output -> sink.input (loss rate: 0%)
          17,209,134 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 3.7267% of inputs (within tolerance)
Run for 1 second (stateful = true)...
link report:
             717,983 sent on pcap_filter.output -> sink.input (loss rate: 0%)
           9,632,982 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 7.4534% of inputs (within tolerance)
Run for 1 second (stateful = false)...
link report:
             564,233 sent on pcap_filter.output -> sink.input (loss rate: 0%)
          15,140,268 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 3.7267% of inputs (within tolerance)
Run for 1 second (stateful = true)...
link report:
             556,643 sent on pcap_filter.output -> sink.input (loss rate: 0%)
           7,468,338 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 7.4534% of inputs (within tolerance)
selftest: ok

Note that the performance results are quite variable; try it yourself. Now compare to raptorjit:

$ make && sudo ./snabb snsh -t apps.packet_filter.pcap_filter
make: 'snabb' is up to date.
selftest: pcap_filter
Run for 1 second (stateful = false)...
link report:
             654,597 sent on pcap_filter.output -> sink.input (loss rate: 0%)
          17,565,012 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 3.7267% of inputs (within tolerance)
Run for 1 second (stateful = true)...
link report:
             658,485 sent on pcap_filter.output -> sink.input (loss rate: 0%)
           8,834,730 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 7.4534% of inputs (within tolerance)
Run for 1 second (stateful = false)...
link report:
           1,017,710 sent on pcap_filter.output -> sink.input (loss rate: 0%)
          27,308,562 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 3.7267% of inputs (within tolerance)
Run for 1 second (stateful = true)...
link report:
                 295 sent on pcap_filter.output -> sink.input (loss rate: 0%)
               3,978 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
error: accepted 7.4158% (expected 7.453% +/- 0.00100)
apps/packet_filter/pcap_filter.lua:132: selftest failed

This poor performance number appears to be reproducible. I assume that the test failure simply indicates that not enough samples were collected.

wingo · Mar 20 '18 12:03

Blocks https://github.com/snabbco/snabb/pull/1264

wingo · Mar 20 '18 12:03

This bug is a bit weird. The selftest runs a test case for a while under each combination of stateful filtering on/off and native compilation on/off. The error I was initially seeing came from the fourth combination: stateful filtering on and native compilation on.

However, those conditions aren't necessary to reproduce the bug. Simply change the selftest to do this:

function selftest ()
   print("selftest: pcap_filter")
   for i=1,10 do
      selftest_run(false, 3.726, 0.0009)
   end
   print("selftest: ok")
end

That will run with the default Lua-generating pflua backend, and with no stateful filtering. It fails for me on the fourth iteration:

wingo@sparrow ~/src/raptorjit-snabb/src$ make && sudo strace -ttTv -o /tmp/log ./snabb snsh -t apps.packet_filter.pcap_filter
make: 'snabb' is up to date.
selftest: pcap_filter
Run for 1 second (stateful = false)...
link report:
             634,275 sent on pcap_filter.output -> sink.input (loss rate: 0%)
          17,019,720 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 3.7267% of inputs (within tolerance)
Run for 1 second (stateful = false)...
link report:
             503,751 sent on pcap_filter.output -> sink.input (loss rate: 0%)
          13,517,346 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 3.7267% of inputs (within tolerance)
Run for 1 second (stateful = false)...
link report:
             478,383 sent on pcap_filter.output -> sink.input (loss rate: 0%)
          12,836,598 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 3.7267% of inputs (within tolerance)
Run for 1 second (stateful = false)...
link report:
                 185 sent on pcap_filter.output -> sink.input (loss rate: 0%)
               4,998 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
error: accepted 3.7015% (expected 3.726% +/- 0.00090)
apps/packet_filter/pcap_filter.lua:157: selftest failed

wingo · Mar 22 '18 14:03

I confirm that the modified selftest works fine on master with luajit; no performance degradation and no blacklisting.

wingo · Mar 22 '18 14:03

I attempted to see whether recent heuristic changes in RaptorJIT were relevant.

I reverted these patches individually:

  • f7212ccb97ea39effb9c311b5a7affc8f66d3bf2
  • 01dc8448637f80151fa8de4ede765ff81efe7536
  • ce6fbb4240fa3d626b6a1eb8ddafd3d36b335751
  • f5d810e97c1d38f3ead28f965c328dc7eac487e9

However, the behavior was the same.

wingo · Mar 22 '18 14:03

I don't think I have time to solve this one until I am back from ONS. @lukego, perhaps it would make a good test for the raptorjit tooling -- without -jv I feel like I'm driving blind!

wingo · Mar 22 '18 14:03

Thanks for the report!

This problem seems to go away when merging the latest lukego/raptorjit-upstream branch. This could be (?) due to a new fix for a problem where RaptorJIT would start thrashing once the maxtraces threshold was reached, so performance would plummet once the 1000th trace (the default limit) was recorded.
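
(One way to test that hypothesis would be to raise the trace limit before running the selftest. This is only a sketch: it assumes RaptorJIT still exposes LuaJIT's jit.opt interface and its maxtrace option name, and the value 4000 is arbitrary.)

-- Hypothetical experiment: raise the trace limit well above the default of
-- 1000 so that trace thrashing, if it is the cause, cannot kick in.
require("jit.opt").start("maxtrace=4000")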

I want to take a closer look with the tooling; I just have to tweak the test to produce the audit.log. Currently that is bypassed because the test calls engine.breathe() directly instead of engine.main().
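
(A minimal sketch of what that tweak could look like, assuming the selftest's timing loop can simply be handed over to the engine; the exact loop in pcap_filter.lua may differ, and run_for_one_second is a hypothetical helper name.)

local engine = require("core.app")

-- Hypothetical replacement for a hand-rolled engine.breathe() loop: let
-- engine.main() drive the configured apps for one second and print the link
-- report, so the normal engine.main() path (which, per the comment above,
-- is what produces the audit.log) is exercised.
local function run_for_one_second ()
   engine.main({duration = 1, report = {showlinks = true}})
end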

lukego · Mar 23 '18 10:03

Thank you for this fix! I can confirm it seems to solve the problem.

wingo · Mar 23 '18 11:03