
[raptorjit] Poor performance on fourth packet-filtering selftest

Open wingo opened this issue 7 years ago • 7 comments

On upstream Snabb:

$  make && sudo ./snabb snsh -t apps.packet_filter.pcap_filter
make: 'snabb' is up to date.
selftest: pcap_filter
Run for 1 second (stateful = false)...
link report:
             641,333 sent on pcap_filter.output -> sink.input (loss rate: 0%)
          17,209,134 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 3.7267% of inputs (within tolerance)
Run for 1 second (stateful = true)...
link report:
             717,983 sent on pcap_filter.output -> sink.input (loss rate: 0%)
           9,632,982 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 7.4534% of inputs (within tolerance)
Run for 1 second (stateful = false)...
link report:
             564,233 sent on pcap_filter.output -> sink.input (loss rate: 0%)
          15,140,268 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 3.7267% of inputs (within tolerance)
Run for 1 second (stateful = true)...
link report:
             556,643 sent on pcap_filter.output -> sink.input (loss rate: 0%)
           7,468,338 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 7.4534% of inputs (within tolerance)
selftest: ok

Note that the performance results are quite variable; try it yourself. Now compare to raptorjit:

$ make && sudo ./snabb snsh -t apps.packet_filter.pcap_filter
make: 'snabb' is up to date.
selftest: pcap_filter
Run for 1 second (stateful = false)...
link report:
             654,597 sent on pcap_filter.output -> sink.input (loss rate: 0%)
          17,565,012 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 3.7267% of inputs (within tolerance)
Run for 1 second (stateful = true)...
link report:
             658,485 sent on pcap_filter.output -> sink.input (loss rate: 0%)
           8,834,730 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 7.4534% of inputs (within tolerance)
Run for 1 second (stateful = false)...
link report:
           1,017,710 sent on pcap_filter.output -> sink.input (loss rate: 0%)
          27,308,562 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 3.7267% of inputs (within tolerance)
Run for 1 second (stateful = true)...
link report:
                 295 sent on pcap_filter.output -> sink.input (loss rate: 0%)
               3,978 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
error: accepted 7.4158% (expected 7.453% +/- 0.00100)
apps/packet_filter/pcap_filter.lua:132: selftest failed

This poor performance number appears to be reproducible. I assume that the test failure simply indicates that not enough samples were collected.

wingo · Mar 20 '18 12:03

Blocks https://github.com/snabbco/snabb/pull/1264

wingo · Mar 20 '18 12:03

This bug is a bit weird. The selftest runs a test case for a while under each combination of stateful filtering on/off and native compilation on/off. The error I was initially seeing came from the fourth combination: stateful filtering on and native compilation on.

However, those conditions aren't necessary to reproduce the bug. Simply change the selftest to do this:

function selftest ()
   print("selftest: pcap_filter")
   for i=1,10 do
      selftest_run(false, 3.726, 0.0009)
   end
   print("selftest: ok")
end

That will run with the default Lua-generating pflua backend, and with no stateful filtering. It fails for me on the fourth iteration:

wingo@sparrow ~/src/raptorjit-snabb/src$ make && sudo strace -ttTv -o /tmp/log ./snabb snsh -t apps.packet_filter.pcap_filter
make: 'snabb' is up to date.
selftest: pcap_filter
Run for 1 second (stateful = false)...
link report:
             634,275 sent on pcap_filter.output -> sink.input (loss rate: 0%)
          17,019,720 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 3.7267% of inputs (within tolerance)
Run for 1 second (stateful = false)...
link report:
             503,751 sent on pcap_filter.output -> sink.input (loss rate: 0%)
          13,517,346 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 3.7267% of inputs (within tolerance)
Run for 1 second (stateful = false)...
link report:
             478,383 sent on pcap_filter.output -> sink.input (loss rate: 0%)
          12,836,598 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
ok: accepted 3.7267% of inputs (within tolerance)
Run for 1 second (stateful = false)...
link report:
                 185 sent on pcap_filter.output -> sink.input (loss rate: 0%)
               4,998 sent on repeater.output -> pcap_filter.input (loss rate: 0%)
                 161 sent on source.output -> repeater.input (loss rate: 0%)
error: accepted 3.7015% (expected 3.726% +/- 0.00090)
apps/packet_filter/pcap_filter.lua:157: selftest failed

wingo · Mar 22 '18 14:03

I confirm that the modified selftest works fine on master with luajit; no performance degradation and no blacklisting.

wingo · Mar 22 '18 14:03

I attempted to see whether recent heuristic changes in RaptorJIT were relevant.

I reverted these patches individually:

  • f7212ccb97ea39effb9c311b5a7affc8f66d3bf2
  • 01dc8448637f80151fa8de4ede765ff81efe7536
  • ce6fbb4240fa3d626b6a1eb8ddafd3d36b335751
  • f5d810e97c1d38f3ead28f965c328dc7eac487e9

However, the behavior was the same.

wingo · Mar 22 '18 14:03

I don't think I have time to solve this one until I am back from ONS. @lukego, perhaps it would make a good test for the raptorjit tooling -- without -jv I feel like I'm driving blind!

wingo · Mar 22 '18 14:03

Thanks for the report!

This problem seems to go away when merging the latest lukego/raptorjit-upstream branch. This could be (?) due to a new fix for a problem where RaptorJIT would start thrashing once the maxtraces threshold was reached, so performance would plummet once the 1000th trace (the default limit) was recorded.
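
(One way to test that hypothesis would be to raise the trace limit before running the selftest. This is only a sketch: it assumes RaptorJIT still exposes LuaJIT's jit.opt interface and its maxtrace option name, and the value 4000 is arbitrary.)

-- Hypothetical experiment: raise the trace limit well above the default of
-- 1000 so that trace thrashing, if it is the cause, cannot kick in.
require("jit.opt").start("maxtrace=4000")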

I want to take a closer look with the tooling; I just have to tweak the test to produce the audit.log. Currently that is bypassed because the test calls engine.breathe() directly instead of engine.main().
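
(A minimal sketch of what that tweak could look like, assuming the selftest's timing loop can simply be handed over to the engine; the exact loop in pcap_filter.lua may differ, and run_for_one_second is a hypothetical helper name.)

local engine = require("core.app")

-- Hypothetical replacement for a hand-rolled engine.breathe() loop: let
-- engine.main() drive the configured apps for one second and print the link
-- report, so the normal engine.main() path (which, per the comment above,
-- is what produces the audit.log) is exercised.
local function run_for_one_second ()
   engine.main({duration = 1, report = {showlinks = true}})
end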

lukego · Mar 23 '18 10:03

Thank you for this fix! I can confirm it seems to solve the problem.

wingo · Mar 23 '18 11:03