Trace aborts due to loop unroll limit and leaving loop in root trace
Creating a PR for the loop-related trace aborts I am seeing so that @lukego can take a look.
This PR is based on wingo-next with lukego/luajit-reset-penalty-cache merged in.
The test app is at program/jit_loop/ and a helper app is at apps/jit_loop/. The app is basically a bridge: it copies packets from one interface to another, rewriting the MAC addresses before sending them on to the next hop. Before running the app, edit the PCI info and MAC addresses in jit_loop.lua.
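For reference, a minimal sketch of the kind of app involved, assuming the standard core.link/core.packet APIs (hypothetical names and port names; the actual code lives in apps/jit_loop/):

local link = require("core.link")
local ffi  = require("ffi")

-- Hypothetical MAC-rewriting bridge app, not the exact apps/jit_loop/ code.
Rewrite = {}

function Rewrite:new (conf)
   -- conf.src_mac and conf.dst_mac are assumed to be 6-byte strings.
   return setmetatable({src_mac = conf.src_mac, dst_mac = conf.dst_mac},
                       {__index = Rewrite})
end

function Rewrite:push ()
   -- Port names "input"/"output" are assumed here.
   local input, output = self.input.input, self.output.output
   while not link.empty(input) do
      local p = link.receive(input)
      ffi.copy(p.data,     self.dst_mac, 6)  -- rewrite destination MAC (bytes 0..5)
      ffi.copy(p.data + 6, self.src_mac, 6)  -- rewrite source MAC (bytes 6..11)
      link.transmit(output, p)
   end
end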
I am running this as: sudo ./snabb snsh -jdump=+rsxaA,dump.txt -jtprof -p jit_loop
I am getting lots of trace aborts, as shown below:
---- TRACE 28 abort intel_mp.lua:787 -- failed to allocate mcode memory
---- TRACE 17 abort app.lua:511 -- leaving loop in root trace
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 22 abort app.lua:511 -- leaving loop in root trace
---- TRACE 30 abort intel_mp.lua:787 -- failed to allocate mcode memory
---- TRACE 17 abort app.lua:511 -- leaving loop in root trace
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 22 abort app.lua:511 -- leaving loop in root trace
---- TRACE 28 abort intel_mp.lua:787 -- failed to allocate mcode memory
---- TRACE 17 abort app.lua:511 -- leaving loop in root trace
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 21 abort app.lua:511 -- leaving loop in root trace
---- TRACE 28 abort intel_mp.lua:787 -- failed to allocate mcode memory
---- TRACE 16 abort app.lua:511 -- leaving loop in root trace
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 19 abort app.lua:512 -- loop unroll limit reached
---- TRACE 22 abort app.lua:511 -- leaving loop in root trace
---- TRACE 28 abort intel_mp.lua:787 -- failed to allocate mcode memory
---- TRACE 17 abort app.lua:511 -- leaving loop in root trace
---- TRACE 20 abort app.lua:512 -- loop unroll limit reached
---- TRACE 20 abort app.lua:512 -- loop unroll limit reached
---- TRACE 20 abort app.lua:512 -- loop unroll limit reached
---- TRACE 22 abort app.lua:511 -- leaving loop in root trace
---- TRACE 28 abort counter.lua:86 -- failed to allocate mcode memory
---- TRACE 15 abort app.lua:512 -- loop unroll limit reached
---- TRACE 15 abort app.lua:512 -- loop unroll limit reached
---- TRACE 15 abort app.lua:512 -- loop unroll limit reached
---- TRACE 15 abort app.lua:512 -- loop unroll limit reached
---- TRACE 17 abort app.lua:512 -- loop unroll limit reached
---- TRACE 17 abort app.lua:512 -- loop unroll limit reached
---- TRACE 17 abort app.lua:512 -- loop unroll limit reached
---- TRACE 21 abort app.lua:511 -- leaving loop in root trace
---- TRACE 38 abort intel_mp.lua:625 -- leaving loop in root trace
---- TRACE 40 abort ipv4_apps.lua:20 -- loop unroll limit reached
---- TRACE 40 abort ipv4_apps.lua:20 -- loop unroll limit reached
---- TRACE 40 abort ipv4_apps.lua:20 -- loop unroll limit reached
---- TRACE 40 abort ipv4_apps.lua:20 -- loop unroll limit reached
---- TRACE 49 abort intel_mp.lua:600 -- leaving loop in root trace
---- TRACE 70 abort intel_mp.lua:619 -- inner loop in root trace
---- TRACE 70 abort app.lua:511 -- leaving loop in root trace
---- TRACE 87 abort intel_mp.lua:619 -- inner loop in root trace
I have worked on this some more and have some additional observations. To prevent the "failed to allocate mcode memory" aborts I have increased the LuaJIT tunables in commit a268212.
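For reference, those limits are ordinary jit.opt parameters; a minimal sketch of that kind of change (the values here are illustrative, not necessarily the ones used in a268212):

-- Raise the machine-code area limits so that trace compilation does not
-- fail with "failed to allocate mcode memory". Values are illustrative only.
require("jit.opt").start(
   "sizemcode=256",  -- size of each machine code area, in KB
   "maxmcode=8192"   -- total machine code memory limit, in KB
)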
I am getting around 6 Mpps when running this, and perf top shows that most of the CPU time is spent in compiled traces. When running under snsh with the command sudo ./snabb snsh -jdump=+rsxaA,dump01.txt -jtprof -p jit_loop, performance drops to 2 Mpps.
dump01.txt is attached.
dump01.txt.gz
perf top shows the following when running under snsh:
12.65% perf-25499.map [.] 0x000000002a4d3a60
10.31% perf-25499.map [.] 0x000000002a4d5e06
7.18% perf-25499.map [.] 0x000000002a4d5e0a
3.85% perf-25499.map [.] 0x000000002a4d3a64
1.88% perf-25499.map [.] 0x000000002a4e237c
1.82% snabb [.] lj_tab_len
1.82% perf-25499.map [.] 0x000000002a4d3a6a
1.76% perf-25499.map [.] 0x000000002a4c9c7f
1.67% perf-25499.map [.] 0x000000002a4cc582
1.66% perf-25499.map [.] 0x000000002a4d5e10
1.41% perf-25499.map [.] 0x000000002a4d39da
1.21% perf-25499.map [.] 0x000000002a4d1e36
1.11% perf-25499.map [.] 0x000000002a4e2d1d
1.03% perf-25499.map [.] 0x000000002a4d5300
0.75% [vdso] [.] __vdso_clock_gettime
0.69% perf-25499.map [.] 0x000000002a4d5e31
0.66% perf-25499.map [.] 0x000000002a4d5d80
0.57% perf-25499.map [.] 0x000000002a4d3a8b
0.55% perf-25499.map [.] 0x000000002a4d20d6
0.51% [kernel] [k] native_write_msr_safe
0.50% perf-25499.map [.] 0x000000002a4d224d
0.48% perf-25499.map [.] 0x000000002a4c65b5
0.47% perf-25499.map [.] 0x000000002a4d433c
0.47% snabb [.] release_unused_segments
0.46% snabb [.] lj_tab_getinth
0.46% perf-25499.map [.] 0x000000002a4d1e6e
0.45% perf-25499.map [.] 0x000000002a4d1e73
0.40% snabb [.] lj_alloc_malloc
0.40% perf-25499.map [.] 0x000000002a4f09a9
0.39% snabb [.] lj_alloc_free
0.39% [kernel] [k] __switch_to
0.38% [kernel] [k] apic_timer_interrupt
0.37% perf-25499.map [.] 0x000000002a4f112c
0.35% perf-25499.map [.] 0x000000002a4d2029
0.34% [kernel] [k] __schedule
0.34% perf-25499.map [.] 0x000000002a4f0dd2
0.34% perf-25499.map [.] 0x000000002a4e3913
0.34% perf-25499.map [.] 0x000000002a4d20b6
0.34% perf-25499.map [.] 0x000000002a4ea2a7
0.33% perf-25499.map [.] 0x000000002a4f0f29
0.32% perf-25499.map [.] 0x000000002a4f0ba6
0.30% snabb [.] get_monotonic_time
0.28% perf-25499.map [.] 0x000000002a4d1e6a
0.28% [kernel] [k] enqueue_task_fair
0.27% perf-25499.map [.] 0x000000002a4bf504
0.27% libc-2.19.so [.] __clock_gettime
0.24% [kernel] [k] menu_select
After adding one more app in commit b7b14cc, performance dropped drastically to around 600 kpps. Most of the time snabb is now spending in interpreted code. I am wondering what could have caused all the traces to be aborted so that the code runs in interpreted mode.
Dump file is attached, dump02.txt.gz
perf top shows:
7.77% libc-2.19.so [.] vfprintf
5.57% snabb [.] lj_str_new
5.46% snabb [.] match
3.28% snabb [.] lj_strscan_scan
2.65% snabb [.] release_unused_segments
2.61% snabb [.] lj_alloc_free
2.40% perf-25587.map [.] 0x000000002a4d55db
2.21% libc-2.19.so [.] _IO_default_xsputn
2.06% perf-25587.map [.] 0x000000002a4d4ea5
2.04% perf-25587.map [.] 0x000000002a4d4ea9
1.36% snabb [.] match_class
1.34% snabb [.] max_expand
1.30% snabb [.] gc_sweep
1.23% perf-25587.map [.] 0x000000002a4d55df
1.15% snabb [.] str_find_aux
1.15% libc-2.19.so [.] _IO_vsprintf
0.97% libc-2.19.so [.] strchrnul
0.92% snabb [.] lj_alloc_malloc
0.73% perf-25587.map [.] 0x000000002a4c4f29
0.70% snabb [.] lj_cdata_free
0.69% perf-25587.map [.] 0x000000002a4c5858
0.66% perf-25587.map [.] 0x000000002a4d5555
0.64% perf-25587.map [.] 0x000000002a4e5d65
0.59% snabb [.] lj_tab_len
0.54% snabb [.] lj_lib_checkstr
0.53% snabb [.] lj_cdata_newv
0.50% snabb [.] lua_pushlstring
0.47% snabb [.] singlematch
0.45% libc-2.19.so [.] 0x00000000000460a0
0.42% snabb [.] lj_cont_stitch
0.40% perf-25587.map [.] 0x000000002a4e67fc
0.40% perf-25587.map [.] 0x000000002a4d4e1f
0.38% snabb [.] classend.isra.9
0.37% perf-25587.map [.] 0x000000002a4d55e5
0.37% perf-25587.map [.] 0x000000002a4cd9e8
0.36% snabb [.] lj_strscan_num
0.33% snabb [.] lj_mem_newgco
0.32% snabb [.] push_onecapture
0.29% [kernel] [k] native_write_msr_safe
0.29% [vdso] [.] __vdso_clock_gettime
0.29% snabb [.] lj_BC_FUNCC
0.29% perf-25587.map [.] 0x000000002a4ed2b4
0.29% snabb [.] lj_BC_JLOOP
0.28% perf-25587.map [.] 0x000000002a4d444e
0.28% [kernel] [k] __switch_to
0.28% perf-25587.map [.] 0x000000002a4d4eaf
0.27% perf-25587.map [.] 0x000000002a4ecf87
I am wondering which aborted trace could be behind the heavy use of interpreted code. How can that be identified? It may very well be that many trace aborts are normal and have no bearing on performance, but some could. Feeling kind of lost here :(
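One way to get a per-location tally of aborts, rather than reading the whole dump by hand, would be LuaJIT's (semi-documented) jit.attach trace hook. A rough sketch, untested here:

-- Count trace aborts per source location using the "trace" event hook
-- that jit.dump itself is built on.
local jutil  = require("jit.util")
local aborts = {}

jit.attach(function (what, tr, func, pc, otr, oex)
   if what == "abort" then
      local info = jutil.funcinfo(func, pc)
      local loc = tostring(info.source) .. ":" .. tostring(info.currentline)
      aborts[loc] = (aborts[loc] or 0) + 1
   end
end, "trace")

-- Later (e.g. at shutdown), print the most frequently aborting locations:
-- for loc, n in pairs(aborts) do print(n, loc) end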
Thank you for using a Pull Request to raise this issue. It is very helpful to have the exact code you are working with connected with the bug report. That way the source code line numbers in the trace can be looked up in the right code version. :+1:
The JIT log does not explicitly say what is being blacklisted, but it does include a lot of aborts for the same region of code, and that seems likely to be the problem (or in any event something worth understanding.)
130930:---- TRACE 112 start 95/18 intel_mp.lua:595
134405:---- TRACE 112 abort intel_mp.lua:596 -- loop unroll limit reached
134408:---- TRACE 112 start 95/18 intel_mp.lua:595
137883:---- TRACE 112 abort intel_mp.lua:596 -- loop unroll limit reached
137886:---- TRACE 112 start 95/18 intel_mp.lua:595
141361:---- TRACE 112 abort intel_mp.lua:596 -- loop unroll limit reached
141364:---- TRACE 112 start 95/18 intel_mp.lua:595
144839:---- TRACE 112 abort intel_mp.lua:596 -- loop unroll limit reached
My working hypothesis is that this code is being blacklisted and that this is causing some important code path to run interpreted. It could be that raptorjit/raptorjit#102 is the solution, i.e. make the JIT give up on the unrolling optimization after a while instead of giving up on the whole trace.
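For illustration, this is the general shape of code that runs into the unroll limit (schematic only, not the actual intel_mp code): a hot outer loop containing an inner loop whose trip count varies, so the compiler keeps trying to unroll the inner loop into the root trace and eventually aborts.

-- Schematic example of a loop nest prone to "loop unroll limit reached":
local function process_queues (nqueues, available)
   for q = 1, nqueues do            -- hot outer loop: becomes the root trace
      for i = 1, available[q] do    -- inner loop with a varying trip count
         -- per-packet work here; the JIT tries to unroll this into the
         -- outer trace and aborts once the unroll limit is exceeded
      end
   end
end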
We have seen other instances of blacklistings for code starting in intel_mp lately right? Does anybody already have some insight into these? (I have "swapped out" the traces that I was looking at before vacation but maybe somebody can refresh my memory? cc @alexandergall @wingo @eugeneia)
I am seeing quite a lot of loop-related trace aborts, not just in intel_mp but in other modules also. The code in this PR is just a small sample we can use for diagnosis.
@raj2569 Could you please try merging #1242 and post a dump? I hope this will make the JIT trace easier to read by breaking the traces up into smaller parts (always start a new trace when entering/exiting an app callback to avoid mixing too much code together.)
Merged lukego/jit-tracebarrier and ran the tests again. The dump file is attached: dump-with-1242.txt.gz
I am using this problem as a test case for improved profiler functionality. I have updated the profiler backend in raptorjit/raptorjit#124 with information that should point to a solution, and now I need to update the Studio frontend so that we can see what it means :).