Collect warmup trace (using L0_filter) followed by a full instruction trace
Add a pointer to any prior users list discussion: https://groups.google.com/g/dynamorio-users/c/fn6LBUxTd2o/m/aevZa4ZRAAAJ
Is your feature request related to a problem? Please describe.
It might be useful to have a "pre-trace" that can be used to warm up large caches before we start simulating instruction traces.
Describe the solution you'd like
We can have an "L0 warmup" phase just like how we have a fast forward phase (-trace_after_instrs) currently. During this phase, the L0_filter would be active. Once we reach the specified instruction count, we would switch to the regular trace instrumentation.
An example run might look something like this:
drrun -t drcachesim -trace_after_instrs 1000M -L0_warmup_refs 1M -offline -- ../app
Do you have any implementation in mind for this feature?
This could be implemented with an approach similar to the -trace_after_instrs feature. We start off with the L0_filter instrumentation, and then flush the code cache and enable regular trace instrumentation when the switching criterion is reached.
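As a rough sketch of the switching criterion (hypothetical names and structure, not the actual tracer code), the counting and one-time switch could look like:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch: models the two-phase design only. During warmup the
// L0_filter instrumentation counts references; the thread that crosses the
// threshold would trigger the code cache flush and re-instrumentation.
enum class Phase { kL0Warmup, kFullTrace };

static std::atomic<uint64_t> warmup_refs{0};
static std::atomic<Phase> phase{Phase::kL0Warmup};

// Called per reference from (hypothetical) L0-filter instrumentation.
// Returns true exactly once, for the thread that crosses the threshold
// and should initiate the flush.
bool count_warmup_ref(uint64_t l0_warmup_refs_threshold) {
    if (phase.load(std::memory_order_relaxed) != Phase::kL0Warmup)
        return false;
    uint64_t count = warmup_refs.fetch_add(1, std::memory_order_relaxed) + 1;
    if (count == l0_warmup_refs_threshold) {
        phase.store(Phase::kFullTrace, std::memory_order_release);
        return true; // this thread initiates the code cache flush.
    }
    return false;
}

Phase current_phase() { return phase.load(std::memory_order_acquire); }
```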
Describe alternatives you've considered
We could just collect a really long trace, but that might have higher tracing overhead, higher disk usage, etc. We could use DrBBDup (#4134), but there are some TODOs still open for this feature.
Additional context
For an online trace, how are the two phases distinguished in analysis tools? Is a new instance of drcachesim spun up for the switch? Or are the trace entries just run together -- with some kind of marker in between, like the offline file type marker?
The same question applies for an offline trace, though here it's easier to have completely separate output files. I assume that's the plan.
Yes, for offline traces we will have separate files. I have been focusing on offline traces and hadn't thought about online traces. I think the marker idea sounds good -- that way the analyzer has control over what it wants to do with data from the different phases.
While switching instrumentation, I need to ensure that the regs are restored. Therefore, I tried to change clean_call to use DR_CLEANCALL_READS_APP_CONTEXT. After making this change, I ran into an assert: "DynamoRIO usage error : encode_cti error: target beyond 8-bit reach".
This is the change I made:
diff --git a/clients/drcachesim/tracer/tracer.cpp b/clients/drcachesim/tracer/tracer.cpp
index c97ed93..4071b14 100644
--- a/clients/drcachesim/tracer/tracer.cpp
+++ b/clients/drcachesim/tracer/tracer.cpp
@@ -843,8 +843,11 @@ instrument_clean_call(void *drcontext, instrlist_t *ilist, instr_t *where,
reg_id_t reg_tmp =
insert_conditional_skip(drcontext, ilist, where, reg_ptr, skip_call,
short_reaches, app_regs_at_skip_call);
- dr_insert_clean_call_ex(drcontext, ilist, where, (void *)clean_call,
- DR_CLEANCALL_ALWAYS_OUT_OF_LINE, 0);
+ dr_insert_clean_call_ex(
+ drcontext, ilist, where, (void *)clean_call,
+ static_cast<dr_cleancall_save_t>(DR_CLEANCALL_READS_APP_CONTEXT |
+ DR_CLEANCALL_ALWAYS_OUT_OF_LINE),
+ 0);
insert_conditional_skip_target(drcontext, ilist, where, skip_call, reg_tmp,
app_regs_at_skip_call);
insert_conditional_skip_target(drcontext, ilist, where, skip_thread, reg_thread,
It looks like the generated code is bigger now due to the additional GPR restore code, and the jnle instruction is using an 8-bit encoding and can no longer reach its target. Do we need to rewrite the instruction with a different jnle variant that can handle this displacement? Have we run into a similar problem before?
No, there won't be any problems with application branches. It is a tool-inserted short jump. You can see it here:
+240 m4 @0x00007fb179b75728 e3 fe jrcxz @0x00007fb179b6f7c8[8byte] %rcx
...
+403 m4 @0x00007fb179b6f7c8 <label>
Probably that is the tracer skipping the clean call unless the buffer is full.
Xref #56 but that requires flags.
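For background, the reach constraint is just the signed 8-bit displacement range of short branches (jrcxz in particular has only a rel8 form, so it cannot simply be widened to a rel32 encoding):

```cpp
#include <cstdint>

// Background sketch: an 8-bit relative branch encodes a signed displacement
// from the end of the instruction, so it can reach only [-128, +127] bytes.
// When inserted instrumentation grows past that, the short branch can no
// longer reach its target and encoding fails (the encode_cti assert above).
bool fits_rel8(int64_t displacement) {
    return displacement >= -128 && displacement <= 127;
}
```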
Thanks! I have set short_reaches to false for now. Now I am running into cases where some GPRs appear to be clobbered.
For example, the value in rdx here is clearly bad.
before instrumentation:
TAG 0x00007fbbcc372f4b
+0 L3 @0x00007fb988452470 48 89 14 c6 mov %rdx -> (%rsi,%rax,8)[8byte]
+4 L3 @0x00007fb98844e980 48 83 c2 10 add $0x0000000000000010 %rdx -> %rdx
+8 L3 @0x00007fb9884529d8 48 8b 02 mov (%rdx)[8byte] -> %rax
+11 L3 @0x00007fb9884514c8 48 85 c0 test %rax %rax
+14 L3 @0x00007fb98844e818 74 35 jz $0x00007fbbcc372f90
END 0x00007fbbcc372f4b
drreg_restore_app_value @0.0x00007fbbcc372f59: reg rax already native
drreg_restore_app_value @0.0x00007fbbcc372f59: reg rdx already native
..
drreg_restore_app_value @0.0x00007fbbcc372f59: reg rsi already native
priv_mcontext_t @0x00007fb988424080
xax = 0x000000000000000e
xbx = 0x000000006ffffeff
xcx = 0x000000037ffffb70
xdx = 0x2004000800001f5b
opnd_compute_address for: (%rdx)
base => 0x2004000800001f6b
index,scale => 0x2004000800001f6b
disp => 0x2004000800001f6b
For SIGSEGV at cache pc 0x00007fbb886015d2, computed target read 0x2004000800001f6b
faulting instr: mov (%rdx)[8byte] -> %rax
** Received SIGSEGV at cache pc 0x00007fbb886015d2 in thread 55106
Looking back a few blocks, rdx was native here
before instrumentation:
TAG 0x00007fbbcc372ea0
+0 L3 @0x00007fb9884504e8 55 push %rbp %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
+1 L3 @0x00007fb988450618 48 89 e5 mov %rsp -> %rbp
+4 L3 @0x00007fb988450550 41 57 push %r15 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
+6 L3 @0x00007fb988450468 41 56 push %r14 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
+8 L3 @0x00007fb98844dc20 41 55 push %r13 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
+10 L3 @0x00007fb98844dca0 41 54 push %r12 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
+12 L3 @0x00007fb98844e6d0 49 89 fc mov %rdi -> %r12
+15 L3 @0x00007fb98844eb20 53 push %rbx %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
+16 L3 @0x00007fb98844e0d0 48 83 ec 48 sub $0x0000000000000048 %rsp -> %rsp
+20 L3 @0x00007fb98844e900 0f 31 rdtsc -> %edx %eax
+22 L3 @0x00007fb9884503e8 48 c1 e2 20 shl $0x0000000000000020 %rdx -> %rdx
+26 L3 @0x00007fb988450368 89 c0 mov %eax -> %eax
+28 L3 @0x00007fb988450300 48 09 c2 or %rax %rdx -> %rdx
+31 L3 @0x00007fb98844ebb8 48 8b 05 a2 7f 22 00 mov <rel> 0x00007fbbcc59ae68[8byte] -> %rax
+38 L3 @0x00007fb98844ea50 48 89 15 53 78 22 00 mov %rdx -> <rel> 0x00007fbbcc59a720[8byte]
+45 L3 @0x00007fb98844e448 48 8d 15 94 7f 22 00 lea <rel> 0x00007fbbcc59ae68 -> %rdx
+52 L3 @0x00007fb98844e2c8 49 89 d6 mov %rdx -> %r14
+55 L3 @0x00007fb98844e9e8 4c 2b 35 22 81 22 00 sub <rel> 0x00007fbbcc59b000[8byte] %r14 -> %r14
+62 L3 @0x00007fb98844e980 48 85 c0 test %rax %rax
+65 L3 @0x00007fb98844e818 48 89 15 18 8b 22 00 mov %rdx -> <rel> 0x00007fbbcc59ba00[8byte]
+72 L3 @0x00007fb98844da20 4c 89 35 01 8b 22 00 mov %r14 -> <rel> 0x00007fbbcc59b9f0[8byte]
+79 L3 @0x00007fb98844e518 0f 84 9b 00 00 00 jz $0x00007fbbcc372f90
END 0x00007fbbcc372ea0
after instrumentation:
...
+274 L3 @0x00007fb98844e818 48 89 15 18 8b 22 00 mov %rdx -> <rel> 0x00007fbbcc59ba00[8byte]
...
END 0x00007fbbcc372ea0
and the next block doesn't touch it
before instrumentation:
TAG 0x00007fbbcc372ef5
+0 L3 @0x00007fb988452670 48 8d 3d 64 81 22 00 lea <rel> 0x00007fbbcc59b060 -> %rdi
+7 L3 @0x00007fb98844e518 49 b9 d0 03 00 80 03 mov $0x00000003800003d0 -> %r9
00 00 00
+17 L3 @0x00007fb9884525f0 48 b9 70 fb ff 7f 03 mov $0x000000037ffffb70 -> %rcx
00 00 00
+27 L3 @0x00007fb988451768 41 b8 ff ff ff 6f mov $0x6fffffff -> %r8d
+33 L3 @0x00007fb9884518b8 41 bb ff fd ff 6f mov $0x6ffffdff -> %r11d
+39 L3 @0x00007fb988451850 bb ff fe ff 6f mov $0x6ffffeff -> %ebx
+44 L3 @0x00007fb988451df8 49 01 f9 add %rdi %r9 -> %r9
+47 L3 @0x00007fb988451c90 41 ba 31 00 00 00 mov $0x00000031 -> %r10d
+53 L3 @0x00007fb988451c28 48 01 cf add %rcx %rdi -> %rdi
+56 L3 @0x00007fb988452570 48 8d 35 fc 8a 22 00 lea <rel> 0x00007fbbcc59ba30 -> %rsi
+63 L3 @0x00007fb9884524f0 eb 25 jmp $0x00007fbbcc372f5b
END 0x00007fbbcc372ef5
...
priv_mcontext_t @0x00007fb988424080
xax = 0x000000000000000e
xbx = 0x0000000000000000
xcx = 0x0000000000000000
xdx = 0x00007fbbcc59ae68
It is used by instrumentation in the next block but is restored
before instrumentation:
TAG 0x00007fbbcc372f5b
+0 L3 @0x00007fb9884524f0 48 83 f8 21 cmp %rax $0x0000000000000021
+4 L3 @0x00007fb98844e980 76 ea jbe $0x00007fbbcc372f4b
END 0x00007fbbcc372f5b
+18 m4 @0x00007fb988451920 65 48 89 14 25 e8 00 mov %rdx -> %gs:0x000000e8[8byte]
+97 m4 @0x00007fb988451b58 65 48 8b 14 25 e8 00 mov %gs:0x000000e8[8byte] -> %rdx
priv_mcontext_t @0x00007fb988424080
xax = 0x000000000000000e
xbx = 0x000000006ffffeff
xcx = 0x000000037ffffb70
xdx = 0x00007fbbcc59ae68
Full log attached: log.0.55106.html.txt
I haven't looked at the details, but if there is internal instrumentation control flow, be sure that all paths are parallel: typically by putting all spills before any branches and all restores after all branches. That is one potential source of problems.
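A toy illustration of that parallel-paths rule (pure C++ simulation, not DR instrumentation): every path that clobbers a spilled register must pass through the same restore at the join point.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical sketch of the "parallel paths" rule for internal
// instrumentation control flow: all spills go before any internal branch,
// and all restores go after all branches have rejoined, so every path
// through the instrumentation sees the same spill state.
struct SpillSlot { std::optional<uint64_t> saved; };

uint64_t run_block(uint64_t app_rdx, bool take_skip_branch) {
    SpillSlot slot;
    // All spills first, before any internal branch.
    slot.saved = app_rdx;
    uint64_t rdx = 0xDEADBEEF; // tool value clobbers the register.
    if (!take_skip_branch) {
        // Non-skip path: more tool work using rdx...
        rdx ^= 0x1;
    }
    // Join point: restore after all branches, executed on every path.
    rdx = *slot.saved;
    return rdx; // app value is back regardless of the path taken.
}
```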
Ok I see now that rdx is not restored when this jrcxz branch is taken. So ideally it should be restored after the jrcxz+jmp and before it gets to jbe?
+69 m4 @0x00007fb988451700 e3 fe jrcxz @0x00007fb98844dee8[8byte] %rcx
+74 m4 @0x00007fb98844dee8 <label>
+74 m4 @0x00007fb988452c58 e9 fb ff ff ff jmp @0x00007fb988452ad8[8byte]
+219 m4 @0x00007fb988452ad8 <label>
+219 m4 @0x00007fb98844e818 <label>
+219 m4 @0x00007fb9884514c8 <label>
+219 m4 @0x00007fb9884529d8 65 48 8b 0c 25 e0 00 mov %gs:0x000000e0[8byte] -> %rcx
00 00
+228 L3 @0x00007fb98844e980 76 ea jbe $0x00007fbbcc372f4b
Without DR_CLEANCALL_READS_APP_CONTEXT, rcx and rdx are not restored before the clean call (which makes sense), but both are restored before the last instruction in the basic block.
With DR_CLEANCALL_READS_APP_CONTEXT, rcx and rdx are restored before the clean call, but only rcx is restored before the last instruction which leaves the tool value in rdx causing a segfault in the next basic block where it is used.
Is this due to incorrect handling in the tracer or in the instrumentation API?
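To make the failure mode concrete, here is a hypothetical C++ simulation of the two runtime paths through this block (values illustrative): the path that skips the clean call via jrcxz never executes the restore that drreg's linear bookkeeping believes has already happened.

```cpp
#include <cstdint>

// Hypothetical simulation of the block at TAG ...f5b. drreg's linear view of
// the instruction list believes rdx was restored at the clean call, so it
// emits no end-of-block restore -- which is only correct on the path that
// actually reaches the clean call (i.e. when the buffer is full).
uint64_t final_rdx(bool buffer_full, uint64_t app_rdx) {
    uint64_t rdx = app_rdx;
    uint64_t spill_slot = rdx;      // spill rdx
    rdx = 0x2004000800001f5b;       // tool value (offline PC entry header)
    if (buffer_full) {
        // Path reaching the clean call: READS_APP_CONTEXT restores rdx here.
        rdx = spill_slot;
        // ... clean call runs ...
    }
    // The jrcxz skip path jumps straight here; no end-of-block restore of
    // rdx is emitted because drreg's linear view says it is already native.
    return rdx;
}
```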
instrument_basic_block ******************
before instrumentation:
TAG 0x00007fbbcc372f5b
+0 L3 @0x00007fb9884524f0 48 83 f8 21 cmp %rax $0x0000000000000021
+4 L3 @0x00007fb98844e980 76 ea jbe $0x00007fbbcc372f4b
END 0x00007fbbcc372f5b
drreg_event_bb_analysis @0.0x00007fbbcc372f5f: rax=1 rcx=1 rdx=1 rbx=1 rsp=1 rbp=1 rsi=1 rdi=1 r8=1 r9=1 r10=1 r11=1 r12=1 r13=1 r14=1 r15=1 flags=287
drreg_event_bb_analysis @1.0x00007fbbcc372f5b: rax=1 rcx=1 rdx=1 rbx=1 rsp=1 rbp=1 rsi=1 rdi=1 r8=1 r9=1 r10=1 r11=1 r12=1 r13=1 r14=1 r15=1 flags=0
drreg_reserve_reg_internal @1.0x00007fbbcc372f5b: spilling rcx to slot 0
spill_reg @1.0x00007fbbcc372f5b rcx 0
mutex_delete 0 lock 0x00007fb9883e5ad0: name=dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
rank=43 owner=0 owning_dc=0x0000000000000000 contended_event=0xffffffff prev=0x0000000000000000
lock 0 0 0 0 0+2 dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
remove_process_lock 0 lock 0x00007fb9883e5ad0: name=dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
rank=43 owner=0 owning_dc=0x0000000000000000 contended_event=0xffffffff prev=0x0000000000000000
lock 0 0 0 0 0+2 dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
never acquired
drreg_reserve_reg_internal @1.0x00007fbbcc372f5b: spilling rdx to slot 1
spill_reg @1.0x00007fbbcc372f5b rdx 1
drreg_unreserve_register @1.0x00007fbbcc372f5b rdx
drreg_unreserve_register @1.0x00007fbbcc372f5b rcx
drreg_reserve_reg_internal @0.0x00007fbbcc372f5f: using un-restored rcx slot 0
drreg_reserve_reg_internal @0.0x00007fbbcc372f5f: rcx already spilled to slot 0
mutex_delete 0 lock 0x00007fb9883e5ad0: name=dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
rank=43 owner=0 owning_dc=0x0000000000000000 contended_event=0xffffffff prev=0x0000000000000000
lock 0 0 0 0 0+2 dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
remove_process_lock 0 lock 0x00007fb9883e5ad0: name=dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
rank=43 owner=0 owning_dc=0x0000000000000000 contended_event=0xffffffff prev=0x0000000000000000
lock 0 0 0 0 0+2 dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
never acquired
CLEANCALL: insert clean call to 0x00007fbb883b8898
drreg_event_clean_call_insertion: restoring for cleancall to read app regs
drreg_insert_restore_all @0.0x0000000000000000: restoring rcx for app read
spill_reg @0.0x0000000000000000 rcx 2
restore_reg @0.0x0000000000000000 rcx slot=0 release=0
restore_reg @0.0x00007fbbcc372f5f rcx slot=2 release=1
drreg_insert_restore_all @0.0x0000000000000000: lazily restoring rdx
drreg_restore_reg_now @0.0x0000000000000000: restoring rdx
restore_reg @0.0x0000000000000000 rdx slot=1 release=1
insert_parameter_preparation: 0 args, 6 in-reg, 0 pre-push, 0/0 stack
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rax already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rbx already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rsp already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rbp already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rsi already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rdi already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r8 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r9 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r10 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r11 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r12 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r13 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r14 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r15 already native
drreg_unreserve_register @0.0x00007fbbcc372f5f rcx
drreg_insert_restore_all @0.0x00007fbbcc372f5f: lazily restoring rcx
drreg_restore_reg_now @0.0x00007fbbcc372f5f: restoring rcx
restore_reg @0.0x00007fbbcc372f5f rcx slot=0 release=1
after instrumentation:
TAG 0x00007fbbcc372f5b
+0 m4 @0x00007fb988451ff8 65 48 89 0c 25 e0 00 mov %rcx -> %gs:0x000000e0[8byte]
00 00
+9 m4 @0x00007fb988451ef8 65 48 8b 0c 25 20 01 mov %gs:0x00000120[8byte] -> %rcx
00 00
+18 m4 @0x00007fb988451920 65 48 89 14 25 e8 00 mov %rdx -> %gs:0x000000e8[8byte] # spill rdx
00 00
+27 m4 @0x00007fb988451a70 48 ba 5b 1f 00 00 08 mov $0x2004000800001f5b -> %rdx # populate OFFLINE_TYPE_PC
00 04 20
+37 m4 @0x00007fb98844dd98 48 89 11 mov %rdx -> (%rcx)[8byte] # save to buffer
+40 m4 @0x00007fb98844da20 <label>
+40 m4 @0x00007fb988451530 48 8d 49 08 lea 0x08(%rcx) -> %rcx
+44 m4 @0x00007fb9884515b0 65 48 89 0c 25 20 01 mov %rcx -> %gs:0x00000120[8byte]
00 00
+53 m4 @0x00007fb988451e78 <label>
+53 L3 @0x00007fb9884524f0 48 83 f8 21 cmp %rax $0x0000000000000021
+57 m4 @0x00007fb988452a58 65 48 8b 0c 25 20 01 mov %gs:0x00000120[8byte] -> %rcx
00 00
+66 m4 @0x00007fb988452b58 48 8b 09 mov (%rcx)[8byte] -> %rcx
+69 m4 @0x00007fb988451700 e3 fe jrcxz @0x00007fb98844dee8[8byte] %rcx # jumps to end without restoring rdx
+71 m4 @0x00007fb988451618 eb fe jmp @0x00007fb988451680[8byte]
+73 m4 @0x00007fb988452bd8 90 nop
+74 m4 @0x00007fb98844dee8 <label>
+74 m4 @0x00007fb988452c58 e9 fb ff ff ff jmp @0x00007fb988452ad8[8byte]
+79 m4 @0x00007fb988451680 <label>
+79 m4 @0x00007fb988451a08 65 48 89 0c 25 f0 00 mov %rcx -> %gs:0x000000f0[8byte]
00 00
+88 m4 @0x00007fb988451af0 65 48 8b 0c 25 e0 00 mov %gs:0x000000e0[8byte] -> %rcx
00 00
+97 m4 @0x00007fb988451b58 65 48 8b 14 25 e8 00 mov %gs:0x000000e8[8byte] -> %rdx # restore rdx
00 00
+106 m4 @0x00007fb9884519a0 65 48 a3 00 00 00 00 mov %rax -> %gs:0x00[8byte]
00 00 00 00
+117 m4 @0x00007fb988451bc0 65 48 a1 20 00 00 00 mov %gs:0x20[8byte] -> %rax
00 00 00 00
+128 m4 @0x00007fb988451d78 48 89 60 18 mov %rsp -> 0x18(%rax)[8byte]
+132 m4 @0x00007fb988451f78 48 8b a0 48 09 00 00 mov 0x00000948(%rax)[8byte] -> %rsp
+139 m4 @0x00007fb988452078 65 48 a1 00 00 00 00 mov %gs:0x00[8byte] -> %rax
00 00 00 00
+150 m4 @0x00007fb988452670 48 8d a4 24 f8 f6 ff lea 0xfffff6f8(%rsp) -> %rsp
ff
+158 m4 @0x00007fb988451c28 e8 0b 25 1b 00 call $0x00007fbb885fce00 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
+163 m4 @0x00007fb988451c90 <label>
+163 m4 @0x00007fb988451df8 e8 a3 df f6 ff call $0x00007fbb883b8898 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
+168 m4 @0x00007fb988451850 e8 0b 26 1b 00 call $0x00007fbb885fcf00 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
+173 m4 @0x00007fb9884518b8 65 48 a3 00 00 00 00 mov %rax -> %gs:0x00[8byte]
00 00 00 00
+184 m4 @0x00007fb988451768 65 48 a1 20 00 00 00 mov %gs:0x20[8byte] -> %rax
00 00 00 00
+195 m4 @0x00007fb9884525f0 48 8b 60 18 mov 0x18(%rax)[8byte] -> %rsp
+199 m4 @0x00007fb98844e518 65 48 a1 00 00 00 00 mov %gs:0x00[8byte] -> %rax
00 00 00 00
+210 m4 @0x00007fb988452cd8 <label>
+210 m4 @0x00007fb9884517d0 <label>
+210 m4 @0x00007fb988451cf8 65 48 8b 0c 25 f0 00 mov %gs:0x000000f0[8byte] -> %rcx
00 00
+219 m4 @0x00007fb988452ad8 <label>
+219 m4 @0x00007fb98844e818 <label>
+219 m4 @0x00007fb9884514c8 <label>
+219 m4 @0x00007fb9884529d8 65 48 8b 0c 25 e0 00 mov %gs:0x000000e0[8byte] -> %rcx
00 00
+228 L3 @0x00007fb98844e980 76 ea jbe $0x00007fbbcc372f4b
END 0x00007fbbcc372f5b
Without DR_CLEANCALL_READS_APP_CONTEXT:
after instrumentation:
TAG 0x00007f8c480a7f5b
+0 m4 @0x00007f8a04186e78 65 48 89 0c 25 e0 00 mov %rcx -> %gs:0x000000e0[8byte]
00 00
+9 m4 @0x00007f8a04186d78 65 48 8b 0c 25 20 01 mov %gs:0x00000120[8byte] -> %rcx
00 00
+18 m4 @0x00007f8a041867d0 65 48 89 14 25 e8 00 mov %rdx -> %gs:0x000000e8[8byte]
00 00
+27 m4 @0x00007f8a04182fd0 48 ba 5b 1f 00 00 08 mov $0x2004000800001f5b -> %rdx
00 04 20
+37 m4 @0x00007f8a04186530 48 89 11 mov %rdx -> (%rcx)[8byte]
+40 m4 @0x00007f8a04187bd8 <label>
+40 m4 @0x00007f8a041865b0 48 8d 49 08 lea 0x08(%rcx) -> %rcx
+44 m4 @0x00007f8a041864c8 65 48 89 0c 25 20 01 mov %rcx -> %gs:0x00000120[8byte]
00 00
+53 m4 @0x00007f8a04186cf8 <label>
+53 L3 @0x00007f8a04187370 48 83 f8 21 cmp %rax $0x0000000000000021
+57 m4 @0x00007f8a041879d8 65 48 8b 0c 25 20 01 mov %gs:0x00000120[8byte] -> %rcx
00 00
+66 m4 @0x00007f8a04183750 48 8b 09 mov (%rcx)[8byte] -> %rcx
+69 m4 @0x00007f8a04186618 e3 fe jrcxz @0x00007f8a04186680[8byte] %rcx
+71 m4 @0x00007f8a04187ad8 eb fe jmp @0x00007f8a04186700[8byte]
+73 m4 @0x00007f8a04187b58 90 nop
+74 m4 @0x00007f8a04186680 <label>
+74 m4 @0x00007f8a04183ab8 e9 fb ff ff ff jmp @0x00007f8a04183668[8byte]
+79 m4 @0x00007f8a04186700 <label>
+79 m4 @0x00007f8a04186bc0 65 48 a3 00 00 00 00 mov %rax -> %gs:0x00[8byte]
00 00 00 00
+90 m4 @0x00007f8a04186a70 65 48 a1 20 00 00 00 mov %gs:0x20[8byte] -> %rax
00 00 00 00
+101 m4 @0x00007f8a04186af0 48 89 60 18 mov %rsp -> 0x18(%rax)[8byte]
+105 m4 @0x00007f8a04186b58 48 8b a0 48 09 00 00 mov 0x00000948(%rax)[8byte] -> %rsp
+112 m4 @0x00007f8a04186a08 65 48 a1 00 00 00 00 mov %gs:0x00[8byte] -> %rax
00 00 00 00
+123 m4 @0x00007f8a04186df8 48 8d a4 24 f8 f6 ff lea 0xfffff6f8(%rsp) -> %rsp
ff
+131 m4 @0x00007f8a04186ef8 e8 0b 25 1b 00 call $0x00007f8c04331e00 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
+136 m4 @0x00007f8a04187570 <label>
+136 m4 @0x00007f8a04187470 48 bf 5f 7f 0a 48 8c mov $0x00007f8c480a7f5f -> %rdi
7f 00 00
+146 m4 @0x00007f8a04186c28 e8 53 e0 f6 ff call $0x00007f8c040ed948 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
+151 m4 @0x00007f8a04186c90 e8 0b 26 1b 00 call $0x00007f8c04331f00 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
+156 m4 @0x00007f8a04186850 65 48 a3 00 00 00 00 mov %rax -> %gs:0x00[8byte]
00 00 00 00
+167 m4 @0x00007f8a041868b8 65 48 a1 20 00 00 00 mov %gs:0x20[8byte] -> %rax
00 00 00 00
+178 m4 @0x00007f8a04186768 48 8b 60 18 mov 0x18(%rax)[8byte] -> %rsp
+182 m4 @0x00007f8a041874f0 65 48 a1 00 00 00 00 mov %gs:0x00[8byte] -> %rax
00 00 00 00
+193 m4 @0x00007f8a041869a0 <label>
+193 m4 @0x00007f8a04186920 <label>
+193 m4 @0x00007f8a04183668 <label>
+193 m4 @0x00007f8a04187a58 <label>
+193 m4 @0x00007f8a04187958 <label>
+193 m4 @0x00007f8a04183b20 65 48 8b 0c 25 e0 00 mov %gs:0x000000e0[8byte] -> %rcx
00 00
+202 m4 @0x00007f8a041878d8 65 48 8b 14 25 e8 00 mov %gs:0x000000e8[8byte] -> %rdx
00 00
+211 L3 @0x00007f8a04183518 76 ea jbe $0x00007f8c480a7f4b
END 0x00007f8c480a7f5b
In the regular case, drreg_event_bb_insert_late calls drreg_insert_restore_all, which restores both rcx and rdx. With DR_CLEANCALL_READS_APP_CONTEXT, drreg_event_clean_call_insertion has already restored rdx, so drreg_event_bb_insert_late skips this reg.
All of this happens inside drreg code, which is not aware of the control flow inserted by the tracer; the tracer, in turn, is not aware of which registers were marked native by dr_insert_clean_call_ex. So how should we handle this?
Regular case:
CLEANCALL: insert clean call to 0x00007f8c040ed948
insert_parameter_preparation: 1 args, 6 in-reg, 0 pre-push, 0/0 stack
in insert_conditional_skip_target: reg: rax
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rax already native
in insert_conditional_skip_target: reg: rbx
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rbx already native
in insert_conditional_skip_target: reg: rsp
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rsp already native
in insert_conditional_skip_target: reg: rbp
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rbp already native
in insert_conditional_skip_target: reg: rsi
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rsi already native
in insert_conditional_skip_target: reg: rdi
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rdi already native
in insert_conditional_skip_target: reg: r8
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r8 already native
in insert_conditional_skip_target: reg: r9
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r9 already native
in insert_conditional_skip_target: reg: r10
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r10 already native
in insert_conditional_skip_target: reg: r11
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r11 already native
in insert_conditional_skip_target: reg: r12
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r12 already native
in insert_conditional_skip_target: reg: r13
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r13 already native
in insert_conditional_skip_target: reg: r14
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r14 already native
in insert_conditional_skip_target: reg: r15
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r15 already native
drreg_unreserve_register @0.0x00007f8c480a7f5f rcx
drreg_insert_restore_all @0.0x00007f8c480a7f5f: lazily restoring rcx
drreg_restore_reg_now @0.0x00007f8c480a7f5f: restoring rcx
restore_reg @0.0x00007f8c480a7f5f rcx slot=0 release=1
drreg_insert_restore_all @0.0x00007f8c480a7f5f: lazily restoring rdx
drreg_restore_reg_now @0.0x00007f8c480a7f5f: restoring rdx
restore_reg @0.0x00007f8c480a7f5f rdx slot=1 release=1
With DR_CLEANCALL_READS_APP_CONTEXT:
CLEANCALL: insert clean call to 0x00007fbb883b8898
drreg_event_clean_call_insertion: restoring for cleancall to read app regs
drreg_insert_restore_all @0.0x0000000000000000: restoring rcx for app read
spill_reg @0.0x0000000000000000 rcx 2
restore_reg @0.0x0000000000000000 rcx slot=0 release=0
restore_reg @0.0x00007fbbcc372f5f rcx slot=2 release=1
drreg_insert_restore_all @0.0x0000000000000000: lazily restoring rdx |
drreg_restore_reg_now @0.0x0000000000000000: restoring rdx |---> rdx restored by drreg_event_clean_call_insertion
restore_reg @0.0x0000000000000000 rdx slot=1 release=1 |
insert_parameter_preparation: 0 args, 6 in-reg, 0 pre-push, 0/0 stack
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rax already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rbx already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rsp already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rbp already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rsi already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rdi already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r8 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r9 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r10 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r11 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r12 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r13 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r14 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r15 already native
drreg_unreserve_register @0.0x00007fbbcc372f5f rcx
drreg_insert_restore_all @0.0x00007fbbcc372f5f: lazily restoring rcx
drreg_restore_reg_now @0.0x00007fbbcc372f5f: restoring rcx
restore_reg @0.0x00007fbbcc372f5f rcx slot=0 release=1
Looks like I was missing DR_CLEANCALL_MULTIPATH.
I am seeing an assert in translate.c where it hits ASSERT_NOT_REACHED in recreate_app_state_from_ilist. This happens when we call dr_flush_region_ex. The application is a slightly modified version of threadsig (increased data footprint, no signals). I have run with 4 and 8 threads and haven't tried other thread counts.
Any idea what this assert means?
Thread 2 "threadsig2" hit Breakpoint 1, recreate_app_state_from_ilist (tdcontext=0x7ffdb3e401c0, ilist=0x7ffdb4081be0,
start_app=0x7ffff6275ad0 "\017\030\216@\376\377\377\017\030\216\200\375\377\377f\017oF\360f\017oN\340f\017oV\320f\017o^\300f\017of\260f\017on\240f\017ov\220f\017o~\200H\215v\200H\201", <incomplete sequence \352\200>, start_cache=0x7fffb3fc4019 "eH\211\f", <incomplete sequence \340>,
end_cache=0x7fffb3fc480d "\314\314\314\070Y\353\263\375\177", mc=0x7ffdb3f1d020, just_pc=false, flags=16778806) at /home/prasun/dynamorio/core/translate.c:1117
1117 ASSERT_NOT_REACHED();
(gdb) where
#0 recreate_app_state_from_ilist (tdcontext=0x7ffdb3e401c0, ilist=0x7ffdb4081be0,
start_app=0x7ffff6275ad0 "\017\030\216@\376\377\377\017\030\216\200\375\377\377f\017oF\360f\017oN\340f\017oV\320f\017o^\300f\017of\260f\017on\240f\017ov\220f\017o~\200H\215v\200H\201", <incomplete sequence \352\200>, start_cache=0x7fffb3fc4019 "eH\211\f", <incomplete sequence \340>,
end_cache=0x7fffb3fc480d "\314\314\314\070Y\353\263\375\177", mc=0x7ffdb3f1d020, just_pc=false, flags=16778806) at /home/prasun/dynamorio/core/translate.c:1117
#1 0x00000000712064d6 in recreate_app_state_internal (tdcontext=0x7ffdb3e401c0, mcontext=0x7ffdb3f1d020, just_pc=false, owning_f=0x0, restore_memory=false) at /home/prasun/dynamorio/core/translate.c:1480
#2 0x0000000071206aa4 in recreate_app_state (tdcontext=0x7ffdb3e401c0, mcontext=0x7ffdb3f1d020, restore_memory=false, f=0x0) at /home/prasun/dynamorio/core/translate.c:1623
#3 0x00000000711dcfc1 in at_safe_spot (trec=0x7ffdb3e40dc8, mc=0x7ffdb3f1d020, desired_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT_OR_NO_XFER) at /home/prasun/dynamorio/core/synch.c:613
#4 0x00000000711de4b2 in synch_with_thread (id=152173, block=false, hold_initexit_lock=true, caller_state=THREAD_SYNCH_NONE, desired_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT_OR_NO_XFER, flags=2) at /home/prasun/dynamorio/core/synch.c:1031
#5 0x00000000711dfc42 in synch_with_all_threads (desired_synch_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT_OR_NO_XFER, threads_out=0x71650000 <flush_threads>, num_threads_out=0x71650008 <flush_num_threads>, cur_state=THREAD_SYNCH_NO_LOCKS_NO_XFER, flags=2)
at /home/prasun/dynamorio/core/synch.c:1411
#6 0x0000000071080f5e in flush_fragments_synchall_start (ignored=0x7ffdb3e3ca00, base=0x0, size=18446744073709551615, exec_invalid=false) at /home/prasun/dynamorio/core/fragment.c:5953
#7 0x00000000710830fe in flush_fragments_synch_unlink_priv (dcontext=0x7ffdb3e3ca00, base=0x0, size=18446744073709551615, own_initexit_lock=false, exec_invalid=false, force_synchall=true) at /home/prasun/dynamorio/core/fragment.c:6475
#8 0x0000000071084225 in flush_fragments_in_region_start (dcontext=0x7ffdb3e3ca00, base=0x0, size=18446744073709551615, own_initexit_lock=false, free_futures=false, exec_invalid=false, force_synchall=true) at /home/prasun/dynamorio/core/fragment.c:6776
#9 0x00000000710849d0 in flush_fragments_from_region (dcontext=0x7ffdb3e3ca00, base=0x0, size=18446744073709551615, force_synchall=true, flush_completion_callback=0x7fffb3d6ea75 <change_warmup_callback(void*)>, user_data=0x0)
at /home/prasun/dynamorio/core/fragment.c:6853
#10 0x00000000711fd774 in dr_flush_region_ex (start=0x0, size=18446744073709551615, flush_completion_callback=0x7fffb3d6ea75 <change_warmup_callback(void*)>, user_data=0x0) at /home/prasun/dynamorio/core/lib/instrument.c:6797
#11 0x00007fffb3d6ee18 in hit_warmup_threshold (next_pc=0x7ffff6275b3d "s\221H\203\372\300H\215\222\200") at /home/prasun/dynamorio/clients/drcachesim/tracer/tracer.cpp:1931
#12 0x00007fffb3d6af07 in memtrace (drcontext=0x7ffdb3e3ca00, skip_size_cap=false, next_pc=0x7ffff6275b3d "s\221H\203\372\300H\215\222\200") at /home/prasun/dynamorio/clients/drcachesim/tracer/tracer.cpp:763
#13 0x00007fffb3d6b1fb in clean_call (next_pc=0x7ffff6275b3d "s\221H\203\372\300H\215\222\200") at /home/prasun/dynamorio/clients/drcachesim/tracer/tracer.cpp:851
#14 0x00007fffb3fc47a4 in ?? ()
#15 0x00007ffff546d000 in ?? ()
#16 0x00007ffff54ee000 in ?? ()
#17 0x00007ffff5d69ef0 in ?? ()
#18 0x00007ffdb3f1e720 in ?? ()
#19 0x0000000000000000 in ?? ()
Translation failure. If it is reproducible, I would run with logging and find the instruction that fails to translate. Is it DR mangling and a case is missing in the translation code? Is it tool instrumentation?
It doesn't look like it is failing to translate a specific instruction. I think it is unable to find the mcontext->pc in the ilist.
I have attached the log: translate-failure.txt .
I was clearing op_L0_filter before calling dr_flush_region_ex. A fragment was getting re-translated on another thread while dr_flush_region_ex was in progress, so the fragment length at fcache_fragment_pclookup was inconsistent with the fragment definition at recreate_app_state_from_ilist.
Clearing op_L0_filter after dr_flush_region_ex fixed the issue, since the regenerated fragment is identical to the earlier fragment.
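A sketch of the resulting ordering (names mirror the discussion; the flush function here is a stand-in for dr_flush_region_ex, which runs a completion callback once the flush is done):

```cpp
#include <atomic>
#include <functional>

// Hypothetical sketch of the ordering fix: keep op_L0_filter set while the
// flush is in progress so any fragment re-created concurrently is identical
// to the one being translated; clear it only in the completion callback.
static std::atomic<bool> op_L0_filter{true};

void switch_to_full_trace(const std::function<void(std::function<void()>)> &flush_all) {
    // The wrong order would be: clear op_L0_filter, then flush. That lets
    // another thread rebuild a fragment with different instrumentation
    // mid-flush, breaking translation.
    flush_all([]() { op_L0_filter.store(false, std::memory_order_release); });
}
```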
I now have some basic runs going through. I can start sending some PRs if the approach below looks fine. It is not perfect but we could add this now and improve it gradually.
- only offline traces are considered for now
- Added an option named
L0_warmup_refs - added a separate suffix for warmup trace
.warmup.rawinstead of.raw - seeing some crashes with
exit_after_tracingso not allowing it withL0_warmup_refsfor now - Added a message to indicate when
max_trace_sizeis reached (printed only once) - ignore size_cap in warmup mode since that applies only to instr trace
- first thread to reach warmup threshold
- writes footer and closes warmup trace
- opens inst trace and writes header
- sets flag indicating warmup completed and code cache flush started
- initiates code cache flush
- in the cache flush completion callback
- clear op_L0_filter
- clear memref_needs_full_info_
- disables warmup instrumentation
- enables inst trace instrumentation
- sets flag indicating cache flush done
- since switch to inst trace occurs in the middle of a basic block, while writing the buffer to disk we need to drop entries until we get to a PC record
- other threads drop their buffers if they see code cache flush in progress
- rare case where we started writing to a buffer in warmup mode and code cache flush was completed by by the time we got to the clean call to write it to disk
- seen only for the main thread which is not doing much work and contains only header and footer
- added handling to avoid post-processing failures
- updated
append_unit_headerto look at the current value ofop_L0_filter; instead it should look at how much space was reserved - could drop such buffers also
- updated
- could add more detailed handling later if necessary
- Added PC as an argument to clean_call so that we only do code cache flush if we have the PC which needs to be supplied to dr_redirect_execution in mcontext
- disabled short_reaches for now as it was causing the `encode_cti` assert with warmup instrumentation; will need a better approach long term, but we can add a flag to control this for now
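The switch sequence above can be modeled with a small self-contained sketch (this is not the actual tracer code; `phase_t`, `trace_buffer_t`, and the threshold constant are invented for illustration). It shows the first-thread-to-cross-the-threshold handoff and why other threads drop their buffers while the flush is in flight, producing the gap between the two phases:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of the warmup -> full-trace switch protocol (illustrative only).
enum class phase_t { WARMUP, FLUSH_IN_PROGRESS, FULL_TRACE };

std::atomic<phase_t> g_phase{phase_t::WARMUP};
std::atomic<uint64_t> g_warmup_refs{0};
constexpr uint64_t kWarmupLimit = 1000; // Stand-in for -L0_warmup_refs.

struct trace_buffer_t {
    std::vector<int> entries;
    phase_t phase_at_fill = phase_t::WARMUP;
};

// Called from each thread's buffer-flush clean call. Returns true if the
// buffer should be written out, false if it must be dropped because the
// code-cache flush raced with it.
bool on_buffer_full(trace_buffer_t &buf)
{
    phase_t cur = g_phase.load(std::memory_order_acquire);
    if (cur == phase_t::FLUSH_IN_PROGRESS) {
        // Other threads drop their buffers while the flush is in flight,
        // which is why there is a gap between the two phases.
        buf.entries.clear();
        return false;
    }
    if (cur == phase_t::WARMUP && buf.phase_at_fill == phase_t::WARMUP) {
        uint64_t total =
            g_warmup_refs.fetch_add(buf.entries.size()) + buf.entries.size();
        if (total >= kWarmupLimit) {
            // First thread to cross the threshold initiates the switch:
            // close the warmup output, then start the code cache flush; the
            // flush completion callback would move the phase to FULL_TRACE.
            phase_t expected = phase_t::WARMUP;
            g_phase.compare_exchange_strong(expected,
                                            phase_t::FLUSH_IN_PROGRESS);
        }
    }
    return true;
}
```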
My main question here would be: should this feature use drbbdup for the two phases, rather than flushing?
We are currently implementing #3995 using the drbbdup library to provide multiple instrumentation cases dispatched depending on the current phase: count instructions, or produce a trace. This changes the current delayed tracing feature to no longer use a flush and use drbbdup instead.
Some of the problems are similar between the two:
- How to flush other threads' buffers at the phase transition
- How online traces handle phases
other threads drop their buffers if they see code cache flush in progress
How do other threads know a flush is in progress? Do they throw away the data and so there is a gap between the warmup and the full phase?
We are currently discussing how best to handle this for #3995. Will update that issue with the current plan.
rare case where we started writing to a buffer in warmup mode and the code cache flush was completed by the time we got to the clean call to write it to disk; seen only for the main thread, which is not doing much work and contains only header and footer
Not sure I understand this: I thought every thread's half-full buffer was discarded? Do you mean a thread might have never written its first buffer-full and so have empty output?
My main question here would be: should this feature use drbbdup for the two phases, rather than flushing?
Yes, I had considered that initially (https://github.com/DynamoRIO/dynamorio/issues/3995#issuecomment-825748234) and we had discussed it also (https://groups.google.com/g/dynamorio-users/c/fn6LBUxTd2o/m/98l0bK1BAQAJ). At the time we decided that flushing was okay for now since there was some work needed for bbdup. Sounds like it is in much better shape now?
How do other threads know a flush is in progress? Do they throw away the data and so there is a gap between the warmup and the full phase?
They check a flag when they are in the memtrace clean call. The flag is set by the thread that initiated the switch. Yes they throw away the data and there is a gap between the two phases.
Not sure I understand this: I thought every thread's half-full buffer was discarded? Do you mean a thread might have never written its first buffer-full and so have empty output?
In this case the thread was mostly idle, so the buffer had a header and discarding it completely would create an incorrectly formatted trace.
PR #5393 added drbbdup to drmemtrace for delayed tracing; PR #5402 is adding alternating trace-notrace windows. Per https://github.com/DynamoRIO/dynamorio/pull/5393#discussion_r820264882 the plan is to then refactor tracer.cpp to split into more manageable pieces, if you were interested in switching to a drbbdup-based scheme.
I made a change to introduce an `L0_filter` mode so that we can switch from the warmup trace to the full trace. I am only creating this if the flag `need_l0_filter_mode` is set. I will set it in the proposed `L0_warmup` mode. How does this sound?
I will take a look at the code.
The drbbdup modes are very convenient but note that there is still some missing stability work in getting drbbdup scaled up: there is one known stability issue #5686 for scatter/gather plus at least one more similar one not yet figured out. Has drbbdup worked for your workload targets or have you also seen stability issues?
@prasun3 In your design, the final trace will be a combination of the warmup trace and full trace. Would you also want to split the two parts into two separate traces? Or would you continue to have a single combined trace?
@abhinav92003 In my first implementation I was flushing the code cache. I am redoing the code changes to use drbbdup. In the older implementation I am creating separate traces. The first thread to reach the warmup limit initiates the switch. The other threads drop any buffers that contain a mix of warmup and full trace. Due to this we have a gap in between the warmup and full trace.
Since I am re-doing this with drbbdup, I am open to either keeping it combined or splitting it. The problem with keeping a single file is that raw2trace gets confused between the two types of entries. I saw something about a FILTER_ENDPOINT marker in #5742. I'd be interested in learning how to insert that marker precisely in all the thread buffers.
Has drbbdup worked for your workload targets or have you also seen stability issues?
@derekbruening sorry forgot to reply to this. We haven't done much testing with this mode.
The record_filter tool implemented in #5675 allows filtering records in a stored offline trace. It has a feature to perform cache filtering till a certain timestamp is reached in the trace, which could be useful to create a warm-up trace preceding the full trace.
In #5738, I was extending the record_filter tool to also split the trace into two parts (warmup and full trace), but we decided to go a different direction where the trace reader figures out the end of the warmup region using the new FILTER_ENDPOINT marker inserted during filtering (this obviates the need to have actually separated trace files).
Since all of this filtering is done after raw2trace, we avoid some complexities related to raw2trace getting confused.
I see that your design does the filtering online. One benefit I can think of is that your design also reduces tracing overhead while collecting the warm up trace (due to writing out fewer trace entries to memory).
Ok, so in your case you have one 'full trace' raw file which is post-processed to produced a trace.gz file. Then this trace.gz file is further filtered to produce a new trace.gz file containing the following:
- filtered entries
- then a FILTER_ENDPOINT marker
- and then followed by the full trace entries
Is this correct?
Yes, by filtering online we want to reduce tracing overhead as well as disk usage which can be quite large at larger thread counts.
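Assuming the combined layout described above (filtered entries, then a FILTER_ENDPOINT marker, then full-trace entries), a reader could locate the start of the full-trace region by scanning for the marker. This is an illustrative sketch only: `record_t` and `kFilterEndpoint` are invented stand-ins for the real drmemtrace record type and TRACE_MARKER_TYPE_FILTER_ENDPOINT:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Invented stand-ins for the real trace record and marker types.
enum record_kind_t { kInstr, kLoad, kStore, kMarker };
constexpr int kFilterEndpoint = 1;

struct record_t {
    record_kind_t kind;
    int marker_type = 0; // Only meaningful when kind == kMarker.
};

// Returns the index of the first record after the warmup (filtered) region,
// i.e. just past the FILTER_ENDPOINT marker; returns 0 if no marker exists
// (the whole trace is then treated as unfiltered).
size_t full_trace_start(const std::vector<record_t> &trace)
{
    for (size_t i = 0; i < trace.size(); ++i) {
        if (trace[i].kind == kMarker && trace[i].marker_type == kFilterEndpoint)
            return i + 1;
    }
    return 0;
}
```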
When we change tracing_mode inside a clean call, when do we start executing from the new instrumentation? Do we 'return' from the clean call to the new instrumentation, or does the switch happen at the next dispatch? When do the other threads do the switch? I need to understand this in order to insert the endpoint marker at the appropriate point in the trace.
How is trace_for_instrs expected to work with L0_filter - is trace_for_instrs expected to count before the filtering occurs, or after?
Here's a simple program that mostly hits in the L0 cache.
$ cat loop.c
int main()
{
for (int i = 0; i < 1000000; i++)
{
}
}
$ gcc -o loop loop.c
The output below shows that it runs for around 3 million instructions, which mostly hit in the cache when we run with -L0_filter.
$ build/bin64/drrun -t drcachesim -simulator_type basic_counts -- ./loop 2>&1 | awk '/total/ && $1 > 0'
3160699 total (fetched) instructions
9303 total unique (fetched) instructions
356 total non-fetched instructions
2039556 total data loads
1010074 total data stores
1 total threads
39332 total scheduling markers
4 total other markers
$ build/bin64/drrun -t drcachesim -L0_filter -simulator_type basic_counts -- ./loop 2>&1 | awk '/total/ && $1 > 0'
984 total (fetched) instructions
935 total unique (fetched) instructions
2833 total data loads
637 total data stores
1 total threads
90 total scheduling markers
50 total other markers
Is trace_for_instrs expected to count before the filtering occurs, or after? If I trace for 2M instructions, then I can see that we don't hit the window limit in the L0_filter mode.
$ build/bin64/drrun -t drcachesim -trace_for_instrs 2000000 -offline -- ./loop
Hit tracing window #0 limit: disabling tracing.
done
$ build/bin64/drrun -t drcachesim -L0_filter -trace_for_instrs 2000000 -offline -- ./loop
done
I see in output.cpp that instruction counting is done using instru->get_instr_count
if (op_trace_for_instrs.get_value() > 0) {
    for (mem_ref = data->buf_base + header_size; mem_ref < buf_ptr;
         mem_ref += instru->sizeof_entry()) {
        if (!window_changed && !hit_window_end &&
            op_trace_for_instrs.get_value() > 0) {
            hit_window_end =
                count_traced_instrs(drcontext, instru->get_instr_count(mem_ref));
In instru_offline.cpp, I see that it counts the instruction count from the memref. This will work for the full trace but not for the filtered trace. Shouldn't we be using MEMTRACE_TLS_OFFS_ICOUNT instead of instru->get_instr_count in this case?
int
offline_instru_t::get_instr_count(byte *buf_ptr) const
{
    offline_entry_t *entry = (offline_entry_t *)buf_ptr;
    if (entry->addr.type != OFFLINE_TYPE_PC)
        return 0;
    // TODO i#3995: We should *not* count "non-fetched" instrs so we'll match
    // hardware performance counters.
    // Xref i#4948 and i#4915 on getting rid of "non-fetched" instrs.
    return entry->pc.instr_count;
}
Is this correct?
Yes.
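To make the discrepancy concrete, here is a toy model (all names invented for illustration) contrasting a count taken only from the PC entries that survive filtering against a pre-filter per-thread counter like the TLS icount mentioned above. Under a filter, far fewer PC entries reach the buffer, so counting from buffered entries undercounts:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model: one element per basic-block execution (names invented).
struct bb_exec_t {
    uint64_t instr_count;  // Instructions executed in this block.
    bool survives_filter;  // Whether its PC entry reaches the trace buffer.
};

// What counting via buffered PC entries sees under filtering.
uint64_t count_from_buffer(const std::vector<bb_exec_t> &execs)
{
    uint64_t n = 0;
    for (const auto &e : execs)
        if (e.survives_filter) // Only buffered PC entries are visible.
            n += e.instr_count;
    return n;
}

// What a pre-filter TLS counter sees: every execution, filtered or not.
uint64_t count_pre_filter(const std::vector<bb_exec_t> &execs)
{
    uint64_t n = 0;
    for (const auto &e : execs)
        n += e.instr_count;
    return n;
}
```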
When we change tracing_mode inside a clean call, when do we start executing from the new instrumentation?
After the global tracing_mode flag is changed by any one thread, tracing will begin for a thread whenever it enters its next fragment. Note that the next fragment doesn't need to be new; it could be an existing fragment in the code cache. It is not required to enter dispatch for tracing to begin.
I need to understand this in order to insert the endpoint marker at the appropriate point in the trace.
It won't be simple to insert this during live tracing as the drbbdup modes do not have a notion of "first execution": we would have to add that somehow. There's no simple way to detect this in post-processing either. For separation of tracing windows we do trigger the end-of-bb clean call on a window change using a conditional-free scheme of writing a difference in TLS value or something like that (see instrument_clean_call()). We could consider something like that; maybe even using the window code directly by considering the filter-trace transition a window change, but that may bring other complexities so it may not work out.
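A simplified model of that per-block window check follows (the real tracer uses a branch-free TLS-difference encoding in instrument_clean_call(); this sketch uses an explicit comparison for clarity, and all names are invented). Each thread remembers the last window ordinal it observed and triggers the end-of-bb clean call when the global ordinal has moved, which is the point where a transition marker could be placed in that thread's buffer:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Global window ordinal, bumped once per window (or filter->trace) change.
std::atomic<uint64_t> g_window_ordinal{0};

struct thread_state_t {
    uint64_t last_seen_window = 0; // TLS copy of the ordinal.
    int clean_call_count = 0;      // How many transition clean calls ran.
};

// Per-block check: invoke the clean call only when the window changed
// since this thread last emitted trace data.
void on_block_entry(thread_state_t &ts)
{
    uint64_t cur = g_window_ordinal.load(std::memory_order_acquire);
    if (cur != ts.last_seen_window) {
        // Window changed: run the clean call so a marker can be written
        // at exactly this point in the thread's buffer.
        ++ts.clean_call_count;
        ts.last_seen_window = cur;
    }
}
```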
How is trace_for_instrs expected to work with L0_filter - is trace_for_instrs expected to count before the filtering occurs, or after?
Maybe it would be expected to be like -warmup_refs and -sim_refs and so it would start counting after filtering is done? Wait -- that would be if we had a -filter_for_instrs. Is that what we ended up with? Is your question about -trace_for_instrs plus -filter_for_instrs?
For separation of tracing windows we do trigger the end-of-bb clean call on a window change using a conditional-free scheme of writing a difference in TLS value or something like that (see instrument_clean_call()). We could consider something like that;
Thanks. I'll take a look at this.
How is trace_for_instrs expected to work with L0_filter - is trace_for_instrs expected to count before the filtering occurs, or after?
Maybe it would be expected to be like -warmup_refs and -sim_refs and so it would start counting after filtering is done? Wait -- that would be if we had a -filter_for_instrs. Is that what we ended up with? Is your question about -trace_for_instrs plus -filter_for_instrs?
Yes, I moved to -filter_for_instrs. I was using -filter_for_refs earlier, and for some microbenchmarks with a small memory footprint we never get out of the filter phase. In this case it might be better to use -filter_for_instrs if that option counts before filtering.
We might also want to get the full instruction trace at the same point with and without the warmup trace. Counting instructions before filtering would be useful for this.
I also saw this comment in append_unit_header in output.cpp. What is meant here by "aligning filtered traces with unfiltered traces"?
// Per-buffer should be sufficient as markers to align filtered traces
// with unfiltered traces, and is much lower overhead.
I also saw this comment in append_unit_header in output.cpp. What is meant here by "aligning filtered traces with unfiltered traces"?
// Per-buffer should be sufficient as markers to align filtered traces
// with unfiltered traces, and is much lower overhead.
This is for having instr counts for computing MPKI to compare filtered with unfiltered via simulator metrics, for overall metrics as well as comparisons at particular points mid-run: so aligning across two separate runs in a simulator.