dynamorio icon indicating copy to clipboard operation
dynamorio copied to clipboard

Collect warmup trace (using L0_filter) followed by a full instruction trace

Open prasun3 opened this issue 4 years ago • 17 comments

Add a pointer to any prior users list discussion. https://groups.google.com/g/dynamorio-users/c/fn6LBUxTd2o/m/aevZa4ZRAAAJ

Is your feature request related to a problem? Please describe. It might be useful to have a "pre-trace" that can be used to warm up large caches before we start simulating instruction traces.

Describe the solution you'd like We can have an "L0 warmup" phase just like how we have a fast forward phase (-trace_after_instrs) currently. During this phase, the L0_filter would be active. Once we reach the specified instruction count, we would switch to the regular trace instrumentation.

An example run might look something like this: drrun -t drcachesim -trace_after_instrs 1000M -L0_warmup_refs 1M --offline -- ../app

Do you have any implementation in mind for this feature? This could be implemented with an approach similar to the 'trace_after_instrs' feature. We start off with the L0_filter instrumentation and then flush the code cache and enable regular trace instrumentation when the switching criteria is reached.

Describe alternatives you've considered We could just collect a really long trace but that might have higher tracing overhead, higher disk usage etc. We could use DrBBDup (#4134) but there are some TODOs still open for this feature.

Additional context

prasun3 avatar Nov 10 '21 11:11 prasun3

For an online trace, how are the two phases distinguished in analysis tools? Is a new instance of drcachesim spun up for the switch? Or are the trace entries just run together -- with some kind of marker in between like the offline file type marker??

The same question for an offline trace: though here it's easier to have completely separate output files. I assume that's the plan.

derekbruening avatar Nov 10 '21 15:11 derekbruening

Yes, for offline traces we will have separate files. I have been focusing on offline traces and hadn't thought about online trace. I think the marker idea sounds good -- that way the analyzer can have control on what it wants to do with data from the different phases.

prasun3 avatar Nov 11 '21 15:11 prasun3

While switching instrumentation, I need to ensure that the regs are restored. Therefore, I tried to change clean_call to use DR_CLEANCALL_READS_APP_CONTEXT. I ran into an assert DynamoRIO usage error : encode_cti error: target beyond 8-bit reach after making this change.

This is the change I made

diff --git a/clients/drcachesim/tracer/tracer.cpp b/clients/drcachesim/tracer/tracer.cpp
index c97ed93..4071b14 100644
--- a/clients/drcachesim/tracer/tracer.cpp
+++ b/clients/drcachesim/tracer/tracer.cpp
@@ -843,8 +843,11 @@ instrument_clean_call(void *drcontext, instrlist_t *ilist, instr_t *where,
     reg_id_t reg_tmp =
         insert_conditional_skip(drcontext, ilist, where, reg_ptr, skip_call,
                                 short_reaches, app_regs_at_skip_call);
-    dr_insert_clean_call_ex(drcontext, ilist, where, (void *)clean_call,
-                            DR_CLEANCALL_ALWAYS_OUT_OF_LINE, 0);
+    dr_insert_clean_call_ex(
+        drcontext, ilist, where, (void *)clean_call,
+        static_cast<dr_cleancall_save_t>(DR_CLEANCALL_READS_APP_CONTEXT |
+                                         DR_CLEANCALL_ALWAYS_OUT_OF_LINE),
+        0);
     insert_conditional_skip_target(drcontext, ilist, where, skip_call, reg_tmp,
                                    app_regs_at_skip_call);
     insert_conditional_skip_target(drcontext, ilist, where, skip_thread, reg_thread,

It looks like the generated code is bigger now due to the additional gpr restore code and the jnle instruction is using an 8-bit encoding and no longer able to address the displacement. Do we need to rewrite the instruction with a different jnle variant that can handle this displacement? Have we run into a similar problem before?

prasun3 avatar Dec 07 '21 17:12 prasun3

No, there won't be any problems with application branches. It is a tool-inserted short jump. You can see it here:

 +240  m4 @0x00007fb179b75728  e3 fe                jrcxz  @0x00007fb179b6f7c8[8byte] %rcx
...
 +403  m4 @0x00007fb179b6f7c8                       <label>

Probably that is the tracer skipping the clean call unless the buffer is full.

Xref #56 but that requires flags.

derekbruening avatar Dec 08 '21 00:12 derekbruening

Thanks! I have set short_reaches to false for now. Now I am running into cases where some GPRs appear to be clobbered.

For example, the value in rdx here is clearly bad.

before instrumentation:
TAG  0x00007fbbcc372f4b
 +0    L3 @0x00007fb988452470  48 89 14 c6          mov    %rdx -> (%rsi,%rax,8)[8byte]
 +4    L3 @0x00007fb98844e980  48 83 c2 10          add    $0x0000000000000010 %rdx -> %rdx
 +8    L3 @0x00007fb9884529d8  48 8b 02             mov    (%rdx)[8byte] -> %rax
 +11   L3 @0x00007fb9884514c8  48 85 c0             test   %rax %rax
 +14   L3 @0x00007fb98844e818  74 35                jz     $0x00007fbbcc372f90
END 0x00007fbbcc372f4b

drreg_restore_app_value @0.0x00007fbbcc372f59: reg rax already native
drreg_restore_app_value @0.0x00007fbbcc372f59: reg rdx already native
..
drreg_restore_app_value @0.0x00007fbbcc372f59: reg rsi already native

priv_mcontext_t @0x00007fb988424080
	xax = 0x000000000000000e
	xbx = 0x000000006ffffeff
	xcx = 0x000000037ffffb70
	xdx = 0x2004000800001f5b

opnd_compute_address for: (%rdx)
	base => 0x2004000800001f6b
	index,scale => 0x2004000800001f6b
	disp => 0x2004000800001f6b
For SIGSEGV at cache pc 0x00007fbb886015d2, computed target read 0x2004000800001f6b
	faulting instr: mov    (%rdx)[8byte] -> %rax
** Received SIGSEGV at cache pc 0x00007fbb886015d2 in thread 55106

Looking back a few blocks, rdx was native here

before instrumentation:
TAG  0x00007fbbcc372ea0
 +0    L3 @0x00007fb9884504e8  55                   push   %rbp %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +1    L3 @0x00007fb988450618  48 89 e5             mov    %rsp -> %rbp
 +4    L3 @0x00007fb988450550  41 57                push   %r15 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +6    L3 @0x00007fb988450468  41 56                push   %r14 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +8    L3 @0x00007fb98844dc20  41 55                push   %r13 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +10   L3 @0x00007fb98844dca0  41 54                push   %r12 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +12   L3 @0x00007fb98844e6d0  49 89 fc             mov    %rdi -> %r12
 +15   L3 @0x00007fb98844eb20  53                   push   %rbx %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +16   L3 @0x00007fb98844e0d0  48 83 ec 48          sub    $0x0000000000000048 %rsp -> %rsp
 +20   L3 @0x00007fb98844e900  0f 31                rdtsc   -> %edx %eax
 +22   L3 @0x00007fb9884503e8  48 c1 e2 20          shl    $0x0000000000000020 %rdx -> %rdx
 +26   L3 @0x00007fb988450368  89 c0                mov    %eax -> %eax
 +28   L3 @0x00007fb988450300  48 09 c2             or     %rax %rdx -> %rdx
 +31   L3 @0x00007fb98844ebb8  48 8b 05 a2 7f 22 00 mov    <rel> 0x00007fbbcc59ae68[8byte] -> %rax
 +38   L3 @0x00007fb98844ea50  48 89 15 53 78 22 00 mov    %rdx -> <rel> 0x00007fbbcc59a720[8byte]
 +45   L3 @0x00007fb98844e448  48 8d 15 94 7f 22 00 lea    <rel> 0x00007fbbcc59ae68 -> %rdx
 +52   L3 @0x00007fb98844e2c8  49 89 d6             mov    %rdx -> %r14
 +55   L3 @0x00007fb98844e9e8  4c 2b 35 22 81 22 00 sub    <rel> 0x00007fbbcc59b000[8byte] %r14 -> %r14
 +62   L3 @0x00007fb98844e980  48 85 c0             test   %rax %rax
 +65   L3 @0x00007fb98844e818  48 89 15 18 8b 22 00 mov    %rdx -> <rel> 0x00007fbbcc59ba00[8byte]
 +72   L3 @0x00007fb98844da20  4c 89 35 01 8b 22 00 mov    %r14 -> <rel> 0x00007fbbcc59b9f0[8byte]
 +79   L3 @0x00007fb98844e518  0f 84 9b 00 00 00    jz     $0x00007fbbcc372f90
END 0x00007fbbcc372ea0

after instrumentation:
...
 +274  L3 @0x00007fb98844e818  48 89 15 18 8b 22 00 mov    %rdx -> <rel> 0x00007fbbcc59ba00[8byte]
...
END 0x00007fbbcc372ea0

and the next block doesn't touch it

before instrumentation:
TAG  0x00007fbbcc372ef5
 +0    L3 @0x00007fb988452670  48 8d 3d 64 81 22 00 lea    <rel> 0x00007fbbcc59b060 -> %rdi
 +7    L3 @0x00007fb98844e518  49 b9 d0 03 00 80 03 mov    $0x00000003800003d0 -> %r9
                               00 00 00
 +17   L3 @0x00007fb9884525f0  48 b9 70 fb ff 7f 03 mov    $0x000000037ffffb70 -> %rcx
                               00 00 00
 +27   L3 @0x00007fb988451768  41 b8 ff ff ff 6f    mov    $0x6fffffff -> %r8d
 +33   L3 @0x00007fb9884518b8  41 bb ff fd ff 6f    mov    $0x6ffffdff -> %r11d
 +39   L3 @0x00007fb988451850  bb ff fe ff 6f       mov    $0x6ffffeff -> %ebx
 +44   L3 @0x00007fb988451df8  49 01 f9             add    %rdi %r9 -> %r9
 +47   L3 @0x00007fb988451c90  41 ba 31 00 00 00    mov    $0x00000031 -> %r10d
 +53   L3 @0x00007fb988451c28  48 01 cf             add    %rcx %rdi -> %rdi
 +56   L3 @0x00007fb988452570  48 8d 35 fc 8a 22 00 lea    <rel> 0x00007fbbcc59ba30 -> %rsi
 +63   L3 @0x00007fb9884524f0  eb 25                jmp    $0x00007fbbcc372f5b
END 0x00007fbbcc372ef5

...

priv_mcontext_t @0x00007fb988424080
	xax = 0x000000000000000e
	xbx = 0x0000000000000000
	xcx = 0x0000000000000000
	xdx = 0x00007fbbcc59ae68

It is used by instrumentation in the next block but is restored

before instrumentation:
TAG  0x00007fbbcc372f5b
 +0    L3 @0x00007fb9884524f0  48 83 f8 21          cmp    %rax $0x0000000000000021
 +4    L3 @0x00007fb98844e980  76 ea                jbe    $0x00007fbbcc372f4b
END 0x00007fbbcc372f5b

 +18   m4 @0x00007fb988451920  65 48 89 14 25 e8 00 mov    %rdx -> %gs:0x000000e8[8byte]

 +97   m4 @0x00007fb988451b58  65 48 8b 14 25 e8 00 mov    %gs:0x000000e8[8byte] -> %rdx

priv_mcontext_t @0x00007fb988424080
	xax = 0x000000000000000e
	xbx = 0x000000006ffffeff
	xcx = 0x000000037ffffb70
	xdx = 0x00007fbbcc59ae68

Full log attached: log.0.55106.html.txt

prasun3 avatar Dec 08 '21 07:12 prasun3

I haven't looked at the details but if there is internal instrumentation control flow be sure that all paths are parallel, typically by putting all spills before any branches and all restores after all branches. That is one potential source of problems.

derekbruening avatar Dec 09 '21 00:12 derekbruening

Ok I see now that rdx is not restored when this jrcxz branch is taken. So ideally it should be restored after the jrcxz+jmp and before it gets to jbe?

 +69   m4 @0x00007fb988451700  e3 fe                jrcxz  @0x00007fb98844dee8[8byte] %rcx

 +74   m4 @0x00007fb98844dee8                       <label>
 +74   m4 @0x00007fb988452c58  e9 fb ff ff ff       jmp    @0x00007fb988452ad8[8byte]

 +219  m4 @0x00007fb988452ad8                       <label>
 +219  m4 @0x00007fb98844e818                       <label>
 +219  m4 @0x00007fb9884514c8                       <label>
 +219  m4 @0x00007fb9884529d8  65 48 8b 0c 25 e0 00 mov    %gs:0x000000e0[8byte] -> %rcx
                               00 00
 +228  L3 @0x00007fb98844e980  76 ea                jbe    $0x00007fbbcc372f4b

Without DR_CLEANCALL_READS_APP_CONTEXT, rcx and rdx are not restored before the clean call (which makes sense), but both are restored before the last instruction in the basic block.

With DR_CLEANCALL_READS_APP_CONTEXT, rcx and rdx are restored before the clean call, but only rcx is restored before the last instruction which leaves the tool value in rdx causing a segfault in the next basic block where it is used.

Is this due to incorrect handling in the tracer or in the instrumentation api?

instrument_basic_block ******************

before instrumentation:
TAG  0x00007fbbcc372f5b
 +0    L3 @0x00007fb9884524f0  48 83 f8 21          cmp    %rax $0x0000000000000021
 +4    L3 @0x00007fb98844e980  76 ea                jbe    $0x00007fbbcc372f4b
END 0x00007fbbcc372f5b

drreg_event_bb_analysis @0.0x00007fbbcc372f5f: rax=1 rcx=1 rdx=1 rbx=1 rsp=1 rbp=1 rsi=1 rdi=1 r8=1 r9=1 r10=1 r11=1 r12=1 r13=1 r14=1 r15=1 flags=287
drreg_event_bb_analysis @1.0x00007fbbcc372f5b: rax=1 rcx=1 rdx=1 rbx=1 rsp=1 rbp=1 rsi=1 rdi=1 r8=1 r9=1 r10=1 r11=1 r12=1 r13=1 r14=1 r15=1 flags=0
drreg_reserve_reg_internal @1.0x00007fbbcc372f5b: spilling rcx to slot 0
spill_reg @1.0x00007fbbcc372f5b rcx 0
mutex_delete 0 lock 0x00007fb9883e5ad0: name=dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
rank=43 owner=0 owning_dc=0x0000000000000000 contended_event=0xffffffff prev=0x0000000000000000
lock                             0                              0                               0                              0                               0+2 dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
remove_process_lock 0 lock 0x00007fb9883e5ad0: name=dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
rank=43 owner=0 owning_dc=0x0000000000000000 contended_event=0xffffffff prev=0x0000000000000000
lock                             0                              0                               0                              0                               0+2 dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
	never acquired
drreg_reserve_reg_internal @1.0x00007fbbcc372f5b: spilling rdx to slot 1
spill_reg @1.0x00007fbbcc372f5b rdx 1
drreg_unreserve_register @1.0x00007fbbcc372f5b rdx
drreg_unreserve_register @1.0x00007fbbcc372f5b rcx
drreg_reserve_reg_internal @0.0x00007fbbcc372f5f: using un-restored rcx slot 0
drreg_reserve_reg_internal @0.0x00007fbbcc372f5f: rcx already spilled to slot 0
mutex_delete 0 lock 0x00007fb9883e5ad0: name=dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
rank=43 owner=0 owning_dc=0x0000000000000000 contended_event=0xffffffff prev=0x0000000000000000
lock                             0                              0                               0                              0                               0+2 dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
remove_process_lock 0 lock 0x00007fb9883e5ad0: name=dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
rank=43 owner=0 owning_dc=0x0000000000000000 contended_event=0xffffffff prev=0x0000000000000000
lock                             0                              0                               0                              0                               0+2 dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
	never acquired
CLEANCALL: insert clean call to 0x00007fbb883b8898
drreg_event_clean_call_insertion: restoring for cleancall to read app regs
drreg_insert_restore_all @0.0x0000000000000000: restoring rcx for app read
spill_reg @0.0x0000000000000000 rcx 2
restore_reg @0.0x0000000000000000 rcx slot=0 release=0
restore_reg @0.0x00007fbbcc372f5f rcx slot=2 release=1
drreg_insert_restore_all @0.0x0000000000000000: lazily restoring rdx
drreg_restore_reg_now @0.0x0000000000000000: restoring rdx
restore_reg @0.0x0000000000000000 rdx slot=1 release=1
insert_parameter_preparation: 0 args, 6 in-reg, 0 pre-push, 0/0 stack
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rax already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rbx already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rsp already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rbp already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rsi already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rdi already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r8 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r9 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r10 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r11 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r12 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r13 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r14 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r15 already native
drreg_unreserve_register @0.0x00007fbbcc372f5f rcx
drreg_insert_restore_all @0.0x00007fbbcc372f5f: lazily restoring rcx
drreg_restore_reg_now @0.0x00007fbbcc372f5f: restoring rcx
restore_reg @0.0x00007fbbcc372f5f rcx slot=0 release=1

after instrumentation:
TAG  0x00007fbbcc372f5b
 +0    m4 @0x00007fb988451ff8  65 48 89 0c 25 e0 00 mov    %rcx -> %gs:0x000000e0[8byte]
                               00 00
 +9    m4 @0x00007fb988451ef8  65 48 8b 0c 25 20 01 mov    %gs:0x00000120[8byte] -> %rcx
                               00 00
 +18   m4 @0x00007fb988451920  65 48 89 14 25 e8 00 mov    %rdx -> %gs:0x000000e8[8byte]                # spill rdx
                               00 00
 +27   m4 @0x00007fb988451a70  48 ba 5b 1f 00 00 08 mov    $0x2004000800001f5b -> %rdx                  # populate OFFLINE_TYPE_PC
                               00 04 20
 +37   m4 @0x00007fb98844dd98  48 89 11             mov    %rdx -> (%rcx)[8byte]                        # save to buffer
 +40   m4 @0x00007fb98844da20                       <label>
 +40   m4 @0x00007fb988451530  48 8d 49 08          lea    0x08(%rcx) -> %rcx
 +44   m4 @0x00007fb9884515b0  65 48 89 0c 25 20 01 mov    %rcx -> %gs:0x00000120[8byte]
                               00 00
 +53   m4 @0x00007fb988451e78                       <label>
 +53   L3 @0x00007fb9884524f0  48 83 f8 21          cmp    %rax $0x0000000000000021
 +57   m4 @0x00007fb988452a58  65 48 8b 0c 25 20 01 mov    %gs:0x00000120[8byte] -> %rcx
                               00 00
 +66   m4 @0x00007fb988452b58  48 8b 09             mov    (%rcx)[8byte] -> %rcx
 +69   m4 @0x00007fb988451700  e3 fe                jrcxz  @0x00007fb98844dee8[8byte] %rcx              # jumps to end without restoring rdx
 +71   m4 @0x00007fb988451618  eb fe                jmp    @0x00007fb988451680[8byte]
 +73   m4 @0x00007fb988452bd8  90                   nop
 +74   m4 @0x00007fb98844dee8                       <label>
 +74   m4 @0x00007fb988452c58  e9 fb ff ff ff       jmp    @0x00007fb988452ad8[8byte]
 +79   m4 @0x00007fb988451680                       <label>
 +79   m4 @0x00007fb988451a08  65 48 89 0c 25 f0 00 mov    %rcx -> %gs:0x000000f0[8byte]
                               00 00
 +88   m4 @0x00007fb988451af0  65 48 8b 0c 25 e0 00 mov    %gs:0x000000e0[8byte] -> %rcx
                               00 00
 +97   m4 @0x00007fb988451b58  65 48 8b 14 25 e8 00 mov    %gs:0x000000e8[8byte] -> %rdx                # restore rdx
                               00 00
 +106  m4 @0x00007fb9884519a0  65 48 a3 00 00 00 00 mov    %rax -> %gs:0x00[8byte]
                               00 00 00 00
 +117  m4 @0x00007fb988451bc0  65 48 a1 20 00 00 00 mov    %gs:0x20[8byte] -> %rax
                               00 00 00 00
 +128  m4 @0x00007fb988451d78  48 89 60 18          mov    %rsp -> 0x18(%rax)[8byte]
 +132  m4 @0x00007fb988451f78  48 8b a0 48 09 00 00 mov    0x00000948(%rax)[8byte] -> %rsp
 +139  m4 @0x00007fb988452078  65 48 a1 00 00 00 00 mov    %gs:0x00[8byte] -> %rax
                               00 00 00 00
 +150  m4 @0x00007fb988452670  48 8d a4 24 f8 f6 ff lea    0xfffff6f8(%rsp) -> %rsp
                               ff
 +158  m4 @0x00007fb988451c28  e8 0b 25 1b 00       call   $0x00007fbb885fce00 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +163  m4 @0x00007fb988451c90                       <label>
 +163  m4 @0x00007fb988451df8  e8 a3 df f6 ff       call   $0x00007fbb883b8898 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +168  m4 @0x00007fb988451850  e8 0b 26 1b 00       call   $0x00007fbb885fcf00 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +173  m4 @0x00007fb9884518b8  65 48 a3 00 00 00 00 mov    %rax -> %gs:0x00[8byte]
                               00 00 00 00
 +184  m4 @0x00007fb988451768  65 48 a1 20 00 00 00 mov    %gs:0x20[8byte] -> %rax
                               00 00 00 00
 +195  m4 @0x00007fb9884525f0  48 8b 60 18          mov    0x18(%rax)[8byte] -> %rsp
 +199  m4 @0x00007fb98844e518  65 48 a1 00 00 00 00 mov    %gs:0x00[8byte] -> %rax
                               00 00 00 00
 +210  m4 @0x00007fb988452cd8                       <label>
 +210  m4 @0x00007fb9884517d0                       <label>
 +210  m4 @0x00007fb988451cf8  65 48 8b 0c 25 f0 00 mov    %gs:0x000000f0[8byte] -> %rcx
                               00 00
 +219  m4 @0x00007fb988452ad8                       <label>
 +219  m4 @0x00007fb98844e818                       <label>
 +219  m4 @0x00007fb9884514c8                       <label>
 +219  m4 @0x00007fb9884529d8  65 48 8b 0c 25 e0 00 mov    %gs:0x000000e0[8byte] -> %rcx
                               00 00
 +228  L3 @0x00007fb98844e980  76 ea                jbe    $0x00007fbbcc372f4b
END 0x00007fbbcc372f5b

Without DR_CLEANCALL_READS_APP_CONTEXT

after instrumentation:
TAG  0x00007f8c480a7f5b
 +0    m4 @0x00007f8a04186e78  65 48 89 0c 25 e0 00 mov    %rcx -> %gs:0x000000e0[8byte]
                               00 00
 +9    m4 @0x00007f8a04186d78  65 48 8b 0c 25 20 01 mov    %gs:0x00000120[8byte] -> %rcx
                               00 00
 +18   m4 @0x00007f8a041867d0  65 48 89 14 25 e8 00 mov    %rdx -> %gs:0x000000e8[8byte]
                               00 00
 +27   m4 @0x00007f8a04182fd0  48 ba 5b 1f 00 00 08 mov    $0x2004000800001f5b -> %rdx
                               00 04 20
 +37   m4 @0x00007f8a04186530  48 89 11             mov    %rdx -> (%rcx)[8byte]
 +40   m4 @0x00007f8a04187bd8                       <label>
 +40   m4 @0x00007f8a041865b0  48 8d 49 08          lea    0x08(%rcx) -> %rcx
 +44   m4 @0x00007f8a041864c8  65 48 89 0c 25 20 01 mov    %rcx -> %gs:0x00000120[8byte]
                               00 00
 +53   m4 @0x00007f8a04186cf8                       <label>
 +53   L3 @0x00007f8a04187370  48 83 f8 21          cmp    %rax $0x0000000000000021
 +57   m4 @0x00007f8a041879d8  65 48 8b 0c 25 20 01 mov    %gs:0x00000120[8byte] -> %rcx
                               00 00
 +66   m4 @0x00007f8a04183750  48 8b 09             mov    (%rcx)[8byte] -> %rcx
 +69   m4 @0x00007f8a04186618  e3 fe                jrcxz  @0x00007f8a04186680[8byte] %rcx
 +71   m4 @0x00007f8a04187ad8  eb fe                jmp    @0x00007f8a04186700[8byte]
 +73   m4 @0x00007f8a04187b58  90                   nop
 +74   m4 @0x00007f8a04186680                       <label>
 +74   m4 @0x00007f8a04183ab8  e9 fb ff ff ff       jmp    @0x00007f8a04183668[8byte]
 +79   m4 @0x00007f8a04186700                       <label>
 +79   m4 @0x00007f8a04186bc0  65 48 a3 00 00 00 00 mov    %rax -> %gs:0x00[8byte]
                               00 00 00 00
 +90   m4 @0x00007f8a04186a70  65 48 a1 20 00 00 00 mov    %gs:0x20[8byte] -> %rax
                               00 00 00 00
 +101  m4 @0x00007f8a04186af0  48 89 60 18          mov    %rsp -> 0x18(%rax)[8byte]
 +105  m4 @0x00007f8a04186b58  48 8b a0 48 09 00 00 mov    0x00000948(%rax)[8byte] -> %rsp
 +112  m4 @0x00007f8a04186a08  65 48 a1 00 00 00 00 mov    %gs:0x00[8byte] -> %rax
                               00 00 00 00
 +123  m4 @0x00007f8a04186df8  48 8d a4 24 f8 f6 ff lea    0xfffff6f8(%rsp) -> %rsp
                               ff
 +131  m4 @0x00007f8a04186ef8  e8 0b 25 1b 00       call   $0x00007f8c04331e00 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +136  m4 @0x00007f8a04187570                       <label>
 +136  m4 @0x00007f8a04187470  48 bf 5f 7f 0a 48 8c mov    $0x00007f8c480a7f5f -> %rdi
                               7f 00 00
 +146  m4 @0x00007f8a04186c28  e8 53 e0 f6 ff       call   $0x00007f8c040ed948 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +151  m4 @0x00007f8a04186c90  e8 0b 26 1b 00       call   $0x00007f8c04331f00 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +156  m4 @0x00007f8a04186850  65 48 a3 00 00 00 00 mov    %rax -> %gs:0x00[8byte]
                               00 00 00 00
 +167  m4 @0x00007f8a041868b8  65 48 a1 20 00 00 00 mov    %gs:0x20[8byte] -> %rax
                               00 00 00 00
 +178  m4 @0x00007f8a04186768  48 8b 60 18          mov    0x18(%rax)[8byte] -> %rsp
 +182  m4 @0x00007f8a041874f0  65 48 a1 00 00 00 00 mov    %gs:0x00[8byte] -> %rax
                               00 00 00 00
 +193  m4 @0x00007f8a041869a0                       <label>
 +193  m4 @0x00007f8a04186920                       <label>
 +193  m4 @0x00007f8a04183668                       <label>
 +193  m4 @0x00007f8a04187a58                       <label>
 +193  m4 @0x00007f8a04187958                       <label>
 +193  m4 @0x00007f8a04183b20  65 48 8b 0c 25 e0 00 mov    %gs:0x000000e0[8byte] -> %rcx
                               00 00
 +202  m4 @0x00007f8a041878d8  65 48 8b 14 25 e8 00 mov    %gs:0x000000e8[8byte] -> %rdx
                               00 00
 +211  L3 @0x00007f8a04183518  76 ea                jbe    $0x00007f8c480a7f4b
END 0x00007f8c480a7f5b

prasun3 avatar Dec 14 '21 16:12 prasun3

In the regular case, drreg_event_bb_insert_late calls drreg_insert_restore_all which restores both rcx and rdx. With DR_CLEANCALL_READS_APP_CONTEXT, drreg_event_clean_call_insertion has already restored rdx, so drreg_event_bb_insert_late skips this reg.

All of this happens inside drreg code which is not aware of the control flow inserted by the tracer and the tracer is not aware of which registers were marked native by dr_insert_clean_call_ex. So how should we handle this?

Regular case:

CLEANCALL: insert clean call to 0x00007f8c040ed948
insert_parameter_preparation: 1 args, 6 in-reg, 0 pre-push, 0/0 stack
in insert_conditional_skip_target: reg: rax
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rax already native
in insert_conditional_skip_target: reg: rbx
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rbx already native
in insert_conditional_skip_target: reg: rsp
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rsp already native
in insert_conditional_skip_target: reg: rbp
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rbp already native
in insert_conditional_skip_target: reg: rsi
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rsi already native
in insert_conditional_skip_target: reg: rdi
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rdi already native
in insert_conditional_skip_target: reg: r8
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r8 already native
in insert_conditional_skip_target: reg: r9
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r9 already native
in insert_conditional_skip_target: reg: r10
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r10 already native
in insert_conditional_skip_target: reg: r11
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r11 already native
in insert_conditional_skip_target: reg: r12
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r12 already native
in insert_conditional_skip_target: reg: r13
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r13 already native
in insert_conditional_skip_target: reg: r14
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r14 already native
in insert_conditional_skip_target: reg: r15
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r15 already native
drreg_unreserve_register @0.0x00007f8c480a7f5f rcx
drreg_insert_restore_all @0.0x00007f8c480a7f5f: lazily restoring rcx
drreg_restore_reg_now @0.0x00007f8c480a7f5f: restoring rcx
restore_reg @0.0x00007f8c480a7f5f rcx slot=0 release=1
drreg_insert_restore_all @0.0x00007f8c480a7f5f: lazily restoring rdx
drreg_restore_reg_now @0.0x00007f8c480a7f5f: restoring rdx
restore_reg @0.0x00007f8c480a7f5f rdx slot=1 release=1

With DR_CLEANCALL_READS_APP_CONTEXT

CLEANCALL: insert clean call to 0x00007fbb883b8898
drreg_event_clean_call_insertion: restoring for cleancall to read app regs
drreg_insert_restore_all @0.0x0000000000000000: restoring rcx for app read
spill_reg @0.0x0000000000000000 rcx 2
restore_reg @0.0x0000000000000000 rcx slot=0 release=0
restore_reg @0.0x00007fbbcc372f5f rcx slot=2 release=1
drreg_insert_restore_all @0.0x0000000000000000: lazily restoring rdx                |
drreg_restore_reg_now @0.0x0000000000000000: restoring rdx                          |---> rdx restored by drreg_event_clean_call_insertion
restore_reg @0.0x0000000000000000 rdx slot=1 release=1                              |
insert_parameter_preparation: 0 args, 6 in-reg, 0 pre-push, 0/0 stack
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rax already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rbx already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rsp already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rbp already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rsi already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rdi already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r8 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r9 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r10 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r11 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r12 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r13 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r14 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r15 already native
drreg_unreserve_register @0.0x00007fbbcc372f5f rcx
drreg_insert_restore_all @0.0x00007fbbcc372f5f: lazily restoring rcx
drreg_restore_reg_now @0.0x00007fbbcc372f5f: restoring rcx
restore_reg @0.0x00007fbbcc372f5f rcx slot=0 release=1

prasun3 avatar Jan 07 '22 10:01 prasun3

Looks like I was missing DR_CLEANCALL_MULTIPATH

prasun3 avatar Jan 10 '22 16:01 prasun3

I am seeing an assert in translate.c where it hits ASSERT_NOT_REACHED in recreate_app_state_from_ilist . This happens when we call dr_flush_region_ex . The application is a slightly modified version of threadsig (increased data footprint, no signals). I have run with 4 and 8 threads and haven't tried other thread counts.

Any idea what this assert means?

Thread 2 "threadsig2" hit Breakpoint 1, recreate_app_state_from_ilist (tdcontext=0x7ffdb3e401c0, ilist=0x7ffdb4081be0,
    start_app=0x7ffff6275ad0 "\017\030\216@\376\377\377\017\030\216\200\375\377\377f\017oF\360f\017oN\340f\017oV\320f\017o^\300f\017of\260f\017on\240f\017ov\220f\017o~\200H\215v\200H\201", <incomplete sequence \352\200>, start_cache=0x7fffb3fc4019 "eH\211\f", <incomplete sequence \340>,
    end_cache=0x7fffb3fc480d "\314\314\314\070Y\353\263\375\177", mc=0x7ffdb3f1d020, just_pc=false, flags=16778806) at /home/prasun/dynamorio/core/translate.c:1117
1117        ASSERT_NOT_REACHED();
(gdb) where
#0  recreate_app_state_from_ilist (tdcontext=0x7ffdb3e401c0, ilist=0x7ffdb4081be0,
    start_app=0x7ffff6275ad0 "\017\030\216@\376\377\377\017\030\216\200\375\377\377f\017oF\360f\017oN\340f\017oV\320f\017o^\300f\017of\260f\017on\240f\017ov\220f\017o~\200H\215v\200H\201", <incomplete sequence \352\200>, start_cache=0x7fffb3fc4019 "eH\211\f", <incomplete sequence \340>,
    end_cache=0x7fffb3fc480d "\314\314\314\070Y\353\263\375\177", mc=0x7ffdb3f1d020, just_pc=false, flags=16778806) at /home/prasun/dynamorio/core/translate.c:1117
#1  0x00000000712064d6 in recreate_app_state_internal (tdcontext=0x7ffdb3e401c0, mcontext=0x7ffdb3f1d020, just_pc=false, owning_f=0x0, restore_memory=false) at /home/prasun/dynamorio/core/translate.c:1480
#2  0x0000000071206aa4 in recreate_app_state (tdcontext=0x7ffdb3e401c0, mcontext=0x7ffdb3f1d020, restore_memory=false, f=0x0) at /home/prasun/dynamorio/core/translate.c:1623
#3  0x00000000711dcfc1 in at_safe_spot (trec=0x7ffdb3e40dc8, mc=0x7ffdb3f1d020, desired_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT_OR_NO_XFER) at /home/prasun/dynamorio/core/synch.c:613
#4  0x00000000711de4b2 in synch_with_thread (id=152173, block=false, hold_initexit_lock=true, caller_state=THREAD_SYNCH_NONE, desired_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT_OR_NO_XFER, flags=2) at /home/prasun/dynamorio/core/synch.c:1031
#5  0x00000000711dfc42 in synch_with_all_threads (desired_synch_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT_OR_NO_XFER, threads_out=0x71650000 <flush_threads>, num_threads_out=0x71650008 <flush_num_threads>, cur_state=THREAD_SYNCH_NO_LOCKS_NO_XFER, flags=2)
    at /home/prasun/dynamorio/core/synch.c:1411
#6  0x0000000071080f5e in flush_fragments_synchall_start (ignored=0x7ffdb3e3ca00, base=0x0, size=18446744073709551615, exec_invalid=false) at /home/prasun/dynamorio/core/fragment.c:5953
#7  0x00000000710830fe in flush_fragments_synch_unlink_priv (dcontext=0x7ffdb3e3ca00, base=0x0, size=18446744073709551615, own_initexit_lock=false, exec_invalid=false, force_synchall=true) at /home/prasun/dynamorio/core/fragment.c:6475
#8  0x0000000071084225 in flush_fragments_in_region_start (dcontext=0x7ffdb3e3ca00, base=0x0, size=18446744073709551615, own_initexit_lock=false, free_futures=false, exec_invalid=false, force_synchall=true) at /home/prasun/dynamorio/core/fragment.c:6776
#9  0x00000000710849d0 in flush_fragments_from_region (dcontext=0x7ffdb3e3ca00, base=0x0, size=18446744073709551615, force_synchall=true, flush_completion_callback=0x7fffb3d6ea75 <change_warmup_callback(void*)>, user_data=0x0)
    at /home/prasun/dynamorio/core/fragment.c:6853
#10 0x00000000711fd774 in dr_flush_region_ex (start=0x0, size=18446744073709551615, flush_completion_callback=0x7fffb3d6ea75 <change_warmup_callback(void*)>, user_data=0x0) at /home/prasun/dynamorio/core/lib/instrument.c:6797
#11 0x00007fffb3d6ee18 in hit_warmup_threshold (next_pc=0x7ffff6275b3d "s\221H\203\372\300H\215\222\200") at /home/prasun/dynamorio/clients/drcachesim/tracer/tracer.cpp:1931
#12 0x00007fffb3d6af07 in memtrace (drcontext=0x7ffdb3e3ca00, skip_size_cap=false, next_pc=0x7ffff6275b3d "s\221H\203\372\300H\215\222\200") at /home/prasun/dynamorio/clients/drcachesim/tracer/tracer.cpp:763
#13 0x00007fffb3d6b1fb in clean_call (next_pc=0x7ffff6275b3d "s\221H\203\372\300H\215\222\200") at /home/prasun/dynamorio/clients/drcachesim/tracer/tracer.cpp:851
#14 0x00007fffb3fc47a4 in ?? ()
#15 0x00007ffff546d000 in ?? ()
#16 0x00007ffff54ee000 in ?? ()
#17 0x00007ffff5d69ef0 in ?? ()
#18 0x00007ffdb3f1e720 in ?? ()
#19 0x0000000000000000 in ?? ()

prasun3 avatar Feb 10 '22 16:02 prasun3

Translation failure. If it is reproducible, I would run with logging and find the instruction that fails to translate. Is it DR mangling and a case is missing in the translation code? Is it tool instrumentation?

derekbruening avatar Feb 10 '22 19:02 derekbruening

It doesn't look like it is failing to translate a specific instruction. I think it is unable to find the mcontext->pc in the ilist.

I have attached the log: translate-failure.txt .

prasun3 avatar Feb 11 '22 10:02 prasun3

I was clearing op_L0_filter before calling dr_flush_region_ex. A fragment was getting re-translated on another thread while dr_flush_region_ex was in progress. Therefore the fragment length at fcache_fragment_pclookup was inconsistent with the fragment definition at recreate_app_state_from_ilist.

Clearing op_L0_filter after dr_flush_region_ex fixed the issue since the regenerated fragment is identical to the earlier fragment.

prasun3 avatar Feb 22 '22 04:02 prasun3

I now have some basic runs start going through. I can start sending some PRs if the approach below looks fine. It is not perfect but we could add this now and improve it gradually.

  • only offline traces are considered for now
  • Added an option named L0_warmup_refs
  • added a separate suffix for warmup trace .warmup.raw instead of .raw
  • seeing some crashes with exit_after_tracing so not allowing it with L0_warmup_refs for now
  • Added a message to indicate when max_trace_size is reached (printed only once)
  • ignore size_cap in warmup mode since that applies only to instr trace
  • first thread to reach warmup threshold
    • writes footer and closes warmup trace
    • opens inst trace and writes header
    • sets flag indicating warmup completed and code cache flush started
    • initiates code cache flush
    • in the cache flush completion callback
      • clear op_L0_filter
      • clear memref_needs_full_info_
      • disables warmup instrumentation
      • enables inst trace instrumentation
      • sets flag indicating cache flush done
    • since switch to inst trace occurs in the middle of a basic block, while writing the buffer to disk we need to drop entries until we get to a PC record
  • other threads drop their buffers if they see code cache flush in progress
    • rare case where we started writing to a buffer in warmup mode and code cache flush was completed by by the time we got to the clean call to write it to disk
    • seen only for the main thread which is not doing much work and contains only header and footer
    • added handling to avoid post-processing failures
      • updated append_unit_header to look at the current value of op_L0_filter; instead it should look at how much space was reserved
      • could drop such buffers also
    • could add more detailed handling later if necessary
  • Added PC as an argument to clean_call so that we only do code cache flush if we have the PC which needs to be supplied to dr_redirect_execution in mcontext
  • disabled short_reaches for now as it was causing the encode_cti assert with warmup instrumentation
    • will need a better approach long term; we can add a flag to control this for now

prasun3 avatar Feb 23 '22 07:02 prasun3

My main question here would be: should this feature use drbbdup for the two phases, rather than flushing?

We are currently implementing #3995 using the drbbdup library to provide multiple instrumentation cases dispatched depending on the current phase: count instructions, or produce a trace. This changes the current delayed tracing feature to no longer use a flush and use drbbdup instead.

Some of the problems are similar between the two:

  • How to flush other thread's buffers at the phase transition
  • How online traces handle phases

other threads drop their buffers if they see code cache flush in progress

How do other threads know a flush is in progress? Do they throw away the data and so there is a gap between the warmup and the full phase?

We are currently discussing how best to handle this for #3995. Will update that issue with the current plan.

rare case where we started writing to a buffer in warmup mode and code cache flush was completed by by the time we got to the clean call to write it to disk seen only for the main thread which is not doing much work and contains only header and footer

Not sure I understand this: I thought every thread's half-full buffer was discarded? Do you mean a thread might have never written its first buffer-full and so have empty output?

derekbruening avatar Feb 23 '22 16:02 derekbruening

My main question here would be: should this feature use drbbdup for the two phases, rather than flushing?

Yes I had considered that initially (https://github.com/DynamoRIO/dynamorio/issues/3995#issuecomment-825748234) and we had discussed it also (https://groups.google.com/g/dynamorio-users/c/fn6LBUxTd2o/m/98l0bK1BAQAJ). At the time we decided that flushing was okay for now since there was some work needed for bbdup. Sounds like it is much better shape now?

How do other threads know a flush is in progress? Do they throw away the data and so there is a gap between the warmup and the full phase?

They check a flag when they are in the memtrace clean call. The flag is set by the thread that initiated the switch. Yes they throw away the data and there is a gap between the two phases.

Not sure I understand this: I thought every thread's half-full buffer was discarded? Do you mean a thread might have never written its first buffer-full and so have empty output?

In this case the thread was mostly idle, so the buffer had a header and discarding it completely would create an incorrectly formatted trace.

prasun3 avatar Feb 24 '22 04:02 prasun3

PR #5393 added drbbdup to drmemtrace for delayed tracing; PR #5402 is adding alternating trace-notrace windows. Per https://github.com/DynamoRIO/dynamorio/pull/5393#discussion_r820264882 the plan is to then refactor tracer.cpp to split into more manageable pieces, if you were interested in switching to a drbbdup-based scheme.

derekbruening avatar Mar 08 '22 14:03 derekbruening

I made a change to introduce an 'L0_filter' mode so that we can switch from warmup trace to the full trace. I am only creating this if flag need_l0_filter_mode is set. I will set it in the proposed `L0_warmup' mode. How does this sound?

prasun3 avatar Nov 08 '22 05:11 prasun3

I will take a look at the code.

The drbbdup modes are very convenient but note that there is still some missing stability work in getting drbbdup scaled up: there is one known stability issue #5686 for scatter/gather plus at least one more similar one not yet figured out. Has drbbdup worked for your workload targets or have you also seen stability issues?

derekbruening avatar Nov 08 '22 14:11 derekbruening

@prasun3 In your design, the final trace will be a combination of the warmup trace and full trace. Would you also want to split the two parts into two separate traces? Or would you continue to have a single combined trace?

abhinav92003 avatar Nov 16 '22 16:11 abhinav92003

@abhinav92003 In my first implementation I was flushing the code cache. I am redoing the code changes to use drbbdup. In the older implementation I am creating separate traces. The first thread to reach the warmup limit initiates the switch. The other threads drop any buffers that contain a mix of warmup and full trace. Due to this we have a gap in between the warmup and full trace.

Since I am re-doing this with drbbdup, I am open to either keeping it combined or split it. The problem with keeping a single file is that raw2trace gets confused between the two types of entries. I saw something about a FILTER_ENDPOINT marker in #5742. I'd be interested in learning how to insert that marker precisely in all the thread buffers.

Has drbbdup worked for your workload targets or have you also seen stability issues?

@derekbruening sorry forgot to reply to this. We haven't done much testing with this mode.

prasun3 avatar Nov 21 '22 13:11 prasun3

The record_filter tool implemented in #5675 allows filtering records in a stored offline trace. It has a feature to perform cache filtering till a certain timestamp is reached in the trace, which could be useful to create a warm-up trace preceding the full trace.

In #5738, I was extending the record_filter tool to also split the trace into two parts (warmup and full trace), but we decided to go a different direction where the trace reader figures out the end of the warmup region using the new FILTER_ENDPOINT marker inserted during filtering (this obviates the need to have actually separated trace files).

Since all of this filtering is done after raw2trace, we avoid some complexities related to raw2trace getting confused.

I see that your design does the filtering online. One benefit I can think of is that your design also reduces tracing overhead while collecting the warm up trace (due to writing out fewer trace entries to memory).

abhinav92003 avatar Dec 02 '22 23:12 abhinav92003

Ok, so in your case you have one 'full trace' raw file which is post-processed to produced a trace.gz file. Then this trace.gz file is further filtered to produce a new trace.gz file containing the following:

  1. filtered entries
  2. then a FILTER_ENDPOINT marker
  3. and then followed by the full trace entries

Is this correct?

Yes, by filtering online we want to reduce tracing overhead as well as disk usage which can be quite large at larger thread counts.

When we change tracing_mode inside a clean call, when do we start executing from the new instrumentation? Do we 'return' from the clean call to the new instrumentation, or does the switch happen at the next dispatch? When do the other threads do the switch? I need to understand this in order to insert the endpoint marker at the appropriate point in the trace.

prasun3 avatar Dec 09 '22 07:12 prasun3

How is trace_for_instrs expected to work with L0_filter - is trace_for_instrs expected to count before the filtering occurs, or after?

Here's a simple program that mostly hits in the L0 cache.

$ cat loop.c
int main()
{
    for (int i = 0; i < 1000000; i++)
    {
    }
}

$ gcc -o loop loop.c

This output shows that it runs for around 3 million instructions which mostly hit in the cache if we run with L0_filter.

$ build/bin64/drrun -t drcachesim -simulator_type basic_counts -- ./loop 2>&1 | awk '/total/ && $1 > 0'
     3160699 total (fetched) instructions
        9303 total unique (fetched) instructions
         356 total non-fetched instructions
     2039556 total data loads
     1010074 total data stores
           1 total threads
       39332 total scheduling markers
           4 total other markers

$ build/bin64/drrun -t drcachesim -L0_filter -simulator_type basic_counts -- ./loop 2>&1 | awk '/total/ && $1 > 0'
         984 total (fetched) instructions
         935 total unique (fetched) instructions
        2833 total data loads
         637 total data stores
           1 total threads
          90 total scheduling markers
          50 total other markers

Is trace_for_instrs expected to count before the filtering occurs, or after? If I trace for 2M instructions, then I can see that we don't hit the window limit in the L0_filter mode.

$ build/bin64/drrun -t drcachesim -trace_for_instrs 2000000 -offline -- ./loop
Hit tracing window #0 limit: disabling tracing.
done

$ build/bin64/drrun -t drcachesim -L0_filter -trace_for_instrs 2000000 -offline -- ./loop
done

I see in output.cpp that instruction counting is done using instru->get_instr_count

        if (op_trace_for_instrs.get_value() > 0) {
            for (mem_ref = data->buf_base + header_size; mem_ref < buf_ptr;
                 mem_ref += instru->sizeof_entry()) {
                if (!window_changed && !hit_window_end &&
                    op_trace_for_instrs.get_value() > 0) {
                    hit_window_end =
                        count_traced_instrs(drcontext, instru->get_instr_count(mem_ref));

In instru_offline.cpp, I see that it counts the instruction count from the memref. This will work for the full trace but not for the filtered trace. Shouldn't we be using MEMTRACE_TLS_OFFS_ICOUNT instead of instru->get_instr_count in this case?

int
offline_instru_t::get_instr_count(byte *buf_ptr) const
{
    offline_entry_t *entry = (offline_entry_t *)buf_ptr;
    if (entry->addr.type != OFFLINE_TYPE_PC)
        return 0;
    // TODO i#3995: We should *not* count "non-fetched" instrs so we'll match
    // hardware performance counters.
    // Xref i#4948 and i#4915 on getting rid of "non-fetched" instrs.
    return entry->pc.instr_count;
}

prasun3 avatar Dec 09 '22 07:12 prasun3

Is this correct?

Yes.

When we change tracing_mode inside a clean call, when do we start executing from the new instrumentation?

After the global tracing_mode flag is changed by any one thread, tracing will begin for a thread whenever it enters the next fragment. Note that the next fragment doesn't need to be new; it could be an existing fragment in the code cache. It is not required to enter dispatch for tracing the begin.

abhinav92003 avatar Dec 09 '22 16:12 abhinav92003

I need to understand this in order to insert the endpoint marker at the appropriate point in the trace.

It won't be simple to insert this during live tracing as the drbbdup modes do not have a notion of "first execution": we would have to add that somehow. There's no simple way to detect this in post-processing either. For separation of tracing windows we do trigger the end-of-bb clean call on a window change using a conditional-free scheme of writing a difference in TLS value or something like that (see instrument_clean_call()). We could consider something like that; maybe even using the window code directly by considering the filter-trace transition a window change, but that may bring other complexities so it may not work out.

derekbruening avatar Dec 09 '22 17:12 derekbruening

How is trace_for_instrs expected to work with L0_filter - is trace_for_instrs expected to count before the filtering occurs, or after?

Maybe it would be expected to be like -warmup_refs and -sim_refs and so it would start counting after filtering is done? Wait -- that would be if we had a -filter_for_instrs. Is that what we ended up with? Is your question about -trace_for_instrs plus -filter_for_instrs?

derekbruening avatar Dec 09 '22 17:12 derekbruening

For separation of tracing windows we do trigger the end-of-bb clean call on a window change using a conditional-free scheme of writing a difference in TLS value or something like that (see instrument_clean_call()). We could consider something like that;

Thanks. I'll take a look at this.

How is trace_for_instrs expected to work with L0_filter - is trace_for_instrs expected to count before the filtering occurs, or after?

Maybe it would be expected to be like -warmup_refs and -sim_refs and so it would start counting after filtering is done? Wait -- that would be if we had a -filter_for_instrs. Is that what we ended up with? Is your question about -trace_for_instrs plus -filter_for_instrs?

Yes, I moved to -filter_for_instrs. I was using -filter_for_refs earlier and for some microbenchmarks with small memory footprint we never get out of the filter phase. In this case it might be better to use filter_for_instrs if that option counts before filtering.

We might also want to get the full instruction trace at the same point with and without the warmup trace. Counting instructions before filtering would be useful for this.

I also saw this comment in append_unit_header in output.cpp. What is meant here by "aligning filtered traces with unfiltered traces"?

// Per-buffer should be sufficient as markers to align filtered traces
// with unfiltered traces, and is much lower overhead.

prasun3 avatar Dec 12 '22 04:12 prasun3

I also saw this comment in append_unit_header in output.cpp. What is meant here by "aligning filtered traces with unfiltered traces"?

// Per-buffer should be sufficient as markers to align filtered traces
// with unfiltered traces, and is much lower overhead.

This is for having instr counts for computing MPKI to compare filtered with unfiltered via simulator metrics, for overall metrics as well as comparisons at particular points mid-run: so aligning across two separate runs in a simulator.

derekbruening avatar Dec 12 '22 18:12 derekbruening