dynamorio icon indicating copy to clipboard operation
dynamorio copied to clipboard

seeing more than one cpu id on one output stream with MAP_TO_RECORDED_OUTPUT

Open prasun3 opened this issue 2 years ago • 13 comments

Describe the bug

See https://groups.google.com/g/DynamoRIO-Users/c/Dsf4As64r7o

In some of my tests, I am seeing more than one cpu id on one output stream with MAP_TO_RECORDED_OUTPUT.

The value returned by stream->get_output_cpuid matches the initial CPU_ID marker values, but at some point on some streams, we see other CPU_ID values as shown below.

Example:

Output #5 processing CPU #7                   (from get_stream_cpuid)
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 7
...
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 5

To Reproduce Steps to reproduce the behavior:

  1. Collect a trace with drrun -t drcachesim -offline
  2. Use the scheduler infrastructure with the MAP_TO_RECORDED_OUTPUT option
  3. We see multiple CPU IDs on the same output stream which is not expected

Please also answer these questions:

  • What happens when you run without any client?
    • Not applicable
  • What happens when you run with debug build ("-debug" flag to drrun/drconfig/drinject)?
    • No difference

Expected behavior With the MAP_TO_RECORDED_OUTPUT option, each output stream maps to exactly one CPU

Screenshots or Pasted Text

Versions

  • What version of DynamoRIO are you using?
    • Commit 910ccb25a66175bcba3f349cb7a43c8da7d9f7c3
  • Does the latest build from https://github.com/DynamoRIO/dynamorio/releases solve the problem?
    • No
  • What operating system version are you running on? ("Windows 10" is not sufficient: give the release number.)
    • Ubuntu 18
  • Is your application 32-bit or 64-bit?
    • 64-bit

Additional context

prasun3 avatar Oct 10 '23 06:10 prasun3

It looks like this is the culprit, this thread with zero instructions:

$ bin64/drrun -t drcachesim -simulator_type view -only_thread 85300 -indir ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64
Output format:
<--record#-> <--instr#->: <---tid---> <record details>
------------------------------------------------------------
           1           0:       85300 <marker: version 6>
           2           0:       85300 <marker: filetype 0xe40>
           3           0:       85300 <marker: cache line size 64>
           4           0:       85300 <marker: chunk instruction count 10000000>
           5           0:       85300 <marker: page size 4096>
           6           0:       85300 <marker: timestamp 13341391202394706>
           7           0:       85300 <marker: tid 85300 on core 1>
           8           0:       85300 <marker: function==syscall #202>
           9           0:       85300 <marker: function return value 0x0>
          10           0:       85300 <marker: timestamp 13341391214912820>
          11           0:       85300 <marker: tid 85300 on core 0>
          12           0:       85300 <thread 85300 exited>
View tool results:
              0 : total instructions

All the scheduling is instruction-count-based so this degenerate thread has 2 entries both at instruction count 0:

$ unzip -p ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64/cpu_schedule.bin.zip 0 | od -t d8 -w32
0000000                85303    13341391204829725                    0                    0
0000040                85304    13341391205074867                    0                    0
0000100                85300    13341391214912820                    0                    0
0000140                85304    13341391214943240                    0               109071
0000200

$ unzip -p ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64/cpu_schedule.bin.zip 1 | od -t d8 -w32
0000000                85300    13341391202394706                    1                    0
0000040                85303    13341391206798536                    1                29985
0000100

The first question is, how did this thread end up having no instructions and still being output: I thought we had a filter to remove empty threads. That filter is based bytes written: so maybe the futex syscall exit (must have attached mid-futex as there's no entry) is what made it past the filter?

derekbruening avatar Oct 10 '23 20:10 derekbruening

Thanks for the debug! I added a comment here: https://github.com/DynamoRIO/dynamorio/pull/6356#issuecomment-1757043495

prasun3 avatar Oct 11 '23 07:10 prasun3

The second case from https://github.com/DynamoRIO/dynamorio/pull/6355#issuecomment-1759306715 is rep string:

Input    1 @ 13109305 refs,   3329113 instrs
ERROR: CPU marker 0 on core #1 differs from output stream CPU ID 1

$ for i in 0 1 12; do echo "Core #$i"; unzip -p ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64/cpu_schedule.bin.zip $i | od -t d8 -w32; done
Core #0
0000000               180525    13341574296169946                    0                    0
0000040
Core #1
0000000               180525    13341574300611346                    1              3329113
0000040
Core #12
0000000               180524    13341574296003803                   12                    0
0000040

$ bin64/drrun -t drcachesim -simulator_type view -only_thread 180525 -indir ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64 -skip_instrs 3329110 -sim_refs 30
Output format:
<--record#-> <--instr#->: <---tid---> <record details>
------------------------------------------------------------
    13109282     3329110:      180525 <marker: timestamp 13341574300611269>
    13109283     3329110:      180525 <marker: tid 180525 on core 0>
    13109284     3329111:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f20 48 89 c7             mov    %rax -> %rdi
    13109285     3329112:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f23 48 89 d6             mov    %rdx -> %rsi
    13109286     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109287     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6ae0 by PC 0x00007f25505e5f26
    13109288     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6ce0 by PC 0x00007f25505e5f26
    13109289     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109290     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6ae8 by PC 0x00007f25505e5f26
    13109291     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6ce8 by PC 0x00007f25505e5f26
    13109292     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109293     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6af0 by PC 0x00007f25505e5f26
    13109294     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6cf0 by PC 0x00007f25505e5f26
    13109295     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109296     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6af8 by PC 0x00007f25505e5f26
    13109297     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6cf8 by PC 0x00007f25505e5f26
    13109298     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109299     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6b00 by PC 0x00007f25505e5f26
    13109300     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6d00 by PC 0x00007f25505e5f26
    13109301     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109302     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6b08 by PC 0x00007f25505e5f26
    13109303     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6d08 by PC 0x00007f25505e5f26
    13109304     3329113:      180525 <marker: timestamp 13341574300611280>
    13109305     3329113:      180525 <marker: tid 180525 on core 0>
    13109306     3329113:      180525 <marker: timestamp 13341574300611346>
    13109307     3329113:      180525 <marker: tid 180525 on core 1>
    13109308     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109309     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6b10 by PC 0x00007f25505e5f26
    13109310     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6d10 by PC 0x00007f25505e5f26
    13109311     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx

All the rep strings have the same instr count: making it not work well as a divider point! Will have to consider how to handle this.

derekbruening avatar Oct 12 '23 19:10 derekbruening

If we eliminated the ifetch record for rep string (#4948, #4915) it doesn't help: the load/store records themselves could fill the buffer up and trigger cpuid markers in the middle.

derekbruening avatar Oct 12 '23 19:10 derekbruening

Want instr ord for fast skip. So have to store both instr and record count? And have skip_instructions() take in both and no longer always align at the start of an instruction?

This affects record-replay too: though the switch point would just always be the start of the rep string loop.

derekbruening avatar Oct 12 '23 19:10 derekbruening

Theoretically the same thing could happen with a ton of markers in between two instructions.

derekbruening avatar Oct 12 '23 19:10 derekbruening

Since the rep string loop could be quite long, it doesn't seem like we can just ignore this case unfortunately (for the marker case I would say it's not worth more effort than throwing away the offending cpuid marker in the middle).

derekbruening avatar Oct 12 '23 19:10 derekbruening

@prasun3 do you have an opinion on how important this is: if real-world rep string loops are generally not very long maybe never switching in the middle of one is ok? (There are still other aspects to sort out like record-replay with time quanta if we went this route.)

derekbruening avatar Oct 12 '23 22:10 derekbruening

@prasun3 do you have an opinion on how important this is: if real-world rep string loops are generally not very long maybe never switching in the middle of one is ok? (There are still other aspects to sort out like record-replay with time quanta if we went this route.)

I think not switching in the middle of a rep should be okay.

prasun3 avatar Oct 13 '23 04:10 prasun3

@prasun3 -- another question: what is your preference for whether each iteration of a rep string loop should have a regular instruction fetch record, or not? We had decided in the past that it should not; but an alternative solution here would be to revisit that, to have a separate instruction ordinal for each iteration: either with a regular fetch or even without.

derekbruening avatar Oct 13 '23 04:10 derekbruening

@prasun3 -- another question: what is your preference for whether each iteration of a rep string loop should have a regular instruction fetch record, or not? We had decided in the past that it should not; but an alternative solution here would be to revisit that, to have a separate instruction ordinal for each iteration: either with a regular fetch or even without.

We'd prefer if each iteration did not have an instr fetch record.

prasun3 avatar Oct 13 '23 04:10 prasun3

For the rep string:

Currently: scheduler records and replays the start only. Problems:

  • Wrong cpuid in record-as-traced
  • Instr quantum won't switch mid-loop
  • Time quantum will switch mid-loop but on replay would switch at start so won't match

Proposal #A: Consider not a big deal to not switch mid-loop

  • Hide the wrong cpuid: raw2trace deletes??
  • Time quantum delays mid-loop switch??

Proposal #B: Count non-fetched instrs in all instr ordinals

  • Confusing with instr records not matching ordinal

Proposal #C: Reverse earlier decisions and insert instr record for each iter

  • Won't match PMU counts
  • Seems unpopular when users are polled (e.g., https://github.com/DynamoRIO/dynamorio/issues/6354#issuecomment-1760798891)

Proposal #D: Store both an instruction ordinal and a record ordinal everywhere we store instruction ordinals today (as-traced and serial schedule files; record-replay files; range_t region-of-interest bounds; skip_instructions API) and update the skipping code to advance to mid-instruction points (any point?)

Proposal #E: Like #D but we store both an instruction ordinal and sub-instruction "load/store ordinal". This then does not rely on having unfetched instr records (which we have proposed removing) and has more structure than an asolute record ordinal.

Proposal #F: New counter of {fetched+unfetched} instructions used by the scheduler. But skipping and zipfile chunks are based on fetched only.

We've decided that Proposal #E is the winner. This involves bumping the version for a new field in all the schedule file records (the raw2trace-written format and the record-replay format).

derekbruening avatar Oct 25 '23 17:10 derekbruening

The rep string solution may also serve to support running instruction-filtered traces or portions of traces in the scheduler with instruction quanta, skips, and replayed schedules.

derekbruening avatar Oct 25 '23 17:10 derekbruening