seeing more than one cpu id on one output stream with MAP_TO_RECORDED_OUTPUT
Describe the bug
See https://groups.google.com/g/DynamoRIO-Users/c/Dsf4As64r7o
In some of my tests, I am seeing more than one cpu id on one output stream with MAP_TO_RECORDED_OUTPUT.
The value returned by stream->get_output_cpuid matches the initial CPU_ID marker values, but at some point on some streams, we see other CPU_ID values as shown below.
Example:
Output #5 processing CPU #7 (from get_stream_cpuid)
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 7
...
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 5
To Reproduce Steps to reproduce the behavior:
- Collect a trace with
drrun -t drcachesim -offline - Use the scheduler infrastructure with the MAP_TO_RECORDED_OUTPUT option
- We see multiple CPU IDs on the same output stream which is not expected
Please also answer these questions:
- What happens when you run without any client?
- Not applicable
- What happens when you run with debug build ("-debug" flag to drrun/drconfig/drinject)?
- No difference
Expected behavior With the MAP_TO_RECORDED_OUTPUT option, each output stream maps to exactly one CPU
Screenshots or Pasted Text
Versions
- What version of DynamoRIO are you using?
- Commit 910ccb25a66175bcba3f349cb7a43c8da7d9f7c3
- Does the latest build from https://github.com/DynamoRIO/dynamorio/releases solve the problem?
- No
- What operating system version are you running on? ("Windows 10" is not sufficient: give the release number.)
- Ubuntu 18
- Is your application 32-bit or 64-bit?
- 64-bit
Additional context
It looks like this is the culprit, this thread with zero instructions:
$ bin64/drrun -t drcachesim -simulator_type view -only_thread 85300 -indir ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64
Output format:
<--record#-> <--instr#->: <---tid---> <record details>
------------------------------------------------------------
1 0: 85300 <marker: version 6>
2 0: 85300 <marker: filetype 0xe40>
3 0: 85300 <marker: cache line size 64>
4 0: 85300 <marker: chunk instruction count 10000000>
5 0: 85300 <marker: page size 4096>
6 0: 85300 <marker: timestamp 13341391202394706>
7 0: 85300 <marker: tid 85300 on core 1>
8 0: 85300 <marker: function==syscall #202>
9 0: 85300 <marker: function return value 0x0>
10 0: 85300 <marker: timestamp 13341391214912820>
11 0: 85300 <marker: tid 85300 on core 0>
12 0: 85300 <thread 85300 exited>
View tool results:
0 : total instructions
All the scheduling is instruction-count-based so this degenerate thread has 2 entries both at instruction count 0:
$ unzip -p ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64/cpu_schedule.bin.zip 0 | od -t d8 -w32
0000000 85303 13341391204829725 0 0
0000040 85304 13341391205074867 0 0
0000100 85300 13341391214912820 0 0
0000140 85304 13341391214943240 0 109071
0000200
$ unzip -p ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64/cpu_schedule.bin.zip 1 | od -t d8 -w32
0000000 85300 13341391202394706 1 0
0000040 85303 13341391206798536 1 29985
0000100
The first question is, how did this thread end up having no instructions and still being output: I thought we had a filter to remove empty threads. That filter is based bytes written: so maybe the futex syscall exit (must have attached mid-futex as there's no entry) is what made it past the filter?
Thanks for the debug! I added a comment here: https://github.com/DynamoRIO/dynamorio/pull/6356#issuecomment-1757043495
The second case from https://github.com/DynamoRIO/dynamorio/pull/6355#issuecomment-1759306715 is rep string:
Input 1 @ 13109305 refs, 3329113 instrs
ERROR: CPU marker 0 on core #1 differs from output stream CPU ID 1
$ for i in 0 1 12; do echo "Core #$i"; unzip -p ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64/cpu_schedule.bin.zip $i | od -t d8 -w32; done
Core #0
0000000 180525 13341574296169946 0 0
0000040
Core #1
0000000 180525 13341574300611346 1 3329113
0000040
Core #12
0000000 180524 13341574296003803 12 0
0000040
$ bin64/drrun -t drcachesim -simulator_type view -only_thread 180525 -indir ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64 -skip_instrs 3329110 -sim_refs 30
Output format:
<--record#-> <--instr#->: <---tid---> <record details>
------------------------------------------------------------
13109282 3329110: 180525 <marker: timestamp 13341574300611269>
13109283 3329110: 180525 <marker: tid 180525 on core 0>
13109284 3329111: 180525 ifetch 3 byte(s) @ 0x00007f25505e5f20 48 89 c7 mov %rax -> %rdi
13109285 3329112: 180525 ifetch 3 byte(s) @ 0x00007f25505e5f23 48 89 d6 mov %rdx -> %rsi
13109286 3329113: 180525 ifetch 3 byte(s) @ 0x00007f25505e5f26 f3 48 a5 rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
13109287 3329113: 180525 read 8 byte(s) @ 0x00007f25519a6ae0 by PC 0x00007f25505e5f26
13109288 3329113: 180525 write 8 byte(s) @ 0x00007f25519a6ce0 by PC 0x00007f25505e5f26
13109289 3329113: 180525 ifetch 3 byte(s) @ 0x00007f25505e5f26 f3 48 a5 rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
13109290 3329113: 180525 read 8 byte(s) @ 0x00007f25519a6ae8 by PC 0x00007f25505e5f26
13109291 3329113: 180525 write 8 byte(s) @ 0x00007f25519a6ce8 by PC 0x00007f25505e5f26
13109292 3329113: 180525 ifetch 3 byte(s) @ 0x00007f25505e5f26 f3 48 a5 rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
13109293 3329113: 180525 read 8 byte(s) @ 0x00007f25519a6af0 by PC 0x00007f25505e5f26
13109294 3329113: 180525 write 8 byte(s) @ 0x00007f25519a6cf0 by PC 0x00007f25505e5f26
13109295 3329113: 180525 ifetch 3 byte(s) @ 0x00007f25505e5f26 f3 48 a5 rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
13109296 3329113: 180525 read 8 byte(s) @ 0x00007f25519a6af8 by PC 0x00007f25505e5f26
13109297 3329113: 180525 write 8 byte(s) @ 0x00007f25519a6cf8 by PC 0x00007f25505e5f26
13109298 3329113: 180525 ifetch 3 byte(s) @ 0x00007f25505e5f26 f3 48 a5 rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
13109299 3329113: 180525 read 8 byte(s) @ 0x00007f25519a6b00 by PC 0x00007f25505e5f26
13109300 3329113: 180525 write 8 byte(s) @ 0x00007f25519a6d00 by PC 0x00007f25505e5f26
13109301 3329113: 180525 ifetch 3 byte(s) @ 0x00007f25505e5f26 f3 48 a5 rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
13109302 3329113: 180525 read 8 byte(s) @ 0x00007f25519a6b08 by PC 0x00007f25505e5f26
13109303 3329113: 180525 write 8 byte(s) @ 0x00007f25519a6d08 by PC 0x00007f25505e5f26
13109304 3329113: 180525 <marker: timestamp 13341574300611280>
13109305 3329113: 180525 <marker: tid 180525 on core 0>
13109306 3329113: 180525 <marker: timestamp 13341574300611346>
13109307 3329113: 180525 <marker: tid 180525 on core 1>
13109308 3329113: 180525 ifetch 3 byte(s) @ 0x00007f25505e5f26 f3 48 a5 rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
13109309 3329113: 180525 read 8 byte(s) @ 0x00007f25519a6b10 by PC 0x00007f25505e5f26
13109310 3329113: 180525 write 8 byte(s) @ 0x00007f25519a6d10 by PC 0x00007f25505e5f26
13109311 3329113: 180525 ifetch 3 byte(s) @ 0x00007f25505e5f26 f3 48 a5 rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
All the rep strings have the same instr count: making it not work well as a divider point! Will have to consider how to handle this.
If we eliminated the ifetch record for rep string (#4948, #4915) it doesn't help: the load/store records themselves could fill the buffer up and trigger cpuid markers in the middle.
Want instr ord for fast skip. So have to store both instr and record count? And have skip_instructions() take in both and no longer always align at the start of an instruction?
This affects record-replay too: though the switch point would just always be the start of the rep string loop.
Theoretically the same thing could happen with a ton of markers in between two instructions.
Since the rep string loop could be quite long, it doesn't seem like we can just ignore this case unfortunately (for the marker case I would say it's not worth more effort than throwing away the offending cpuid marker in the middle).
@prasun3 do you have an opinion on how important this is: if real-world rep string loops are generally not very long maybe never switching in the middle of one is ok? (There are still other aspects to sort out like record-replay with time quanta if we went this route.)
@prasun3 do you have an opinion on how important this is: if real-world rep string loops are generally not very long maybe never switching in the middle of one is ok? (There are still other aspects to sort out like record-replay with time quanta if we went this route.)
I think not switching in the middle of a rep should be okay.
@prasun3 -- another question: what is your preference for whether each iteration of a rep string loop should have a regular instruction fetch record, or not? We had decided in the past that it should not; but an alternative solution here would be to revisit that, to have a separate instruction ordinal for each iteration: either with a regular fetch or even without.
@prasun3 -- another question: what is your preference for whether each iteration of a rep string loop should have a regular instruction fetch record, or not? We had decided in the past that it should not; but an alternative solution here would be to revisit that, to have a separate instruction ordinal for each iteration: either with a regular fetch or even without.
We'd prefer if each iteration did not have an instr fetch record.
For the rep string:
Currently: scheduler records and replays the start only. Problems:
- Wrong cpuid in record-as-traced
- Instr quantum won't switch mid-loop
- Time quantum will switch mid-loop but on replay would switch at start so won't match
Proposal #A: Consider not a big deal to not switch mid-loop
- Hide the wrong cpuid: raw2trace deletes??
- Time quantum delays mid-loop switch??
Proposal #B: Count non-fetched instrs in all instr ordinals
- Confusing with instr records not matching ordinal
Proposal #C: Reverse earlier decisions and insert instr record for each iter
- Won't match PMU counts
- Seems unpopular when users are polled (e.g., https://github.com/DynamoRIO/dynamorio/issues/6354#issuecomment-1760798891)
Proposal #D: Store both an instruction ordinal and a record ordinal everywhere we store instruction ordinals today (as-traced and serial schedule files; record-replay files; range_t region-of-interest bounds; skip_instructions API) and update the skipping code to advance to mid-instruction points (any point?)
Proposal #E: Like #D but we store both an instruction ordinal and sub-instruction "load/store ordinal". This then does not rely on having unfetched instr records (which we have proposed removing) and has more structure than an asolute record ordinal.
Proposal #F: New counter of {fetched+unfetched} instructions used by the scheduler. But skipping and zipfile chunks are based on fetched only.
We've decided that Proposal #E is the winner. This involves bumping the version for a new field in all the schedule file records (the raw2trace-written format and the record-replay format).
The rep string solution may also serve to support running instruction-filtered traces or portions of traces in the scheduler with instruction quanta, skips, and replayed schedules.