dynamorio icon indicating copy to clipboard operation
dynamorio copied to clipboard

Tests failing with "Encoding size 2 != instr size 1": drcacheoff.invariant_checker, drcachesim.opcode_mix

Open derekbruening opened this issue 1 year ago • 11 comments

Happened on x86-64 here: https://github.com/DynamoRIO/dynamorio/actions/runs/6113660426/job/16593592056?pr=6302

390: CMake Error at /home/runner/work/dynamorio/dynamorio/suite/tests/runmulti.cmake:111 (message):
390:   *** postcmd failed (Subprocess aborted): Encoding size 2 != instr size 1
390: 
390:   drcachesim:
390:   /home/runner/work/dynamorio/dynamorio/clients/drcachesim/reader/reader.cpp:203:
390:   virtual bool dynamorio::drmemtrace::reader_t::process_input_entry():
390:   Assertion `false' failed.
390: 
390:   ***
390: 
390: Call Stack (most recent call first):
390:   /home/runner/work/dynamorio/dynamorio/suite/tests/runmulti.cmake:123 (process_cmdline)
390: 
390: 
389/471 Test #390: code_api|tool.drcacheoff.invariant_checker .......................***Failed    2.08 sec

derekbruening avatar Sep 07 '23 19:09 derekbruening

Happened in the x86-32 run of tool.drcachesim.opcode_mix on the PR #6313: https://github.com/DynamoRIO/dynamorio/actions/runs/6227837689/job/16903332100?pr=6313

abhinav92003 avatar Sep 19 '23 16:09 abhinav92003

Actually I can reproduce this on my wkst on off.invar so I will try to figure it out.

derekbruening avatar Sep 19 '23 16:09 derekbruening

       52953       42999:     1958180 ifetch       6 byte(s) @ 0x00007fe4a353703d 48 3d 00 f0 ff ff    cmp    %rax, $0xfffff000
       52954       43000:     1958180 ifetch       2 byte(s) @ 0x00007fe4a3537043 77 33                jnbe   $0x00007fe4a3537078 (untaken)
       52955       43000:     1958180 <marker: chunk footer #42>
Encoding size 2 != instr size 1
drcachesim: /usr/local/google/home/bruening/dr/git/src/clients/drcachesim/reader/reader.cpp:203: virtual bool dynamorio::drmemtrace::reader_t::process_input_entry(): Assertion `false' failed.
$ unzip -p suite/tests/tool.drcacheoff.invariant_checker.simple_app.1958180.1038.dir/trace/t*.zip chunk.0043 | od -A x -t x2 -w12 | awk '{printf "%s | %s %s %s%s%s%s\n", $1, $2, $3, $7, $6, $5, $4}' | head
000000 | 001c 0016 000000000000cedb
00000c | 001c 0002 002f644f0958421d
000018 | 001c 0003 0000000000000000
000024 | 002f 0001 00000000000000c3  encoding
000030 | 001c 001d 00007fe4a3521d46  branch target
00003c | 002f 0001 00000000000000c3  encoding repeated!
000048 | 0010 0001 00007fe4a3537045

The encoding is repeated across the indirect branch marker, and the encoding entries are combined with no instr in between. So the problem is in raw2trace inserted another encoding around the new marker type.

derekbruening avatar Sep 19 '23 16:09 derekbruening

I reproduced in raw2trace_unit_tests: requires a chunk splitting a branch;ind-branch pair:

[drmemtrace]: Creating new chunk #2 for thread 1                        
[drmemtrace]: Chunk #3 ord marker 20                                                      
[drmemtrace]: Adding post-chunk-boundary encoding entry for decode=0x560ca99ed582 app=0x1a
[drmemtrace]: Appended encoding entry for 0x560ca99ed582 sz=1 0x000000c3... 
[drmemtrace]: Chunk instr count is now 1                                                  
[drmemtrace]: Appended encoding entry for 0x560ca99ed576 sz=3 0x00c28b48... 
[drmemtrace]: Appended instr fetch for original 0xe                                  
<...>                                               
35 type: 28 size: 21 val: 1                                                               
36 type: 28 size: 22 val: 20                                                              
37 type: 28 size: 2 val: 5                                                                
38 type: 28 size: 3 val: 0                                                                
39 type: 47 size: 1 val: 195            
40 type: 28 size: 29 val: 14                                                              
41 type: 47 size: 1 val: 195                                                              
42 type: 16 size: 1 val: 26                         

After fix in unit test:

[drmemtrace]: Creating new chunk #2 for thread 1                              
[drmemtrace]: Chunk #3 ord marker 20                                                      
[drmemtrace]: Chunk instr count is now 1                                                  
[drmemtrace]: Appended encoding entry for 0x5620d9117576 sz=3 0x00c28b48... 
[drmemtrace]: Appended instr fetch for original 0xe                                     
<...>
35 type: 28 size: 21 val: 1                                                               
36 type: 28 size: 22 val: 20                                                              
37 type: 28 size: 2 val: 5                                                                
38 type: 28 size: 3 val: 0                                                                
39 type: 47 size: 1 val: 195                                                              
40 type: 28 size: 29 val: 14                                                              
41 type: 16 size: 1 val: 26             

And the invariant test also works:

$ rm -rf suite/tests/tool.drcacheoff.invariant_checker.simple_app.1958180.1038.dir/trace; bin64/drrun -t drcachesim -chunk_instr_count 1000 -indir suite/tests/tool.drcacheoff.invariant_checker.simple_app.1958180.1038.dir -simulator_type view -sim_refs 14 -skip_instrs 42997
Output format:
<--record#-> <--instr#->: <---tid---> <record details>
------------------------------------------------------------
       52944       42997:     1958180 <marker: timestamp 13339614527339825>
       52945       42997:     1958180 <marker: tid 1958180 on core 11>
       52946       42998:     1958180 ifetch       2 byte(s) @ 0x00007fe4a353703b 0f 05                syscall
       52947       42998:     1958180 <marker: timestamp 13339614527373839>
       52948       42998:     1958180 <marker: tid 1958180 on core 0>
       52949       42998:     1958180 <marker: system call 257>
       52950       42998:     1958180 <marker: maybe-blocking system call>
       52951       42998:     1958180 <marker: timestamp 13339614527373853>
       52952       42998:     1958180 <marker: tid 1958180 on core 0>
       52953       42999:     1958180 ifetch       6 byte(s) @ 0x00007fe4a353703d 48 3d 00 f0 ff ff    cmp    %rax, $0xfffff000
       52954       43000:     1958180 ifetch       2 byte(s) @ 0x00007fe4a3537043 77 33                jnbe   $0x00007fe4a3537078 (untaken)
       52955       43000:     1958180 <marker: chunk footer #42>
       52956       43001:     1958180 ifetch       1 byte(s) @ 0x00007fe4a3537045 c3                   ret
       52957       43001:     1958180 read         8 byte(s) @ 0x00007ffc4cdf8638 by PC 0x00007fe4a3537045

derekbruening avatar Sep 20 '23 00:09 derekbruening

This still happens in drcachesim.opcode_mix: https://github.com/DynamoRIO/dynamorio/actions/runs/6489025607/job/17622554375?pr=6357

307: Hello, world!
307: Encoding size 2 != instr size 6
307: drcachesim: /home/runner/work/dynamorio/dynamorio/clients/drcachesim/reader/reader.cpp:203: virtual bool dynamorio::drmemtrace::reader_t::process_input_entry(): Assertion `false' failed.

derekbruening avatar Oct 12 '23 01:10 derekbruening

Happened again: https://github.com/DynamoRIO/dynamorio/actions/runs/8516627602/job/23325947839?pr=6749

309: Hello, world!
309: Encoding size 2 != instr size 6 for PC 0xf7b2157c at ord 253874 instr 176794 last_timestamp=0x4c653edf
309: drcachesim: /home/runner/work/dynamorio/dynamorio/clients/drcachesim/reader/reader.cpp:213: virtual bool dynamorio::drmemtrace::reader_t::process_input_entry(): Assertion `false' failed.
284/371 Test #309: code_api|tool.drcachesim.opcode_mix ..............................***Failed  

derekbruening avatar Apr 02 '24 14:04 derekbruening

Xref #3320 with weird online record reorderings over named pipes. This drcacesim.opcode_mix is an online run with -instr_encodings turned on.

derekbruening avatar Apr 02 '24 14:04 derekbruening

Failed one time on #6691.

edeiana avatar Apr 03 '24 19:04 edeiana

drcachesim.opcode_mix continues to fail periodically, in post-merge here: https://github.com/DynamoRIO/dynamorio/actions/runs/8572128463/job/23493944852

derekbruening avatar Apr 05 '24 16:04 derekbruening

drcachesim.opcode_mix failed here: https://github.com/DynamoRIO/dynamorio/actions/runs/8575475599/job/23504525032 https://github.com/DynamoRIO/dynamorio/actions/runs/8600139259/job/23564492380 Eg,

2024-04-05T20:45:19.2858331Z 310: Hello, world!
2024-04-05T20:45:19.2859316Z 310: Encoding size 2 != instr size 6 for PC 0xf72ff12e at ord 254549 instr 177282 last_timestamp=0xd12e7567
2024-04-05T20:45:27.1286674Z 285/771 Test #310: code_api|tool.drcachesim.opcode_mix .......................................................***Failed  Required r\
egular expression not found. Regex=[^Hello, world!
2024-04-05T20:45:27.1288417Z ---- <application exited with code 0> ----

xdje42 avatar Apr 08 '24 20:04 xdje42

drcachesim.opcode_mix failed here: https://github.com/DynamoRIO/dynamorio/actions/runs/8626033117/job/23643564020 https://github.com/DynamoRIO/dynamorio/actions/runs/8652727426/job/23726294897

xdje42 avatar Apr 12 '24 20:04 xdje42

Happened after 524 runs:

$ ctest --repeat-until-fail 10000 -R drcachesim.opcode_mix

----------------------------------------------------------
Hello, world!
Encoding size 2 != instr size 6 for PC 0xf7273edb at ord 206078 instr 146557 last_timestamp=0x657be903

Candidates from simple_app offline near the very end with PC page offset 0xedb:

      196961      140601:     1164739 ifetch       6 byte(s) @ 0xf71f1edb 0f 85 cb 00 00 00    jnz    $0xf71f1fac (untaken)

      203588      144926:     1164739 ifetch       6 byte(s) @ 0xf723dedb 8b 9f 0c 41 00 00    mov    0x0000410c(%edi), %ebx

Probably that last one:

      203584      144924:     1164739 ifetch       8 byte(s) @ 0xf723decd f0 0f b1 97 ec 40 00 lock cmpxchg %edx, 0x000040ec(%edi), %eax
      203584      144924:     1164739                                             00
      203585      144924:     1164739 read         4 byte(s) @ 0xf7360f20 by PC 0xf723decd
      203586      144924:     1164739 write        4 byte(s) @ 0xf7360f20 by PC 0xf723decd
      203587      144925:     1164739 ifetch       6 byte(s) @ 0xf723ded5 0f 85 25 01 00 00    jnz    $0xf723e000 (untaken)
      203588      144926:     1164739 ifetch       6 byte(s) @ 0xf723dedb 8b 9f 0c 41 00 00    mov    0x0000410c(%edi), %ebx
      203589      144926:     1164739 read         4 byte(s) @ 0xf7360f40 by PC 0xf723dedb
      203590      144927:     1164739 ifetch       2 byte(s) @ 0xf723dee1 85 db                test   %ebx, %ebx

Adding a little more info and re-running: it happened after 1345 runs.

Hello, world!
Encoding size 2 != instr size 6 for PC 0xf72d4edb at ord 206078 instr 146557 last_timestamp=0x9294503e enc: 0 0 inst type: 10

32-bit fits only 4 encoding bytes per record: so the jnz would have a 2nd encoding record with 0 0. Did the 2nd jnz encoding record arrive out of order, after the ifetch record for the mov? Is that possible: pipe buffer atomicity doesn't match kernel atomicity, or pipe buffer splitting is somehow messed up?

derekbruening avatar Aug 23 '24 02:08 derekbruening

Changing named_pipe_t::get_atomic_write_size() to return 512 makes this happen much more frequently: after 13, 14 runs:

Encoding size 2 != instr size 6 for PC 0xf72933d7 at ord 206075 instr 146557 last_timestamp=0xbd6cdb40 enc: ff ff inst type: 13
Encoding size 2 != instr size 6 for PC 0xf728f3d7 at ord 206075 instr 146557 last_timestamp=0xc5d5ea22 enc: ff ff inst type: 13

Aha: is_ok_to_split_before() thinks it's ok to split before an encoding record: it doesn't consider multiple consecutive encodings!

derekbruening avatar Aug 23 '24 02:08 derekbruening

With the fix, the test passes 10,000 times in a row with the 512 pipe setting:

$ ctest --repeat-until-fail 10000 -R drcachesim.opcode_mix
...
    Start 310: code_api|tool.drcachesim.opcode_mix
    Test #310: code_api|tool.drcachesim.opcode_mix ...   Passed    1.72 sec
    Start 310: code_api|tool.drcachesim.opcode_mix
1/1 Test #310: code_api|tool.drcachesim.opcode_mix ...   Passed    1.79 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 17341.30 sec

derekbruening avatar Aug 23 '24 13:08 derekbruening