dynamorio ARM: `drutil_insert_get_mem_addr` no longer works correctly on AArch64

https://github.com/DynamoRIO/dynamorio/pull/5283 introduced a change to decoding of load instructions, as a result of that change load index registers can be decoded as w registers (as opposed to x registers as previously).

Not all places in DR are prepared to encounter a w register in a load index context. For example, this logic in drutil_insert_get_mem_addr_arm

        reg_id_t index = opnd_get_index(memref);

        // ...

        reg_id_t stolen = dr_get_stolen_reg();

        // ...

        } else if (index == stolen) {
            index = replace_stolen_reg(drcontext, bb, where, memref, dst, scratch,
                                       scratch_used);
        }

now doesn't recognize w index registers as stolen, because dr_get_stolen_reg() typically returns an x register.

This results in instrumentation being silently incorrect and drutil_insert_get_mem_addr_arm producing incorrect addresses.

Some other places besides drutil_insert_get_mem_addr_arm may also be affected by this change.

May 24 '22 14:05 petrochenkov

This may have resulted in incorrect addresses in AArch64 traces gathered since Jan 18. Is one way to check whether any instruction matches ldr.*w28 or str.*w28?

May 24 '22 19:05 derekbruening

This is rather disturbing. What can we do to avoid such regressions in the future?

Add more comprehensive regression tests for drutil_insert_get_mem_addr covering stolen register corner cases
Add checks for unmapped addresses to the invariant_checker trace analyzer: but this requires having metadata on the valid mapped regions, which are not gathered by the Github drmemtrace code today. This is no guarantee as the computed incorrect address could happen to land in a mapped region.

May 24 '22 19:05 derekbruening

Is one way to check whether any instruction matches ldr.*w28 or str.*w28?

The w28 must be an index, so it would be ldr.*, w28 or str.*, w28 for arm-style disassembly -- right? Planning to search through objdump output as an initial check.

May 25 '22 04:05 derekbruening

This is rather disturbing. What can we do to avoid such regressions in the future?

It is! Adding more comprehensive regression tests for drutil_insert_get_mem_addr() covering stolen register corner cases is definitely something we should do.

In general, are there areas of DR core and API code which rely heavily on the correctness of generated instructions which should also have more comprehensive test coverage? Can we identify a set of INSTR_CREATE_ macros and DR functions which need better test coverage.

Can we identify these code paths and improve error trapping with e.g. ASSERTs, to fail loudly rather than silently go undetected?

Does #2443 cover enough of this?

May 25 '22 16:05 AssadHashmi

So the #5283 fix itself was addressing incorrect addresses which I don't remember realizing when that fix went in: so older traces pre-Jan 18 also had a source of inaccuracies.

May 25 '22 16:05 derekbruening

So the #5283 fix itself was addressing incorrect addresses which I don't remember realizing when that fix went in: so older traces pre-Jan 18 also had a source of inaccuracies.

Could this be quantified? If we want to search for load/store instructions whose addresses were computed incorrectly prior to the #5283 fix what would we search for? Is it that any scaled index register was treated as not scaled at all?

May 25 '22 22:05 derekbruening

Yes, it's correlated with the extend- if the extend is UXTW or SXTW then it should be a W register , if it is SXTX or LSL/UXTX, it should be an X register

May 26 '22 10:05 joshua-warburton

Yes, it's correlated with the extend- if the extend is UXTW or SXTW then it should be a W register , if it is SXTX or LSL/UXTX, it should be an X register

But the extend scale is also incorrect, right? In the diff for PR #5283 I see things like this: At dis-a64.txt:13145 it was:

str    %x1 -> (%x2,%x3,uxtx #3)[8byte]

And now it's:

str    %x1 -> (%x2,%x3,lsl #3)[8byte]

So a trace gathered prior to that PR that had a store with a scale would have incorrectly ignored the scale and computed the wrong address? (I think uxtx #3 ignores the #3, i.e., it only extends and doesn't scale/shift?)

May 26 '22 13:05 derekbruening

Indeed, it is incorrect, but it as at least consistently incorrect. UXTX only appears where it should be LSL, and the correct registers can be derived from that if we want to find where the decode errors occurred pre #5283.

I'm not sure on the difference between uxtx #3 and lsl #3. I'll have to defer to @AssadHashmi on that

May 26 '22 15:05 joshua-warburton

I'm not sure on the difference between uxtx #3 and lsl #3. I'll have to defer to @AssadHashmi on that

My understanding is that UXTX shouldn't appear in X reg offsets because zero extending a 64 bit index value to 64 bits is meaningless, a no-op extension. decode_opnd_memreg_size() in #5283 just converted UXTX to the preferred alias LSL and opnd_base_disp_scale_disassemble() fixed it in disassembler.

So a trace gathered prior to that PR that had a store with a scale would have incorrectly ignored the scale and computed the wrong address? (I think uxtx #3 ignores the #3, i.e., it only extends and doesn't scale/shift?)

I don't think that's the case. The values used to create memref with opnd_create_base_disp_aarch64() are the same including scaled and extend_type but after #5283, extend_type is disassembled as LSL rather than UXTX.

May 26 '22 17:05 AssadHashmi

So a trace gathered prior to that PR that had a store with a scale would have incorrectly ignored the scale and computed the wrong address? (I think uxtx #3 ignores the #3, i.e., it only extends and doesn't scale/shift?)

I don't think that's the case. The values used to create memref with opnd_create_base_disp_aarch64() are the same including scaled and extend_type but after #5283, extend_type is disassembled as LSL rather than UXTX.

OK, so a trace prior to PR #5283 would only have an incorrect address related to that PR's changes if it had an X register as an index that really should have been a W and it had a non-zero value in the upper bits? The scaling and extension changes from that PR are purely cosmetic.

The PR description of The "extension" LSL #0 was being decoded as uxtx #0, combined with the diff showing changes that weren't with just #0, made it seem like things were being decoded incorrectly, not just being disassembled incorrectly.

I did a sanity test:

After PR #5283:

derek@dynamorio:~/dr/build$ rm -rf drmemtrace.*.dir; setarch `uname -m` -R bin64/drrun -stderr_mask 0 -t drcachesim -offline -- ~/dr/test/pr5283 && bin64/drrun -stderr_mask 0 -t drcachesim -indir drmemtrace.*.dir -simulator_type view
All done
...
       11: T1738749   0x0000000000400154  910003e2   add    %sp $0x0000 lsl $0x00 -> %x2
       12: T1738749   0x0000000000400158  d2800523   movz   $0x0029 lsl $0x00 -> %x3
       13: T1738749   0x000000000040015c  f8237841   str    %x1 -> (%x2,%x3,lsl #3)[8byte]
       14: T1738749     write 8 byte(s) @ 0xfffffffff188

Going back before the PR:

derek@dynamorio:~/dr/build$ rm -rf drmemtrace.*.dir; setarch `uname -m` -R bin64/drrun -stderr_mask 0 -t drcachesim -offline -- ~/dr/test/pr5283 && bin64/drrun -stderr_mask 0 -t drcachesim -indir drmemtrace.*.dir -simulator_type view
All done
...
       11: T1731912   0x0000000000400154  910003e2   add    %sp $0x0000 lsl $0x00 -> %x2
       12: T1731912   0x0000000000400158  d2800523   movz   $0x0029 lsl $0x00 -> %x3
       13: T1731912   0x000000000040015c  f8237841   str    %x1 -> (%x2,%x3,uxtx #3)[8byte]
       14: T1731912     write 8 byte(s) @ 0xfffffffff188

Assuming my ASLR-disabling test will get the same stack address (sure seems like it; too lazy to create a different constant address instead of using SP in a tiny asm test; maybe should have used a .text address and a load) it does seem that the lsl #3 vs uxtx #3 disassembly difference made no impact on the traced address.

May 26 '22 19:05 derekbruening

Through unrelated validation of virtual-to-physical address gathering, we discovered a source of incorrect addresses in our traces: instructions with w28 as the index register. It seems that the fix in #5511 did not actually result in correct addresses. (Incidentally, was that PR meant to close this issue? Was there other work that left it open?)

Creating a test case we can see that the addresses are wrong:

        mov      x28, sp
        mov      x27, sp
        add      x5, x27, w28, sxtw #3
        ldr      x4, [x27]
        ldr      x4, [x27, w28, sxtw #3]

I ran it without ASLR which just happens to not crash on that memref:

derek@dynamorio:~/dr/build$ pushd ~/dr/test; as -mcpu=cortex-a55 -o allasm_aarch64.o allasm_aarch64.s && gcc -static -o allasm_aarch64 allasm_aarch64.o -nostartfiles; popd
derek@dynamorio:~/dr/build$ rm -rf drmemtrace.*.dir; setarch `uname -m` -R bin64/drrun -t drcachesim -offline -- ~/dr/test/allasm_aarch64
derek@dynamorio:~/dr/build$ bin64/drrun -t drcachesim -indir drmemtrace.*.dir -simulator_type view 2>&1 | head -40

       14: T820058 ifetch       4 byte(s) @ 0x0000000000400158 910003fc   add    %sp $0x0000 lsl $0x00 -> %x28
       15: T820058 ifetch       4 byte(s) @ 0x000000000040015c 910003fb   add    %sp $0x0000 lsl $0x00 -> %x27
       16: T820058 ifetch       4 byte(s) @ 0x0000000000400160 8b3ccf65   add    %x27 %w28 sxtw $0x03 -> %x5
       17: T820058 ifetch       4 byte(s) @ 0x0000000000400164 f9400364   ldr    (%x27)[8byte] -> %x4
       18: T820058 read         8 byte(s) @ 0x0000fffffffff040 by PC 0x0000000000400164
       19: T820058 ifetch       4 byte(s) @ 0x0000000000400168 f87cdb64   ldr    (%x27,%w28,sxtw #3)[8byte] -> %x4
       20: T820058 read         8 byte(s) @ 0x0000ffffc0187040 by PC 0x0000000000400168

That recorded address 0x0000ffffc0187040 sure looks wrong. It should be:

(gdb) p/x 0x0000fffffffff040 + (((int) 0x0000fffffffff040)<<3)
$9 = 0xffffffff7240

Instru:

after instrumentation:
TAG  0x0000000000400158
 +0    m4 @0x0000fffdf804abb0  f940d784   ldr    +0x01a8(%x28)[8byte] -> %x4
 +4    m4 @0x0000fffdf804aa68  d2802b05   movz   $0x0158 lsl $0x00 -> %x5
 +8    m4 @0x0000fffdf804a9e8  f2e40185   movk   %x5 $0x200c lsl $0x30 -> %x5
 +12   m4 @0x0000fffdf804a968  f9000085   str    %x5 -> (%x4)[8byte]
 +16   m4 @0x0000fffdf804ab30  f9000085   <label note=0x0000000000000000>
 +16   m4 @0x0000fffdf804a8e8  91002084   add    %x4 $0x0008 lsl $0x0000000000000000 -> %x4
 +20   m4 @0x0000fffdf804a868  f900d784   str    %x4 -> +0x01a8(%x28)[8byte]
 +24   m4 @0x0000fffdf8049948  f900d784   <label note=0x0000000000000000>
 +24   L3 @0x0000fffdf8049160  910003fc   add    %sp $0x0000 lsl $0x00 -> %x28
 +28   L3 @0x0000fffdf804ac30  910003fb   add    %sp $0x0000 lsl $0x00 -> %x27
 +32   L3 @0x0000fffdf8049690  8b3ccf65   add    %x27 %w28 sxtw $0x03 -> %x5
 +36   m4 @0x0000fffdf804a720  f940d784   ldr    +0x01a8(%x28)[8byte] -> %x4
 +40   m4 @0x0000fffdf804a5d8  f900009b   str    %x27 -> (%x4)[8byte]
 +44   m4 @0x0000fffdf804a6a0  f900009b   <label note=0x0000000000000000>
 +44   m4 @0x0000fffdf804a510  91002084   add    %x4 $0x0008 lsl $0x0000000000000000 -> %x4
 +48   m4 @0x0000fffdf804a490  f900d784   str    %x4 -> +0x01a8(%x28)[8byte]
 +52   m4 @0x0000fffdf804a7a0  f900d784   <label note=0x0000000000000000>
 +52   L3 @0x0000fffdf80497e0  f9400364   ldr    (%x27)[8byte] -> %x4
 +56   m4 @0x0000fffdf804a300  f940d784   ldr    +0x01a8(%x28)[8byte] -> %x4
 +60   m4 @0x0000fffdf804a200  f900ab80   str    %x0 -> +0x0150(%x28)[8byte]
 +64   m4 @0x0000fffdf804a138  f9401b80   ldr    +0x30(%x28)[8byte] -> %x0
 +68   m4 @0x0000fffdf804a070  8b3ccf60   add    %x27 %w28 sxtw $0x0000000000000003 -> %x0
 +72   m4 @0x0000fffdf8049ff0  f9000080   str    %x0 -> (%x4)[8byte]
 +76   m4 @0x0000fffdf804a280  f9000080   <label note=0x0000000000000000>
 +76   m4 @0x0000fffdf8049f28  91002084   add    %x4 $0x0008 lsl $0x0000000000000000 -> %x4
 +80   m4 @0x0000fffdf8049ea8  f900d784   str    %x4 -> +0x01a8(%x28)[8byte]
 +84   m4 @0x0000fffdf804a3c8  f900d784   <label note=0x0000000000000000>
 +84   L3 @0x0000fffdf8049778  f87cdb64   ldr    (%x27,%w28,sxtw #3)[8byte] -> %x4

It's not restoring the app's x28 value, so the w28 it uses is DR's.

So all aarch64 traces have had incorrect addresses for w28 index registers, all this time.

Sep 09 '22 21:09 derekbruening