drmemory icon indicating copy to clipboard operation
drmemory copied to clipboard

CRASH umbra_client_faulty_redzone on GA CI 32-bit Windows

Open derekbruening opened this issue 4 years ago • 8 comments

This happened in the 2nd of 2 runs (re-ran): https://github.com/DynamoRIO/drmemory/pull/2373/checks?check_run_id=1801981894

2021-01-31T18:17:22.8577386Z 136: Test command: D:\a\drmemory\drmemory\build_drmemory-dbg-32\dynamorio\bin32\drrun.exe "-quiet" "-debug" "-client" "D:/a/drmemory/drmemory/build_drmemory-dbg-32/tests/umbra_client_faulty_redzone.client.dll" "0" "" "-msgbox_mask" "0" "-stderr_mask" "0xc" "--" "D:/a/drmemory/drmemory/build_drmemory-dbg-32/tests/umbra_app.exe"
2021-01-31T18:17:22.8578904Z 136: Test timeout computed to be: 60
2021-01-31T18:17:23.1433153Z 136: <Exception in cache 0x1bf93047 interpreting DR code 0x7006e79a>
2021-01-31T18:17:23.1434336Z 136: <Application D:\a\drmemory\drmemory\build_drmemory-dbg-32\tests\umbra_app.exe (3200).  Tool internal crash at PC 0x7006e79a.  Please report this at your tool's issue tracker.  Program aborted.
2021-01-31T18:17:23.1435367Z 136: 0xc0000005 0x00badcad 0x7006e79a 0x7006e79a 0x00000001 0x1c049e75
2021-01-31T18:17:23.1435732Z 136: Base: 0x70050000
2021-01-31T18:17:23.1436157Z 136: Registers: eax=0x00000000 ebx=0x002fa000 ecx=0x00000000 edx=0x704be000
2021-01-31T18:17:23.1436659Z 136: 	esi=0x00000000 edi=0x00000000 esp=0x004ff9d4 ebp=0x004ff9d4
2021-01-31T18:17:23.1437050Z 136: 	eflags=0x00010216
2021-01-31T18:17:23.1437373Z 136: version 8.0.18658, custom build
2021-01-31T18:17:23.1438672Z 136: -no_dynamic_options -client_lib 'D:\a\drmemory\drmemory\build_drmemory-dbg-32\tests\umbra_client_faulty_redzone.client.dll;0;' -client_lib32 'D:\a\drmemory\drmemory\build_drmemory-dbg-32\tests\umbra_client_faulty_redzone.client.dll;0;' -code_api -probe_api -msgbox_mask 0 -stderr_mask 12 -stack_size 56K -max_elide_jmp 0 -
2021-01-31T18:17:23.1440058Z 136: 0x004ff9d4 0x702cd536>
2021-01-31T18:17:23.1515566Z 135/145 Test #136: umbra_client_faulty_redzone .......***Failed  

derekbruening avatar Jan 31 '21 18:01 derekbruening

I don't have a Windows machine at hand and therefore it is difficult for me to see what exactly is going on. However, I had a look at the test's code, and I think I found a bug which I missed to fix earlier. Making a PR soon that will hopefully fix the issue. If not, we will just re-open it.

johnfxgalea avatar Jan 31 '21 23:01 johnfxgalea

I can reproduce this on my Win10 2004 machine. Below is some data at the point where it raises Exception in cache 0x1d883047 interpreting DR code 0x7111e79a. I'm wondering if this is a case of DR over-reacting. For takeover on Windows, DR starts at a few return instructions in dynamorio.dll code. If a client causes a fault that translates back to dynamorio.dll it looks like this callback.c code issuing this message treats it as a fatal DR crash. Does that theory as to what is happening make sense, that your client would deliberately raise a handled fault on an app stack pop? As to what to do about it: I guess just remove the whole code sequence around that message so it is then passed to the app? Maybe leave it if there is no client as an extra condition.

0:000> U 7111e79a
dynamorio!dynamorio_earliest_init_takeover_C+0x10a [d:\derek\drmemory\git\src\dynamorio\core\dynamo.c @ 3102]:
7111e79a 5d              pop     ebp
7111e79b c3              ret
7111e79c cc              int     3
7111e79d cc              int     3
7111e79e cc              int     3
7111e79f cc              int     3
dynamorio!dynamorio_protect [d:\derek\drmemory\git\src\dynamorio\core\dynamo.c @ 3116]:
7111e7a0 55              push    ebp
7111e7a1 8bec            mov     ebp,esp
0:000> U 1d883047
1d883047 888d00100000    mov     byte ptr [ebp+1000h],cl
1d88304d 5d              pop     ebp
1d88304e 648915d80e0000  mov     dword ptr fs:[0ED8h],edx
1d883055 0fb615846b8e1d  movzx   edx,byte ptr ds:[1D8E6B84h]
1d88305c 83fa00          cmp     edx,0
1d88305f 0f851f000000    jne     1d883084
1d883065 8d1424          lea     edx,[esp]
1d883068 648915e80e0000  mov     dword ptr fs:[0EE8h],edx
0:000> dv
              wrapper = struct _fragment_t
                  res = RECREATE_SUCCESS_STATE (0n2)
                    f = 0x1d855a5c
          faulting_pc = 0x1d883047 "???"
         raw_mcontext = struct _priv_mcontext_t
             mcontext = struct _priv_mcontext_t
                  cxt = 0x001ff7a0
             takeover = 0n1 ''
         known_source = 0n0 ''
forged_exception_addr = 0x00000000 ""
            pExcptRec = 0x001ff750
            fault_xsp = 0x001ffc00 ""
       thread_is_lost = 0n0 ''
                state = 0x1d8c1d70
             dcontext = 0x1d8aa040
0:000> dt pExcptRec
Local var @ 0x1d8c1d5c Type _EXCEPTION_RECORD*
0x001ff750 
   +0x000 ExceptionCode    : 0xc0000005
   +0x004 ExceptionFlags   : 0
   +0x008 ExceptionRecord  : (null) 
   +0x00c ExceptionAddress : 0x7111e79a Void
   +0x010 NumberParameters : 2
   +0x014 ExceptionInformation : [15] 1
0:000> dt -b pExcptRec
Local var @ 0x1d8c1d5c Type _EXCEPTION_RECORD*
0x001ff750 
   +0x000 ExceptionCode    : 0xc0000005
   +0x004 ExceptionFlags   : 0
   +0x008 ExceptionRecord  : (null) 
   +0x00c ExceptionAddress : 0x7111e79a 
   +0x010 NumberParameters : 2
   +0x014 ExceptionInformation : 
    [00] 1
    [01] 0x1d939f00

derekbruening avatar Feb 01 '21 02:02 derekbruening

Note that the default Windows DR injection mechanism was tweaked in PR #4653 and IIRC the prior default mechanism did not execute DR code from the cache at the start? So it's possible this never happened until recently with this test. Hmm, in fact it's possible my update of DR revealed this -- though I thought I did a pull after PR #4653? Checking...looks like DrM was using DR from Dec 6 before. Hmm.

derekbruening avatar Feb 01 '21 02:02 derekbruening

Does that theory as to what is happening make sense, that your client would deliberately raise a handled fault on an app stack pop?

Yes, the test inserts code that deliberately raises faults at app instructions that access memory (incl. pop).

As to what to do about it: I guess just remove the whole code sequence around that message so it is then passed to the app? Maybe leave it if there is no client as an extra condition.

Not sure if I follow your suggestion. Are you talking about a fix from DynamoRIO's standpoint or do you mean I make the test skip the instrumentation of the first N basic blocks to avoid the takeover code?

johnfxgalea avatar Feb 01 '21 13:02 johnfxgalea

Does that theory as to what is happening make sense, that your client would deliberately raise a handled fault on an app stack pop?

Yes, the test inserts code that deliberately raises faults at app instructions that access memory (incl. pop).

Is it non-deterministic which ones raise faults? Just trying to explain why it didn't show up in the first run for my PR.

As to what to do about it: I guess just remove the whole code sequence around that message so it is then passed to the app? Maybe leave it if there is no client as an extra condition.

Not sure if I follow your suggestion. Are you talking about a fix from DynamoRIO's standpoint or do you mean I make the test skip the instrumentation of the first N basic blocks to avoid the takeover code?

I think we should change DR itself. This could happen with any client and so a fault in this takeover code does not indicate an internal DR error. The simplest thing is to move the client_exception_event call up above the is_dynamo_address code block. Fortunately this is_dynamo_address is already inside a code cache context so this is not any crash in DR code, so this doesn't hit #50 and its complications.

derekbruening avatar Feb 01 '21 15:02 derekbruening

The test tries to access faulty red zones by accessing app addresses' shadow memory with a fixed displacement equivalent to a page size. Therefore, this heuristic may not lead the same instrumented instruction to trigger the fault given the non-deterministic nature of the memory layout. I suspect, in the first run, the fault didn't originate from dynamorio.dll code.

johnfxgalea avatar Feb 01 '21 16:02 johnfxgalea

I added this test to the ignore list in PR #2384 so when it is fixed it should be removed from there.

derekbruening avatar Feb 07 '21 18:02 derekbruening