ARM suite test regressions: api.disA32, api.disT32, many more

Open derekbruening opened this issue 6 years ago • 10 comments

We did not have ARM tests being run for a while. Now that we have a CDash machine back up, I compared its debug run yesterday to the most recent on that same machine name from April 2018:

http://dynamorio.org/CDash/viewTest.php?buildid=35255

Testing started on 2018-04-18 23:37:13
Name	Status	Time	Details	Labels
code_api|client.drmgr-test	Failed	5s 810ms	Completed (Failed)	
code_api|linux.signal_race	Failed	1m 30s 690ms	Completed (Failed)	
code_api|linux.eintr-noinline	Failed	1m 30s 220ms	Completed (Failed)	
code_api|tool.drcachesim.invariants	Failed	1m 30s 410ms	Completed (Failed)	
code_api|tool.histogram.offline	Failed	1m 31s 760ms	Completed (Failed)	
code_api|tool.drcachesim.delay-simple	Failed	8s 340ms	Completed (Failed)

http://dynamorio.org/CDash/viewTest.php?buildid=44194

Testing started on 2019-06-21 23:30:26
Name	Status	Time	Details	Labels
code_api|linux.signal_race	Failed	1m 30s 20ms	Completed (Failed)	
code_api|linux.eintr-noinline	Failed	1m 30s 180ms	Completed (Failed)	
 code_api|linux.sigsuspend	Failed	1m 30s 100ms	Completed (Failed)	
 code_api|tool.drcacheoff.simple	Failed	6s 770ms	Completed (Failed)	
 code_api|tool.drcacheoff.filter	Failed	8s 820ms	Completed (Failed)	
 code_api|tool.drcacheoff.opcode_mix	Failed	7s 620ms	Completed (Failed)	
 code_api|linux.syscall_pwait	Failed	10s 890ms	Completed (Failed)	
 code_api|client.predicate-test	Failed	1s 110ms	Completed (Failed)	
 code_api|client.drreg-test	Failed	1s 520ms	Completed (Failed)	
code_api|client.drmgr-test	Failed	3s 410ms	Completed (Failed)	
 code_api|client.drutil-test	Failed	2s 420ms	Completed (Failed)	
 code_api|sample.opcodes	Failed	510ms	Completed (Failed)	
code_api|tool.drcachesim.delay-simple	Failed	4s 110ms	Completed (Failed)	
 code_api|tool.drcacheoff.multiproc	Failed	12s 390ms	Completed (Failed)	
 code_api|api.disA32	Failed	1s 360ms	Completed (Failed)	
 code_api|api.disT32	Failed	1s 430ms	Completed (Failed)

derekbruening avatar Jun 24 '19 15:06 derekbruening

I'm assuming these regressions have happened because we don't have AArch32 precommit testing in the same way we have for AArch64?

AssadHashmi avatar Jun 25 '19 11:06 AssadHashmi

Note: in a recent debug run code_api|linux.signal_race did not fail, so it is at least flaky.

In a recent release run we observed code_api|linux.reset failing as well. In addition, code_api|linux.sigsuspend and code_api|linux.signal_race passed in that run.

Carrotman42 avatar Jun 25 '19 23:06 Carrotman42

> I'm assuming these regressions have happened because we don't have AArch32 precommit testing in the same way we have for AArch64?

That's my assumption, but I wouldn't have expected the ARM encoder/decoder tests to have broken. Hopefully it's something simple.

derekbruening avatar Jun 28 '19 15:06 derekbruening

The log message for c76c48562f6781fd38b34b7e367c9221a5bd6f4d (Oct 2021) says:

Adds missing required-1 bits in the ARM encoding table entries for
OP_blx, OP_bx, and OP_bxj.  Without the bits, some hardware still
accepts the instructions (which is why we did not notice the problem
before), but they are technically unsound, and QEMU thinks they are
invalid, breaking some of our tests under QEMU.

That change is responsible for most of the api.disA32 failures, and perhaps that change should be reversed. It's my understanding that all hardware will accept those instructions: bits 8-19, which should be 1, will be ignored. Since these are branch instructions DynamoRIO really needs to decode and intercept them, I think. The problem with QEMU is perhaps a bug in QEMU. Perhaps it has even been fixed in the meantime.
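To make the encoding question concrete, here is a minimal sketch in C of the difference between a strict match (required-1 bits must be set) and a lenient one (bits 8-19 ignored). It relies only on the architectural A32 layout shared by BX, BXJ and BLX (register): `cond 0001 0010 SBO[19:8] op[7:4] Rm[3:0]`. The names `decode_strict`, `decode_lenient` and `branch_kind_t` are hypothetical and are not DynamoRIO's decoder tables or API; the condition field is not examined.

```c
#include <stdint.h>

/* A32 BX / BXJ / BLX (register) share this layout:
 *   cond[31:28] 0001 0010 [27:20]  SBO[19:8]  op[7:4]  Rm[3:0]
 * SBO means "should be one". All names below are illustrative only. */
#define BRANCH_FIXED_MASK 0x0ff000f0u /* bits 27:20 and 7:4 */
#define SBO_MASK          0x000fff00u /* bits 19:8 */
#define BX_FIXED          0x01200010u
#define BXJ_FIXED         0x01200020u
#define BLX_REG_FIXED     0x01200030u

typedef enum { BR_NONE, BR_BX, BR_BXJ, BR_BLX } branch_kind_t;

/* Classify using only the fixed opcode bits, ignoring bits 19:8. */
branch_kind_t decode_lenient(uint32_t word)
{
    switch (word & BRANCH_FIXED_MASK) {
    case BX_FIXED: return BR_BX;
    case BXJ_FIXED: return BR_BXJ;
    case BLX_REG_FIXED: return BR_BLX;
    default: return BR_NONE;
    }
}

/* Classify only if every should-be-one bit is actually set. */
branch_kind_t decode_strict(uint32_t word)
{
    if ((word & SBO_MASK) != SBO_MASK)
        return BR_NONE;
    return decode_lenient(word);
}
```

For example, `0xE12FFF1E` (`bx lr`) matches under both functions, whereas `0xE120001E` (the same word with bits 8-19 cleared) is a BX only under the lenient match. A decoder that behaves like `decode_strict` would fail to recognise the second form as a branch, which is the concern raised above.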

The other api.dis failures are due to vadd.f32 and I have a patch for them coming soon, I hope.

egrimley-arm avatar Dec 09 '24 16:12 egrimley-arm

I did a bisection on common.segfault: it seems to have been broken by a8c4f6fbcb847f7dec80c4f86e3b7de33f332b27, and I can make it pass again on the head by changing RETCODE_SIZE in core/unix/signal_private.h from 16 to 8. However, I've seen unsigned long retcode[4] in a genuine kernel header file so maybe there is something else wrong that is somehow compensated for by the previous definition of RETCODE_SIZE.

egrimley-arm avatar Dec 09 '24 22:12 egrimley-arm

Yes, I did see unsigned long retcode[4] in a genuine kernel header file, but it was part of sigframe, not part of rt_sigframe. There seems to be no retcode in an AArch32 rt_sigframe. I will draft a PR to remove it.

(A spurious retcode of length 8 bytes happens to be harmless because there are a couple of items on the stack beyond the rt_sigframe so copying 8 bytes from after the rt_sigframe does not segfault. Copying any more does segfault, which is particularly confusing in this case as we're already in the process of attempting to handle a segfault.)

egrimley-arm avatar Dec 10 '24 14:12 egrimley-arm

In an arm32v7 Debian 10 Docker container under a 64-bit 5.4.47 kernel on Cortex-A72 hardware, with DynamoRIO slightly modified (retcode removed from rt_sigframe), I got 114 out of 298 tests failing. One of the failing tests was drcacheoff.simple, so I looked at that one under GDB. The stack trace looked like nonsense, but the SIGBUS reported from an instance of ldrexd r0, r1, [r2] looked plausible: r2 contained an odd multiple of 4, but that instruction requires 8-byte alignment.

Perhaps it's obvious in hindsight what was happening, but I had to do stuff like insert home-made asm volatile("b.w .\n\t.inst 65") breakpoints into the source to discover that the problem seemed to be one of the two instances (probably the first one) in tracer.cpp of

        placement = dr_global_alloc(MAX_INSTRU_SIZE);
        instru = new (placement) offline_instru_t(

dr_global_alloc returns a value with HEAP_ALIGNMENT, which is just 4 on ARM, but offline_instru_t uses some library lock function that assumes the standard 8-byte alignment and uses ldrexd.
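The alignment arithmetic behind this can be sketched in a few lines of C. This is an illustration only: `align_forward` and `is_aligned` are hypothetical helpers written for this example and may differ from DynamoRIO's own ALIGN_FORWARD macro, and the addresses are plain integers rather than real allocations.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round x up to the next multiple of alignment.
 * Assumes alignment is a power of two. */
static inline uintptr_t align_forward(uintptr_t x, uintptr_t alignment)
{
    return (x + (alignment - 1)) & ~(alignment - 1);
}

/* True if x is a multiple of alignment (again a power of two). */
static inline bool is_aligned(uintptr_t x, uintptr_t alignment)
{
    return (x & (alignment - 1)) == 0;
}
```

An address such as 12 is 4-byte aligned but an odd multiple of 4, so `is_aligned(12, 8)` is false and `align_forward(12, 8)` yields 16. An allocator that only guarantees 4-byte alignment can hand back either kind of address, which would explain why an instruction that needs 8-byte alignment faults only some of the time.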

So I changed HEAP_ALIGNMENT to 8, added a couple of ALIGN_FORWARDs in the definition of BLOCK_SIZES, changed HEADER_SIZE to 8, replaced STANDARD_HEAP_ALIGNMENT - HEAP_ALIGNMENT with 4 in redirect_malloc, and commented out a few ASSERTs. With those changes only 72 out of 298 tests failed, so 42 tests were fixed by that hackery.

(In order to build the tests I also had to add a bogus definition of gettid to burst_syscall_inject.cpp.)

Question for @derekbruening: Is changing HEAP_ALIGNMENT to 8 on ARM the right approach here?

(Since DynamoRIO has been broken on 32-bit ARM hardware since 2019, there is perhaps not a huge amount of interest in it, so we probably can't invest a huge amount of effort in it. An easy fix would be preferred!)

egrimley-arm avatar Dec 13 '24 09:12 egrimley-arm

> Question for @derekbruening: Is changing HEAP_ALIGNMENT to 8 on ARM the right approach here?

Maybe that is the simplest approach, rather than having to find all C++ uses and replace them with __wrap_malloc(). Hopefully the DR allocator code just works if the alignment define is changed...

derekbruening avatar Dec 13 '24 20:12 derekbruening

With 73b1df12b3484032045e07923668a7c834511469 plus a few changes that have not yet been merged, namely:

  • Changing HEAP_ALIGNMENT to 8 (see above)
  • Disabling DEBUG_ASSERT (#7132)
  • Fixing drreg.c (I have a draft PR)

in an AArch32 Debian 10 container under a 64-bit 5.4.47 Linux kernel running on Cortex-A72, I got 47 out of 298 tests failing:

          4 - tool.drcacheoff.raw2trace_unit_tests (Failed)
          5 - tool.scheduler.unit_tests (Child aborted)
          8 - tool.drcacheoff.trace_interval_analysis_unit_tests (Failed)
          9 - tool.drcacheoff.analysis_unit_tests (Timeout)
         20 - code_api|linux.sigaction.native (Failed)
         85 - code_api|linux.sigmask (Failed)
         89 - code_api|linux.fib-conflict (Failed)
         90 - code_api|linux.fib-conflict-early (Failed)
         93 - code_api|linux.vfork (Failed)
         94 - code_api|linux.thread-reset (Failed)
        100 - code_api|pthreads.pthreads_fork_FLAKY (Failed)
        116 - code_api|client.detach_test (Timeout)
        120 - code_api|client.file_io (Failed)
        129 - code_api|client.drmgr-test (Failed)
        136 - code_api|client.drbbdup-drwrap-test (Failed)
        140 - code_api|client.drreg-test (Failed)
        141 - code_api|client.drreg-end-restore (Failed)
        154 - code_api|client.drutil-test (Failed)
        172 - code_api|sample.statecmp (Failed)
        194 - code_api|tool.drcachesim.delay-global (Failed)
        213 - code_api|tool.drcacheoff.raw-zlib (Failed)
        217 - code_api|tool.drcacheoff.multiproc (Timeout)
        225 - code_api|tool.drcacheoff.sysnums (Failed)
        229 - code_api|tool.drcacheoff.max-global (Failed)
        235 - code_api|tool.drcacheoff.windows-invar (Failed)
        253 - code_api|tool.drcacheoff.invariant_checker (Failed)
        254 - code_api|tool.drcacheoff.invariant_checker_pthreads (Failed)
        255 - code_api|tool.histogram.offline (Failed)
        256 - code_api|tool.record_filter (Failed)
        257 - code_api|tool.record_filter_bycore_uni (Failed)
        258 - code_api|tool.record_filter_bycore_multi (Failed)
        259 - code_api|tool.drcachesim.drstatecmp-delay-simple (Failed)
        260 - code_api|tool.drcacheoff.drstatecmp-delay-simple (Failed)
        263 - code_api|tool.drcacheoff.getretaddr_record_replace_retaddr (Failed)
        272 - code_api|client.drwrap-test-detach (Failed)
        274 - code_api|api.disA32 (Failed)
        276 - code_api|api.startstop (Failed)
        277 - code_api|api.detach (Failed)
        278 - code_api|api.detach_spawn_FLAKY (Failed)
        279 - code_api|api.detach_spawn_stress_FLAKY (Failed)
        289 - code_api|api.static_sideline_FLAKY (Failed)
        293 - code_api|api.thread_churn (Failed)
        294 - code_api|tool.drcacheoff.legacy (Failed)
        295 - code_api|tool.drcacheoff.func_view_noret (Failed)
        296 - code_api|tool.drcacheoff.altbindir (Failed)
        297 - code_api|tool.drcacheoff.legacy-int-offs (Failed)
        298 - no_code_api,no_intercept_all_signals|linux.sigaction (Failed)

egrimley-arm avatar Dec 20 '24 14:12 egrimley-arm

These are the tests that consistently failed on AArch32 with 632f8d1d8ee2eb9fd6a3027d83cea6845d8d66b2 (2025-01-30) on Debian 10 (released in 2019):

	  4 - tool.drcacheoff.raw2trace_unit_tests (Failed)
	  5 - tool.scheduler.unit_tests (Child aborted)
	  8 - tool.drcacheoff.trace_interval_analysis_unit_tests (Failed)
	 10 - tool.drcachesim.decode_cache_test (SEGFAULT)
	 91 - code_api|linux.fib-conflict (Failed)
	 92 - code_api|linux.fib-conflict-early (Failed)
	102 - code_api|pthreads.pthreads_fork_FLAKY (Failed)
	122 - code_api|client.file_io (Failed)
	131 - code_api|client.drmgr-test (Failed)
	138 - code_api|client.drbbdup-drwrap-test (Failed)
	142 - code_api|client.drreg-test (Failed)
	143 - code_api|client.drreg-end-restore (Failed)
	156 - code_api|client.drutil-test (Failed)
	174 - code_api|sample.statecmp (Failed)
	196 - code_api|tool.drcachesim.delay-global (Failed)
	227 - code_api|tool.drcacheoff.sysnums (Failed)
	231 - code_api|tool.drcacheoff.max-global (Failed)
	237 - code_api|tool.drcacheoff.windows-invar (Failed)
	256 - code_api|tool.drcacheoff.invariant_checker (Failed)
	257 - code_api|tool.drcacheoff.invariant_checker_pthreads (Failed)
	258 - code_api|tool.histogram.offline (Failed)
	259 - code_api|tool.record_filter (Failed)
	260 - code_api|tool.record_filter_bycore_uni (Failed)
	261 - code_api|tool.record_filter_bycore_multi (Failed)
	262 - code_api|tool.drcachesim.drstatecmp-delay-simple (Failed)
	263 - code_api|tool.drcacheoff.drstatecmp-delay-simple (Failed)
	266 - code_api|tool.drcacheoff.getretaddr_record_replace_retaddr (Failed)
	275 - code_api|client.drwrap-test-detach (Failed)
	277 - code_api|api.disA32 (Failed)
	279 - code_api|api.startstop (Failed)
	280 - code_api|api.detach (Failed)
	281 - code_api|api.detach_spawn_FLAKY (Failed)
	282 - code_api|api.detach_spawn_stress_FLAKY (Failed)
	292 - code_api|api.static_sideline_FLAKY (Failed)
	296 - code_api|api.thread_churn (Failed)
	297 - code_api|tool.drcacheoff.legacy (Failed)
	298 - code_api|tool.drcacheoff.func_view_noret (Failed)

In addition, the following tests seemed to fail consistently on one of the systems I tested on:

	 79 - code_api|linux.signal_race (Failed)
	 96 - code_api|linux.thread-reset (Failed)
	118 - code_api|client.detach_test (Timeout)
	219 - code_api|tool.drcacheoff.multiproc (Timeout)

I tested the above in Docker containers on two different AArch64 systems with different hardware and kernel versions. Some notes for anyone wanting to reproduce this:

  • On most AArch64 systems you can run an AArch32 Docker container with: docker run --privileged arm32v7/debian:10 linux32 /bin/bash
  • If you forget the linux32 the build fails with errors that include error: unknown type name '__uint128_t' and Error: ARM register expected.
  • If you forget the --privileged and don't use --security-opt seccomp=unconfined instead, then two linux.sigaction tests fail because Docker by default disables certain system calls. Another test works around the absence of those system calls and so passes too easily in a default Docker container. It is therefore better to use one of those options when testing.
  • tool.drcacheoff.analysis_unit_tests sometimes needs many attempts before it passes.

egrimley-arm avatar Feb 03 '25 14:02 egrimley-arm