
Misalignment checks

Open lachlansneff opened this issue 6 years ago • 9 comments

According to the current overview of the threads proposal, misaligned atomic accesses trap, while regular misaligned memory accesses still don't.

From the research I've done, it looks like using CPU exceptions to support trapping just for misaligned atomic operations is a no-go for several reasons (lack of OS support, overhead, etc.). Are we expected to emit a branch to check whether the pointer is aligned?

lachlansneff avatar Feb 05 '19 08:02 lachlansneff

Firefox currently emits an explicit alignment check for atomics. I'm guessing / predicting that these checks can be hoisted and optimized in the same way that explicit bounds checks are, but we're not doing that yet.
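
As a rough illustration of what such an explicit check amounts to, here is a minimal Python model. It is not Firefox's actual code generation; the names `Trap` and `check_atomic_alignment` are hypothetical, and the sketch assumes the access width is a power of two, so the alignment test reduces to a mask on the effective address.

```python
class Trap(Exception):
    """Models a WebAssembly trap (hypothetical name for this sketch)."""


def check_atomic_alignment(base: int, offset: int, width: int) -> int:
    """Return the effective address, or raise Trap if it is misaligned.

    width is the access size in bytes (1, 2, 4, or 8).
    """
    assert width in (1, 2, 4, 8)
    effective = base + offset
    # For power-of-two widths, "aligned" means the low log2(width) bits are zero.
    if effective & (width - 1):
        raise Trap(f"misaligned atomic access at {effective:#x}")
    return effective
```

In generated machine code this would be a mask-and-branch ahead of the atomic instruction, which is the per-access cost the question above is about.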

I don't think "overhead" is a good reason not to rely on the CPU to trap unaligned atomics if possible. After all, we handle unaligned normal accesses that are not labeled as unaligned by emulating them in a trap handler on architectures where this is necessary (32-bit ARM, so far). IMO atomics are implicitly labeled as always-aligned and fall into the same bucket, except that we won't emulate them.

I never investigated whether we could omit the alignment checks and rely on the CPU to trap; like optimizing the explicit alignment checks, it's something we can get to by and by. Judging from the ARMv8 manual, exclusive loads/stores trap reliably if not properly aligned, so at least on some platforms they will show up as bus errors.

lars-t-hansen avatar Feb 05 '19 12:02 lars-t-hansen

We also emit an alignment check for atomic accesses.

I think we may be stuck with the explicit branch because of the constraints we have, at least as I understand them.

  • It's a non-goal to properly support atomic accesses to non-aligned addresses, because that doesn't seem feasible on today's hardware.
  • Therefore we have to disallow them: being unable to guarantee atomicity defeats their purpose, and allowing them would introduce (even more) nondeterminism.
  • Since we can't statically disallow them, we have to dynamically disallow them.
  • Even though hardware (is it all hardware?) can produce exceptions on misalignment, there are embedding constraints (OS, libraries) that may prevent use of the CPU capability in many situations.

> After all, we handle unaligned normal accesses that are not labeled as unaligned by emulating them in a trap handler on architectures where this is necessary (32-bit ARM, so far).

We actually don't do this. Does Firefox support a wider set of ARM hardware than we do?

titzer avatar Feb 05 '19 12:02 titzer

We observed that some 32-bit ARM devices running Android would trap on unaligned floating point accesses. In fact, once I started looking, I didn't find any devices that didn't do that. Integer instructions haven't been an issue, though.

lars-t-hansen avatar Feb 05 '19 13:02 lars-t-hansen

@titzer We actually can statically disallow them with a change to the wasm atomics spec. If every atomic instruction takes an operand that is the intended memory address shifted right by the number of low bits that would be zero anyway if the address were aligned, then atomics are always aligned. It would be up to the guest language to ensure alignment or cause (safe) UB.
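
A minimal sketch of this idea as a Python model: the instruction's operand is treated as an element index rather than a byte address, and the engine shifts it left by log2 of the access width. `scaled_address` is a hypothetical helper for illustration, not anything from the spec.

```python
def scaled_address(index: int, width: int) -> int:
    """Map an element index to a byte address that is aligned by construction."""
    assert width in (1, 2, 4, 8)
    shift = width.bit_length() - 1  # log2(width) for powers of two
    return index << shift


# Every possible operand yields an aligned address, so no runtime
# alignment check is needed under this encoding.
for width in (1, 2, 4, 8):
    for index in range(64):
        assert scaled_address(index, width) % width == 0
```

The trade-off is that a producer holding a raw byte pointer has to shift it right first, silently discarding any misaligned low bits.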

lachlansneff avatar Feb 05 '19 16:02 lachlansneff

I see that, but that might cause other problems, since now the input is 32 user-controlled bits, shifted left by 1, 2, or 3 bits. Should the (implicit) bounds check ignore the upper bits or not? Either way, you introduce an implicit shift and potentially another comparison.

titzer avatar Feb 05 '19 17:02 titzer

Yes, the bounds check would have to mask out the top bits. That method definitely introduces other issues.
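
To make the upper-bits question concrete, here is a small Python model of the two possible behaviours. All names are hypothetical, and it assumes a 32-bit operand and a memory size given in bytes. One variant keeps the bits shifted out of the 32-bit range and bounds-checks the wide value; the other masks them off first.

```python
MASK32 = 0xFFFF_FFFF


def address_wide(index: int, shift: int) -> int:
    """Keep all bits: the result can be up to 32 + shift bits wide."""
    return (index & MASK32) << shift


def address_wrapped(index: int, shift: int) -> int:
    """Mask the result back to 32 bits, discarding the operand's top bits."""
    return ((index & MASK32) << shift) & MASK32


def in_bounds(addr: int, width: int, memory_size: int) -> bool:
    """Bounds check for an access of `width` bytes starting at `addr`."""
    return addr + width <= memory_size


index, shift, width = 0x2000_0001, 3, 8
memory_size = 65536  # one 64 KiB wasm page

# The same operand is rejected by one variant and accepted by the other.
assert not in_bounds(address_wide(index, shift), width, memory_size)
assert in_bounds(address_wrapped(index, shift), width, memory_size)
```

Either way the engine performs an extra shift, and the wide variant also needs a comparison on a value wider than 32 bits.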

lachlansneff avatar Feb 05 '19 17:02 lachlansneff

I think the current compromise is ok. It means some targets need to explicitly check alignment, but a runtime can omit those checks on targets where misaligned atomics implicitly trap. It's harder for a runtime to elide explicit alignment checks that were emitted by a producer.

AndrewScheidecker avatar Feb 06 '19 12:02 AndrewScheidecker

I wonder if it's possible to change the behaviour from misalignment trapping to misaligned atomic access being not atomic at all to avoid the performance cost of checking for misalignment. This would allow architectures that trap to treat atomic accesses the same as regular accesses, while architectures that don't care about alignment could still perform the access atomically. That is fine by the as-if rule: after all, a valid program could replace all memory accesses with atomic memory accesses and still work (even if much more slowly).

That said, alignment checks for atomic accesses can probably often be optimized out, so it's probably not that big of a deal, and trapping is fine.

LunaBorowska avatar Feb 19 '20 07:02 LunaBorowska

Unfortunately, the requirement that unaligned atomic memory accesses trap implies that an explicit alignment check would be pretty much mandatory (unless it could be optimized out) when targeting many current and probably almost all future implementations of the 64-bit Arm architecture (the A profile, to be precise). The reason is the FEAT_LSE2 architectural extension, which enables unaligned atomic loads and stores as long as they lie completely within a 16-byte memory region that is aligned to 16 bytes; they generate a fault otherwise. This extension is a mandatory part of the Armv8.4-A revision of the architecture, and as a result it is also required in Armv9-A and future versions. Chips on the market that support it right now include the Samsung Exynos 2200 in mobile devices and AWS Graviton3 in the server space, among others. As for processor cores, Arm Cortex-A510, Cortex-A710, Cortex-X2, and Neoverse V1 support it.
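
The gap between the two rules can be sketched with a small Python model. It encodes only the rule as stated in this comment (an access faults only if it does not lie entirely within one 16-byte-aligned region) and ignores every other architectural detail; both function names are hypothetical.

```python
def wasm_requires_trap(addr: int, width: int) -> bool:
    """The threads proposal: any misaligned atomic access must trap."""
    return addr % width != 0


def lse2_faults(addr: int, width: int) -> bool:
    """Simplified FEAT_LSE2 rule: fault only when the access crosses a
    16-byte-aligned boundary."""
    return addr // 16 != (addr + width - 1) // 16


# Addresses where wasm demands a trap but the hardware would not fault,
# for 4-byte accesses within one 16-byte region.
gap = [a for a in range(16)
       if wasm_requires_trap(a, 4) and not lse2_faults(a, 4)]
print(gap)  # [1, 2, 3, 5, 6, 7, 9, 10, 11]
```

For every address in that list, an engine relying on the hardware fault alone would miss a trap the proposal requires, which is why the explicit check remains necessary on such cores.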

> I wonder if it's possible to change the behaviour from misalignment trapping to misaligned atomic access being not atomic at all to avoid the performance cost of checking for misalignment.

This would definitely be helpful, but in the AArch64 case it means that the trap handlers would still have to emulate the remaining atomic memory accesses that trap.

akirilov-arm avatar Oct 13 '22 08:10 akirilov-arm