design Alignment hints and memory offsets

How to practically use alignment hints? Should storing something to location x with a particular alignment translate that address differently? Is there an engine that does that?

Should there be an instruction to find a correctly aligned offset, for example something like get_aligned <offset> <alignment>, which would return the first location aligned to <alignment> at or after <offset>? This could be used to implement alignment-aware malloc and memalign in a libc for Wasm.
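To make the proposal concrete, here is a minimal sketch of what such an operation might compute. The name get_aligned comes from the question above; the power-of-two restriction and the Python signature are assumptions made for illustration, not part of any spec or proposal text.

```python
def get_aligned(offset: int, alignment: int) -> int:
    """Return the first multiple of `alignment` that is >= `offset`.

    Hypothetical helper: `alignment` is assumed to be a power of two,
    which allows the usual round-up-and-mask trick.
    """
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    # Round up past the next boundary, then clear the low bits.
    return (offset + alignment - 1) & ~(alignment - 1)


if __name__ == "__main__":
    for off in (0, 1, 16, 17):
        print(off, "->", get_aligned(off, 16))
```

An allocator could apply this to the start of each block it hands out, so that the alignment is paid for once per allocation rather than on every load and store.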

Dec 19 '19 19:12 penzn

SpiderMonkey uses the alignment hints on 32-bit ARM at the moment (and I think on MIPS). If the alignment hint says that the pointer may be unaligned, then we emit byte loads and stores instead of potentially unaligned word loads and stores. We still have to handle unaligned loads and stores that don't declare themselves as potentially unaligned, but the reasoning is that that handling may be expensive (signal handling) and that there's a performance benefit on some systems to declaring poor alignment if you know about it.

(We're probably going to stop doing this on ARM because by and large ARM systems now handle unaligned accesses without signals.)

Dec 20 '19 07:12 lars-t-hansen

I guess I assumed that when the backing memory for a Wasm memory operation is allocated, it would be allocated at a 16-byte aligned boundary so that any Wasm aligned load/stores would fit into this alignment. What I didn't understand is that the memory alignment seems to be just a hint that could be misused: there is no real use of the memory alignment value in the spec's semantics or high-level description, and validation only checks that memarg.align is not too large. That means non-aligned addresses with a memory-alignment hint could make it to codegen, with bad results: imagine two v128.store instructions, one to address 0 and the other to address 1, both with a 16-byte alignment hint. In my assumed "memory-is-aligned-to-16-bytes" scheme, due to the memory alignment hint I would assume I could generate the x86 MOVAPS (an aligned move) to move the vector; at compile-time I would not know what addresses could potentially be handed to the v128.store. At runtime, the v128.store to 0 would succeed and the v128.store to 1 would crash with a general-protection exception.

That makes me think that:

the scheme I had assumed, "memory-is-aligned-to-16-bytes", could crash unless we check at runtime that the addresses used actually conform to the memory alignment of the generated assembly
or we expect the memory alignment hint to be less of a "we'll try to align if we can" hint and more of a "this could crash your program if you use it incorrectly" hint
or we implement something like @penzn suggested, though I think if we had an instruction that returned aligned addresses then we wouldn't need the memory alignment hint on load/store at all.

Dec 20 '19 18:12 abrown

@abrown, in your hypothetical scenario that results in a general-protection fault, the engine would be responsible for handling that fault and performing the v128.store via some other slow path, as @lars-t-hansen mentioned. So yes, there could be a severe performance penalty for getting the alignment hint wrong, but there would never be a visible change in behavior and in particular the user will never observe a crash.
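As a rough illustration of the "fast path, then slow path on fault" idea, the following sketch simulates it in plain Python. Everything here is hypothetical: AlignmentFault stands in for the hardware fault, aligned_store16 models an aligned-only instruction such as MOVAPS, and a real engine would do this with signal handlers and generated machine code rather than exceptions.

```python
class AlignmentFault(Exception):
    """Stand-in for the hardware fault raised by an aligned-only store."""


def aligned_store16(memory: bytearray, addr: int, value: bytes) -> None:
    # Models an aligned-only 16-byte store: faults on unaligned addresses.
    if addr % 16 != 0:
        raise AlignmentFault(addr)
    memory[addr:addr + 16] = value


def bytewise_store16(memory: bytearray, addr: int, value: bytes) -> None:
    # Slow path: one byte at a time, with no alignment requirement.
    for i, b in enumerate(value):
        memory[addr + i] = b


def v128_store(memory: bytearray, addr: int, value: bytes) -> str:
    """Store 16 bytes; report which path was taken (for illustration only)."""
    if len(value) != 16:
        raise ValueError("v128 values are 16 bytes")
    if addr < 0 or addr + 16 > len(memory):
        raise IndexError("out-of-bounds access (a real engine would trap)")
    try:
        aligned_store16(memory, addr, value)
        return "fast"
    except AlignmentFault:
        # The hint was wrong: same observable result, just slower.
        bytewise_store16(memory, addr, value)
        return "slow"
```

In both cases the memory ends up with the same contents; only the cost differs, which is the point being made about the hint not changing visible behavior.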

Dec 20 '19 18:12 tlively

As the engine is responsible for code generation, the engine can decide what the most optimal code generation path is. We're still working through some alignment issues in the V8 engine for SIMD code, but in general - if movaps or any other instruction needing 16-byte alignment is being generated, it is up to the engine to guarantee that the data is aligned correctly.

For example, the engine can mandate 16-byte aligned memory if needed, but this can be done differently in different engines, and should not be enforced in the Spec because this is not common to all hardware. The engine has control over the addresses of memory operands, the code being generated, and the alignment requirements of the instructions in code generation, so using the movaps example from above, it should be possible to ensure that a movaps instruction is not being generated at address 1.

Dec 20 '19 20:12 dtig

If the alignment hint says that the pointer may be unaligned, then we emit byte loads and stores instead of potentially unaligned word loads and stores.

@lars-t-hansen how do you check that, by looking at the offset or by simply checking if the hint value is less than what the hardware instruction requires?

The engine has control over the addresses of memory operands, the code being generated, and the alignment requirements of the instructions in code generation, so using the movaps example from above, it should be possible to ensure that a movaps instruction is not being generated at address 1.

@dtig I see that one can pick a hardware instruction with a particular alignment requirement based on the hint, but going beyond that would mean non-trivial address translation for Wasm memory. As @zeux pointed out in https://github.com/WebAssembly/simd/issues/162#issuecomment-568058421, I am not sure it would be practical from a codegen point of view to do anything other than picking aligned or non-aligned memory access instructions.

Would it be beneficial to have an instruction to find Wasm memory offset with a given alignment?

Dec 21 '19 00:12 penzn

the engine would be responsible for handling that fault and performing the v128.store via some other slow path

@tlively, can you point me to how this is done in LLVM, e.g.? It seems not trivial. (Otherwise, as @zeux pointed out in the other thread, a runtime branch based on whether the address is aligned makes me lean toward always using MOVUPS, in which case the alignment hints do not seem useful).

Dec 22 '19 01:12 abrown

@abrown, this would not be done in LLVM, but rather in an engine like Spidermonkey, as @lars-t-hansen mentioned above.

Dec 22 '19 02:12 tlively

So LLVM can only ever emit a MOVUPS when it sees a v128.store? It can't use the alignment hint to generate a MOVAPS?

Dec 22 '19 02:12 abrown

There’s no upstream frontend for WebAssembly in LLVM, so you’d have to check out how Wasmer or WAVM lower WebAssembly to LLVM IR. My guess is that they pessimistically throw away all alignment information. And if they don’t, that’s a bug because you’re right that catching signals and patching code are not supported natively by LLVM.

Dec 22 '19 15:12 tlively

If the alignment hint says that the pointer may be unaligned, then we emit byte loads and stores instead of potentially unaligned word loads and stores.

@lars-t-hansen how do you check that, by looking at the offset or by simply checking if the hint value is less than what the hardware instruction requires?

The latter. The offset does not have enough information, since we care about the final ptr+offset alignment (the effective address), not the alignment of the pointer or the offset individually.
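A small sketch of both points, under stated assumptions: choose_lowering is a hypothetical name for the compile-time decision (compare the hinted alignment with the access size), and effective_address shows why the static offset alone says nothing about the final alignment. This is not SpiderMonkey's actual code.

```python
def choose_lowering(hint_bytes: int, access_bytes: int) -> str:
    """Compile-time choice based only on the hint (hypothetical model).

    If the hint promises less than the natural alignment of the access,
    fall back to byte-sized accesses; otherwise use a single word access.
    """
    return "word" if hint_bytes >= access_bytes else "bytes"


def effective_address(ptr: int, offset: int) -> int:
    # The address that actually reaches the hardware is ptr + offset.
    return ptr + offset


if __name__ == "__main__":
    # An aligned offset with an unaligned pointer gives an unaligned address.
    print(effective_address(2, 4) % 4)
    # Two unaligned parts can still add up to an aligned address.
    print(effective_address(1, 3) % 4)
    print(choose_lowering(hint_bytes=1, access_bytes=4))
    print(choose_lowering(hint_bytes=4, access_bytes=4))
```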

(For atomics, the situation is related but different. Wasm requires atomic accesses to be aligned because a lot of hardware does, and absent an analysis that can prove that the EA is aligned we must check the alignment of the effective address at runtime, and trap if it is not.)
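For contrast with the hint-only decision, the atomic case can be sketched as a runtime check on the effective address. The Trap class and the function name are illustrative assumptions; a real engine would emit this check inline (or prove it unnecessary) rather than call a helper.

```python
class Trap(Exception):
    """Stand-in for a Wasm trap."""


def check_atomic_alignment(ptr: int, offset: int, access_bytes: int) -> int:
    """Return the effective address, or trap if it is not naturally aligned."""
    ea = ptr + offset
    if ea % access_bytes != 0:
        raise Trap(f"unaligned atomic access at address {ea}")
    return ea
```

Unlike the plain load/store case, a misaligned address here changes observable behavior, which is why the check cannot be skipped on the basis of the hint alone.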

Dec 22 '19 16:12 lars-t-hansen

If the alignment hint says that the pointer may be unaligned, then we emit byte loads and stores instead of potentially unaligned word loads and stores.

@lars-t-hansen how do you check that, by looking at the offset or by simply checking if the hint value is less than what the hardware instruction requires?

The engine has control over the addresses of memory operands, the code being generated, and the alignment requirements of the instructions in code generation, so using the movaps example from above, it should be possible to ensure that a movaps instruction is not being generated at address 1.

@dtig I see that one can pick a hardware instruction with a particular alignment requirement based on the hint, but going beyond that would mean non-trivial address translation for Wasm memory. As @zeux pointed out in WebAssembly/simd#162 (comment), I am not sure it would be practical from a codegen point of view to do anything other than picking aligned or non-aligned memory access instructions.

IIUC, picking aligned memory accesses based on the alignment hints is the purpose of these hints because accesses aligned to natural alignment are supposed to be fast accesses. So to me this sounds like if an engine decides to optimize for this, it is WAI?

Would it be beneficial to have an instruction to find Wasm memory offset with a given alignment?

I'm not sure what the benefits of such an instruction would be. Could you elaborate?

The way I see an instruction like this being used, is to use the return value of this offset into a subsequent memory operation to guarantee an aligned access, but this seems more expensive than just emitting unaligned accesses, or a runtime branch that validates an alignment hint.

Dec 23 '19 20:12 dtig

@lars-t-hansen how do you check that, by looking at the offset or by simply checking if the hint value is less than what the hardware instruction requires?

The latter. The offset does not have enough information, since we care about the final ptr+offset alignment (the effective address), not the alignment of the pointer or the offset individually.

I think it would still be possible to end up in a situation where the hardware instruction expects an aligned address while the effective address is not aligned. Would this lead to a trap?

The way I see an instruction like this being used, is to use the return value of this offset into a subsequent memory operation to guarantee an aligned access, but this seems more expensive than just emitting unaligned accesses, or a runtime branch that validates an alignment hint.

Yep, that would require tracking what the effective address is aligned to, and should only be used on allocations. Depending on how often allocations happen versus how often unaligned reads happen, it might be cheaper than taking faults. Tracking effective address alignment on Wasm loads and stores would be much more expensive though.

My dilemma is that new constructs to find aligned locations are not necessarily worth it, as hardware is getting better at unaligned memory accesses, but if hardware is getting better, should we even have an alignment hint?

Dec 24 '19 08:12 penzn

Seems like the current spec gives implementers only 2 options:

Always emit unaligned loads and ignore the alignment. Bad if you still care about hardware where aligned instructions are faster or smaller.
Emit aligned, but trap and retry (possibly with patch). To me, Wasm should not have surprising performance cliffs, and this will have one if patching is not possible. It may cause people to generate Wasm modules with "bugs" in them that are being masked by the runtime. Even with patching this seems a rather inelegant solution, and indicative that the spec is choosing to be "correct" at the cost of real world concerns / pragmatism.

I had hoped that the spec would allow an implementation to terminate, which would be most useful to the programmer, to be able to fix their code. Though I see that is not easy either, since we wouldn't want to penalize architectures that never trap on unaligned reads.

If that is somehow impossible, I vote we remove alignment from the spec entirely (and deprecate its bits in the binary format), and make for a predictably unaligned world :P

Jan 02 '20 20:01 aardappel

@aardappel

I had hoped that the spec would allow an implementation to terminate, which would be most useful to the programmer, to be able to fix their code. Though I see that is not easy either, since we wouldn't want to penalize architectures that never trap on unaligned reads.

I'd expect that the default behavior of generic implementations should be to act consistently with other implementations (including not terminating on things that other platforms may not efficiently detect), but an implementation could add a nonstandard strict mode that developers could use in development that terminates on issues like misaligned loads. (The strict mode could also do further sanity checks at runtime like Valgrind.) Regular implementations could also choose to do something by default that doesn't affect program behavior, like logging a warning to the console when a misaligned load occurs on a platform that efficiently supports checking for that.

Jan 02 '20 23:01 Macil

I had hoped that the spec would allow an implementation to terminate, which would be most useful to the programmer, to be able to fix their code. Though I see that is not easy either, since we wouldn't want to penalize architectures that never trap on unaligned reads.

We can try to enforce alignment on allocation, using an instruction returning a correctly aligned index, which can be turned into a NOP by the engine in situations where unaligned access is acceptable. The instruction can take an integer for the starting index and return an integer for the first aligned index (same or greater value); in NOP mode the value just stays the same.

If that is somehow impossible, I vote we remove alignment from the spec entirely (and deprecate its bits in the binary format), and make for a predictably unaligned world :P

Agree, if the field cannot be effectively leveraged, it is worth considering removing it.

Jan 04 '20 01:01 penzn

The alignment hint was added to the spec with the assumption that compilers would almost always be able to emit it correctly. Alignment hints are always correct in all code produced by LLVM at least, barring bugs in LLVM or UB in user code.

If this assumption holds, it supports the strategy of VMs trusting the alignment hint, and then relying on very-slow fixups (eg. trap handlers) to fix things when the alignment hint is wrong, because it'll almost never be wrong.

Jan 07 '20 21:01 sunfishcode

Sorry for my ignorance, but I'm struggling to understand exactly how this works with misaligned operations.

Can the engine more efficiently load an i32 from an (effective) address of 2 if we tell it that the alignment is 16-bit? Or would it be enough to just tell the engine that the operation is misaligned?

It seems like WAT (not Wasm) only needs an optimistic hint (very-probably-aligned) and a pessimistic hint (very-probably-misaligned), leaving the engine to adopt its own strategy when the hint is omitted.

An aligned operation would infer its alignment from the value being loaded or stored in memory, so i32.aligned_load16 would have a 16-bit alignment.

Sorry if I've completely misunderstood how this works.

Sep 04 '20 23:09 carlsmith

@carlsmith in most cases a bool would have sufficed, but there are cases where using a smaller than natural alignment makes sense, for example on 32-bit x86 loading a double from a 4-byte aligned address counts as "aligned", but 2-byte aligned would not.

"Hints" would not be good, because on some architectures (e.g. ARM) load instructions for aligned or unaligned are different, so if the engine can't be 100% sure, it would need to use unaligned loads, because the aligned load would trap on unaligned accesses (unlike x86).

Sep 07 '20 16:09 aardappel

@aardappel - Thanks for the explanation. I'm still not sure I get it though.

This would all be specific to WAT, not Wasm. The binary format would not need to change. The new WAT code that would look something like i32.aligned_load16 would compile to the same opcode and memarg as i32.load16 align=1.

It's actually ternary. You effectively have a bool that can also be null, mapping to three states: Aligned, Misaligned and No Specified Alignment.

Given backwards compatibility, there's no practical likelihood of WAT ever losing the ability to specify an alignment in the memarg. This is more a question of whether WAT could specify alignments just as well with a ternary value. If there are edge cases, the current syntax would still work in practice, but is there actually a need for it?

On hints: I thought the spec requires that load and stores work no matter whether the specified alignment is correct or not (only accepting worse performance), so while they are expected to always be correct in practice, they are semantically hints.

WAT only assembles to Wasm bytecode, so loading a double on a specific architecture seems like something the Engine would always be responsible for figuring out, though I may have misunderstood that point.

Thanks again for taking the time.

Sep 08 '20 04:09 carlsmith

@carlsmith I was explaining why the way it works in Wasm makes sense, and why you might want that level of control. I wasn't talking about the text format, though I don't see why we would want to deviate from that representing the binary format as closely as possible.

Sep 10 '20 16:09 aardappel

Fair enough. Keeping close to the binary format is a good reason.

Sep 11 '20 04:09 carlsmith

The questions here appear answered; please file new issues if there are any further questions!

Oct 28 '22 23:10 sunfishcode