stdarch icon indicating copy to clipboard operation
stdarch copied to clipboard

non-temporal stores: use inline assembly

Open RalfJung opened this issue 1 year ago • 4 comments

LLVM treats !nontemporal as just a hint on store operations, which is unsound -- they have a totally different semantics, similar to atomic memory orderings. So I'd like to avoid any risk of that causing any issues by entirely avoiding their !nontemporal attribute. Is it acceptable to use inline assembly to implement these intrinsics?

Note that this is my first time ever writing inline assembly, so the code may or may not make any sense.^^

RalfJung avatar Feb 25 '24 16:02 RalfJung

r? @Amanieu

rustbot has assigned @Amanieu. They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

rustbot avatar Feb 25 '24 16:02 rustbot

My understanding is that LLVM can turn a nontemporal store into a normal one, but not the other way around. This seems to be fine as far as I understand.


The CI failure happens because the target_feature attribute only enables sse2 and rustc isn't smart enough to figure out that this implies sse (only LLVM knowns that). You fix it by enabling the sse feature as well.

Amanieu avatar Feb 25 '24 16:02 Amanieu

My understanding is that LLVM can turn a nontemporal store into a normal one, but not the other way around. This seems to be fine as far as I understand.

It's completely unclear. LangRef talks about it as a hint:

The optional !nontemporal metadata must reference a single metadata name <nontemp_node> corresponding to a metadata node with one i32 entry of value 1. The existence of the !nontemporal metadata on the instruction tells the optimizer and code generator that this load is not expected to be reused in the cache. The code generator may select special instructions to save cache bandwidth, such as the MOVNT instruction on x86.

That would mean the flag can be added or removed arbitrarily ("this load is not expected to be reused in the cache" -- but no semantic constraints or anything). But that's clearly wrong. LLVM doesn't acknowledge in the slightest the extra UB that can be caused by non-temporal stores (https://github.com/llvm/llvm-project/issues/64521). Therefore I have zero confidence that anyone thought about how !nontemporal interacts with all the LLVM passes that work on load (almost all of which probably just ignore the attribute entirely). I'm not even aware of any cross-platform memory model with support for nontemporal stores that they could be using here -- and they clearly need a cross-platform memory model since they are doing optimizations in the context of a C++11-style model.

RalfJung avatar Feb 25 '24 16:02 RalfJung

SDE ERROR:  TID: 1064 executed instruction with an unaligned memory reference to address 0x7f27229035e0 INSTR: 0x562d8a5e21f3: IFORM: VMOVNTPS_MEMf32_ZMMf32_AVX512 :: vmovntps zmmword ptr [rax], zmm0
	IMAGE:    /checkout/target/x86_64-unknown-linux-gnu/release/deps/core_arch-59198cd2fc79a24a
	FUNCTION: _ZN9core_arch9core_arch3x867avx512f5tests20test_mm512_stream_ps20test_mm512_stream_ps17hb7f0b28acc824410E.llvm.13799798511543115899
	FUNCTION ADDR: 0x562d8a5e21c0

Hm, yes, this requires alignment, but that shouldn't be new...?

RalfJung avatar Feb 25 '24 17:02 RalfJung

I think I found where LLVM defines the x86 intrinsics: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/IntrinsicsX86.td.

I found nothing with "stream" in the name, and the only "movnt" is int_x86_mmx_movnt_dq, probably accessible via llvm.x86.mmx.movnt.dq, which I assume is not the right thing.

There seem to be already quite a few asm! in stdarch so I guess using that here is acceptable? IMO it's better than just using normal loads since presumably people actually want the streaming semantics when using this operation.

RalfJung avatar Jun 21 '24 12:06 RalfJung

Awesome, thanks. :)

After the next stdarch bump we can then remove the intrinsic from rustc.

RalfJung avatar Jun 21 '24 15:06 RalfJung

The intrinsic is only broken on x86, it still has value on other targets.

Amanieu avatar Jun 21 '24 16:06 Amanieu

Hm, fair. Maybe we should then document the intrinsic as "it is semantically equivalent to a regular load, just a hint", and on x86 actually compile it to just a load since that architecture doesn't have a "just a hint" version of this. For all other architectures we'd have to check whether what LLVM does there is sensible or not.

RalfJung avatar Jun 21 '24 21:06 RalfJung

@Amanieu any chance we could get a stdarch bump in the rustc repo that includes this change? :)

RalfJung avatar Jul 13 '24 16:07 RalfJung

We're waiting on a bootstrap bump that should happen next week.

Amanieu avatar Jul 13 '24 17:07 Amanieu