
[RFC] AtomicPerByte (aka "atomic memcpy")

Open m-ou-se opened this issue 3 years ago • 118 comments

Rendered

m-ou-se avatar Aug 14 '22 17:08 m-ou-se

cc @ojeda

bjorn3 avatar Aug 14 '22 17:08 bjorn3

This could mention the atomic-maybe-uninit crate in the alternatives section (cc @taiki-e).

ibraheemdev avatar Aug 14 '22 19:08 ibraheemdev

With some way for the language to express "this type is valid for any bit pattern" (which project safe transmute will presumably provide, and which already exists in the ecosystem as bytemuck, zerocopy, and probably others), I'm wondering if it would be better to return an AtomicPerByteRead<T>(MaybeUninit<T>), for which we/the ecosystem could provide a safe into_inner (returning a T) if T is valid for any bit pattern.

This would also require removing the safe uninit method. But you could always presumably do an AtomicPerByte<MaybeUninit<T>> with no runtime cost to passing MaybeUninit::uninit() to new.

That's extra complexity, but means that with some help from the ecosystem/future stdlib work, this can be used in 100% safe code, if the data is fine with being torn.
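
A minimal sketch of that shape is below. AnyBitPattern here is a hypothetical marker standing in for whatever project safe transmute / bytemuck / zerocopy would provide, and the safety of into_inner relies on the proposal's premise that the stored value was fully written (possibly torn, but never left uninitialized):

use std::mem::MaybeUninit;

/// Hypothetical marker: every initialized bit pattern is a valid T.
/// (In the ecosystem this role is played by bytemuck's AnyBitPattern
/// or zerocopy's FromBytes.)
pub unsafe trait AnyBitPattern: Copy {}
unsafe impl AnyBitPattern for u8 {}
unsafe impl AnyBitPattern for f32 {}

/// The suggested return type: a possibly-torn but fully written value.
pub struct AtomicPerByteRead<T>(pub MaybeUninit<T>);

impl<T: AnyBitPattern> AtomicPerByteRead<T> {
    /// Safe only under the assumption that every byte was written by
    /// some valid T; genuinely uninitialized bytes are not covered,
    /// which is exactly the caveat raised in the next reply.
    pub fn into_inner(self) -> T {
        unsafe { self.0.assume_init() }
    }
}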

5225225 avatar Aug 14 '22 19:08 5225225

The "uninit" part of MaybeUninit is essentially not a bit pattern though. That's the problem. Even if a value is valid "for all bit patterns", you can't unwrap uninit memory into that type.

not without the fabled and legendary Freeze Intrinsic anyway.

Lokathor avatar Aug 14 '22 20:08 Lokathor

On the other hand, AnyBitPatternOrPointerFragment isn't a type we have, nor really a type we strictly need for this. Assuming tearing can't deinitialize initialized memory, then MaybeUninit would suffice I think?

T-Dark0 avatar Aug 14 '22 20:08 T-Dark0

note that LLVM already implements this operation: llvm.memcpy.element.unordered.atomic Intrinsic with an additional fence operation for acquire/release.

programmerjake avatar Aug 15 '22 01:08 programmerjake

The trouble with that intrinsic is that unordered is weaker than monotonic aka Relaxed, and it can't easily be upgraded. There's no "relaxed fence" if the ordering you want is Relaxed; and even if the ordering you want is Acquire or Release, combining unordered atomic accesses with fences doesn't produce quite the same result. Fences provide additional guarantees regarding other memory accessed before/after the atomic access, but they don't do anything to restore the missing "single total order" per address of the atomic accesses themselves.

comex avatar Aug 15 '22 02:08 comex

The trouble with that intrinsic is that unordered is weaker than monotonic aka Relaxed, and it can't easily be upgraded. There's no "relaxed fence" if the ordering you want is Relaxed; and even if the ordering you want is Acquire or Release, combining unordered atomic accesses with fences doesn't produce quite the same result. Fences provide additional guarantees regarding other memory accessed before/after the atomic access, but they don't do anything to restore the missing "single total order" per address of the atomic accesses themselves.

I'm very familiar with the standard Rust and C++ memory orderings, but I don't know much about llvm's unordered ordering. Could you give an example of unexpected results we might get if we were to implement AtomicPerByte<T>::{read, write} using llvm's unordered primitive and a fence? Thanks!

(It seems monotonic behaves identically to unordered for loads and stores?)

m-ou-se avatar Aug 15 '22 08:08 m-ou-se

cc @ojeda

Thanks! Cc'ing @wedsonaf since he will like it :)

ojeda avatar Aug 15 '22 10:08 ojeda

Unordered is not monotonic (as in, it has no total order across all accesses), so LLVM is free to reorder loads/stores in ways it would not be allowed to with Relaxed (it behaves a lot more like a non-atomic variable in this sense)

In practical terms, in single-thread scenarios it behaves as expected, but when you load an atomic variable with unordered where the previous writer was another thread, you basically have to be prepared for it to hand you back any value previously written by that thread, due to the reordering allowed.

Concretely, I don't know how we'd implement relaxed ordering by fencing without having that fence have a cost on weakly ordered machines (e.g. without implementing it as an overly-strong acquire/release fence).

That said, I think we could add an intrinsic to LLVM that does what we want here. I just don't think it already exists.

(FWIW, another part of the issue is that this stuff is not that well specified, but it's likely described by the "plain" accesses explained in https://www.cs.tau.ac.il/~orilahav/papers/popl17.pdf)

thomcc avatar Aug 15 '22 17:08 thomcc

CC @RalfJung who has stronger opinions on Unordered (and is the one who provided that link in the past).

I think we can easily implement this with relaxed in compiler-builtins though, but it should get a new intrinsic, since many platforms can implement it more efficiently.

thomcc avatar Aug 15 '22 19:08 thomcc

We already have unordered atomic memcpy intrinsics in compiler-builtins. For 1, 2, 4 and 8 byte access sizes.

bjorn3 avatar Aug 15 '22 20:08 bjorn3

I'm not sure we'd want unordered, as mentioned above...

thomcc avatar Aug 15 '22 20:08 thomcc

To clarify on the difference between relaxed and unordered (in terms of loads and stores), if you have

static ATOM: AtomicU8 = AtomicU8::new(0);
const O: Ordering = ???;

fn thread1() {
    ATOM.store(1, O);
    ATOM.store(2, O);
}

fn thread2() {
    let a = ATOM.load(O);
    let b = ATOM.load(O);
    assert!(a <= b);
}

thread2's assertion will never fail if O is Relaxed, but it could if O is (the hypothetical) Unordered.

In other words, for unordered, it would be legal for 2 to be stored before 1, or for b to be loaded before a. In terms of fences, there's no fence that "upgrades" unordered to relaxed, although I believe (but am not certain) that stronger fences do apply to it.

thomcc avatar Aug 16 '22 02:08 thomcc

something that could work but not be technically correct is:

compiler acquire fence
unordered atomic memcpy
compiler release fence

those fences are no-ops at runtime, but prevent the compiler from reordering the unordered atomics -- assuming you're on any modern CPU (except Alpha, IIRC) it will behave like relaxed atomics, because that's what standard load/store instructions do.
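
As a minimal sketch of that pattern in Rust: the per-byte unordered copy has no stable spelling, so a plain copy stands in for it here purely as a placeholder.

use std::ptr;
use std::sync::atomic::{compiler_fence, Ordering::{Acquire, Release}};

/// Placeholder for a per-byte unordered atomic copy (e.g. LLVM's
/// llvm.memcpy.element.unordered.atomic); Rust has no stable way to
/// emit it, so a plain copy stands in here.
unsafe fn unordered_byte_copy(dst: *mut u8, src: *const u8, len: usize) {
    ptr::copy_nonoverlapping(src, dst, len);
}

/// The pattern described above: bracket the copy with compiler-only
/// fences so the compiler cannot move memory accesses across it,
/// relying on the hardware's ordinary load/store behavior for the rest.
unsafe fn fenced_copy(dst: *mut u8, src: *const u8, len: usize) {
    compiler_fence(Acquire);
    unordered_byte_copy(dst, src, len);
    compiler_fence(Release);
}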

programmerjake avatar Aug 16 '22 03:08 programmerjake

Those fences aren't always no-ops at runtime, they actually emit code on several platforms (https://github.com/rust-lang/rust/issues/62256). It's also unclear what can and can't be reordered across compiler fences (https://github.com/rust-lang/unsafe-code-guidelines/issues/347), certainly plain stores can in some cases (this is easy to show happening in godbolt).

Either way, my point has not been that we can't implement this. We absolutely can and it's probably even straightforward. My point is just that I don't really think those existing intrinsics help us do that.

thomcc avatar Aug 16 '22 03:08 thomcc

I like MaybeAtomic, but following C++ with AtomicPerByte sounds reasonable. The LLVM guys started something similar in 2016: https://reviews.llvm.org/D27133

tschuett avatar Aug 18 '22 20:08 tschuett

CC @RalfJung who has stronger opinions on Unordered (and is the one who provided that link in the past).

Yeah, I don't think we should expose Unordered to users in any way until we are ready and willing to have our own concurrency memory model separate from that of C++ (or until C++ has something like unordered, and it's been shown to also make sense formally). There are some formal memory models with "plain" memory accesses, which are similar to unordered (no total modification order, but race conditions allowed), but I have no idea if those are an accurate model of LLVM's unordered accesses. Both serve the same goal though, so there's a high chance they are at least related: both aim to model Java's regular memory accesses.

We already have unordered atomic memcpy intrinsics in compiler-builtins. For 1, 2, 4 and 8 byte access sizes.

Well I sure hope we're not using them in any way that actually becomes observable in program behavior, as that would be unsound.

RalfJung avatar Aug 20 '22 16:08 RalfJung

For specific types the load granularity can be larger than a byte and we will likely want to take advantage of that, so the name AtomicPerByte might not fit well.

ibraheemdev avatar Aug 20 '22 17:08 ibraheemdev

@ibraheemdev that is covered by the as-if rule. It is impossible for the program to tell whether the load is truly per-byte, or if adjacent reads are merged into 2-byte reads (or larger).

RalfJung avatar Aug 20 '22 17:08 RalfJung

I think it's worth mentioning (if for nothing else than for it to have been mentioned) that an alternative to address the core problem of "how can we implement a seqlock?" is to make racy reads/writes on MaybeUninit act more like LLVM's undef (for lack of a better comparison). That is, instead of a racy read of a MaybeUninit<T> being instant global UB, it simply gets an undefined value (which is valid for MaybeUninit) and the UB only "happens" (not sure of better terminology) if the undefined value is ever observed.
This would allow something very close to how a seqlock is classically handled:

use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering::{Acquire, Relaxed, Release}};

pub struct SeqLock<T> {
    seq: AtomicUsize,
    data: UnsafeCell<MaybeUninit<T>>,
}

unsafe impl<T: Copy + Send> Sync for SeqLock<T> {}

impl<T: Copy> SeqLock<T> {
    /// Safety: Only call from one thread.
    pub unsafe fn write(&self, value: T) {
        self.seq.fetch_add(1, Relaxed);
        // Plain (non-atomic) write of the data; under the proposed model
        // a racy reader gets an undefined value rather than immediate UB.
        self.data.get().write(MaybeUninit::new(value));
        self.seq.fetch_add(1, Release);
    }

    pub fn read(&self) -> T {
        loop {
            let s1 = self.seq.load(Acquire);
            // Plain (non-atomic, possibly racy) read of the data.
            let data = unsafe { self.data.get().read() };
            let s2 = self.seq.load(Relaxed);
            if s1 & 1 == 0 && s1 == s2 {
                // Sequence number even and unchanged: no concurrent write
                // happened, so the bytes we read are fully initialized.
                return unsafe { data.assume_init() };
            }
        }
    }
}

This has the nice advantages of not having to figure out atomics or how llvm's unordered memcpy actually works, while also not actually incurring the performance penalty of having to do a literal bytewise atomic memcpy, which can be significantly slower than a more tuned implementation.
Again, I'm not saying that this is strictly the best option; I just think that it's a fairly useful alternative and at the very least worth drawing attention to.

Kixiron avatar Aug 23 '22 16:08 Kixiron

@Kixiron That wouldn't suffice, as Ralf explains here: https://github.com/rust-lang/rfcs/pull/3301#discussion_r950713536

m-ou-se avatar Aug 23 '22 16:08 m-ou-se

It wouldn't suffice on its own, but I think it could be made to work by also placing a suitable fence after the data load and another fence in the write, or something like that?

I do agree that the LLVM semantics for data races are nicer than those of C++. Sadly, that model is also not very well explored, so for now I think it would be better to stick to C++ -- until we are confident enough to define our own concurrency memory model.

RalfJung avatar Aug 23 '22 22:08 RalfJung

It wouldn't suffice on its own, but I think it could be made to work by also placing a suitable fence after the data load and another fence in the write, or something like that?

Doesn't this still have the issue of plain RW not actually interacting with fences: https://github.com/rust-lang/rfcs/pull/3301#discussion_r950100411? A memory model where racy read is undefined value instead of undefined behaviour could work, but you'll still need read-dont-modify-write on the sequence number read

cbeuw avatar Aug 23 '22 22:08 cbeuw

you'll still need read-dont-modify-write on the sequence number read

Yes that is what I meant.

RalfJung avatar Aug 23 '22 22:08 RalfJung

Most of the discussion so far is about the general concept and implementability of an "atomic per byte memcpy", but I'd also like some feedback on the API design itself. E.g. does the usage of MaybeUninit make sense, or should we have a separate MaybeTeared or something like that? Is it ergonomic enough, especially for slices? Etc.

m-ou-se avatar Aug 25 '22 07:08 m-ou-se

I think the reason the discussion has focused on implementability is that the API is pretty good for the use case. I certainly have no objections to it.

thomcc avatar Aug 25 '22 17:08 thomcc

My only API concern is the asymmetry around reads returning MaybeUninit<T> but writes taking T. But store_from seems to cover that well enough, and the asymmetry might also serve as a speedbump for people to go read the docs and figure out what is happening.
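
For reference, a rough sketch of the shape being discussed follows; the method names and signatures are approximate rather than the RFC's exact API, and the bodies are stubbed out:

use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::Ordering;

pub struct AtomicPerByte<T>(UnsafeCell<MaybeUninit<T>>);

impl<T> AtomicPerByte<T> {
    pub fn new(value: T) -> Self {
        Self(UnsafeCell::new(MaybeUninit::new(value)))
    }
    /// A load may observe a torn mix of bytes from different stores,
    /// hence the MaybeUninit in the return type.
    pub fn load(&self, _ordering: Ordering) -> MaybeUninit<T> {
        todo!()
    }
    /// A store takes a fully initialized T...
    pub fn store(&self, _value: T, _ordering: Ordering) {
        todo!()
    }
    /// ...while store_from restores the symmetry by accepting a
    /// possibly-torn value read out of another AtomicPerByte.
    pub fn store_from(&self, _value: &MaybeUninit<T>, _ordering: Ordering) {
        todo!()
    }
}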

RalfJung avatar Aug 26 '22 21:08 RalfJung

Most of the discussion so far is about the general concept and implementability of an "atomic per byte memcpy", but I'd also like some feedback on the API design itself. E.g. does the usage of MaybeUninit make sense, or should we have a separate MaybeTeared or something like that? Is it ergonomic enough, especially for slices? Etc.

It does seem kind of odd to me that unsafe code is needed to unpack the result of a byte-wise load even for types where that is actually always safe because any correctly sized stream of bytes is a valid representation.

For a caricatural example, I do not see why I should need unsafe code to load the contents of an AtomicPerByte<[u8; LENGTH]> into a &mut [u8; LENGTH], given that this operation is UB-free by definition.

In the actual use case that I have in mind (well, more exactly a client of that lib), it would be convenient to load an &[f32] byte-wise into an &mut [f32], which should be safe as well since any bag of 4 bytes is a valid f32 (though possibly a surprising one).

Basically, I second comment https://github.com/rust-lang/rfcs/pull/3301#issuecomment-1214439068 regarding the idea that any type that allows safe-transmute from a suitably sized stream of bytes, should be safe to load from AtomicPerByte storage. MaybeUninit does not model that guarantee well as it's designed to also handle uninitialized memory, which this abstraction cannot emit.

HadrienG2 avatar Sep 13 '22 13:09 HadrienG2

It does seem kind of odd to me that unsafe code is needed to unpack the result of a byte-wise load even for types where that is actually always safe because any correctly sized stream of bytes is a valid representation.

We have to be careful here, because not every stream of bytes is valid for [u8; N] -- if any of the bytes is uninitialized, that would be invalid. (So e.g. transmuting (u8, u16) to u32 or to [u8; 4] is not sound.)

However, [u8; N] has the property that mixing the bytes of any two or more valid instances yields a valid instance. [bool; N] also has that property. But [NonZeroU32; N] does not. For the types that have that property we could offer an API that skips the MaybeUninit.
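
A minimal sketch of what such an API could look like (the trait and function names are hypothetical, not part of the RFC):

use std::mem::MaybeUninit;

/// Hypothetical marker: any byte-wise mix of valid instances of T is
/// itself a valid instance of T.
pub unsafe trait TearSafe: Copy {}
unsafe impl<const N: usize> TearSafe for [u8; N] {}
unsafe impl<const N: usize> TearSafe for [bool; N] {}
// Deliberately NOT implemented for [NonZeroU32; N]: mixing the bytes of
// two valid values can produce a zero, which is invalid.

/// For such types, a torn-but-fully-written load can be unwrapped safely.
pub fn assume_valid<T: TearSafe>(torn: MaybeUninit<T>) -> T {
    // Sound only because every byte came from some valid T, no byte was
    // left uninitialized, and T tolerates arbitrary byte mixes.
    unsafe { torn.assume_init() }
}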

RalfJung avatar Sep 18 '22 12:09 RalfJung