Stabilize volatile copy and set functions
Stabilize the volatile_copy_memory, volatile_copy_nonoverlapping_memory, and volatile_set_memory intrinsics as ptr::copy_volatile, ptr::copy_nonoverlapping_volatile, and ptr::write_bytes_volatile, respectively.
I initiated discussion on this back in June here: https://internals.rust-lang.org/t/pre-rfc-stabilize-volatile-copy-and-set-functions/9872
Unfortunately, it fell off my plate, but I'd like to dust it off and move it forward.
Signed-off-by: Dan Cross [email protected]
cc @rust-lang/wg-unsafe-code-guidelines
Can't these new methods be implemented (in user code) in terms of the ones we are already exposing? If yes, the RFC should say why new APIs should be added to libstd, given the motivation is an extremely niche use-case. If no, the RFC should explain why the "obvious" implementation in terms of read_volatile and write_volatile is not correct.
EDIT: Ah, I think I found the answer in the "alternatives" section:
inelegant, less likely to perform well, and opens up the possibility of bugs
At least the first and third point can easily be taken care of by writing this in a crate once and auditing it properly. In terms of performance, compilers don't touch volatile accesses anyway (that's their point), so this seems a questionable argument to me.
In terms of performance, compilers don't touch volatile accesses anyway (that's their point), so this seems a questionable argument to me.
Which is exactly why these intrinsics are needed. They are a single operation, so there's nothing to optimize, but a hand-written loop does need to be optimized, yet can't be, as the writes are volatile.
@spunit262
Which is exactly why these intrinsics are needed. They are a single operation, so there's nothing to optimize, but a hand-written loop does need to be optimized, yet can't be, as the writes are volatile.
The intrinsic is a single operation in the IR that lowers to lots of assembly, and likely pretty much the same assembly as a hand-coded loop. I see no reason why the intrinsic should perform any better. Without an experiment demonstrating a performance difference between using the intrinsic and using the open-coded version, I do not think you can reasonably claim that there will be any difference.
@RalfJung what about unaligned volatile operations? Right now {read, write}_volatile assume that the input pointer is aligned, and you would need copy_volatile at least to get unaligned volatile reads and writes (and copy_nonoverlapping_volatile to get efficient unaligned volatile reads and writes).
While it's not properly documented (nothing about these intrinsics is), they do in fact lower to IR that requires the pointers to be aligned. This is also consistent with how the non-volatile equivalents, and pointer functions in general, work.
Furthermore, you can also get a version without alignment requirements (no matter whether it's an intrinsic or implemented in library code) by treating the memory as having n * size_of::<T>() elements of type u8 instead (or MaybeUninit<u8> to be on the safe side w.r.t. uninitialized memory). That does byte-sized accesses instead of T-sized ones (if the latter are possible at all on the hardware), but nothing about volatile_{copy,set}_memory (neither the current docs, nor LLVM's docs, nor this RFC) guarantees the access size anyway, so if you're worried about tearing, those functions don't help you.
@RalfJung I've already looked at the assembly, these intrinsics lower to a call to memset/memcpy/memmove or a small number of loads and stores.
In terms of performance, compilers don't touch volatile accesses anyway (that's their point), so this seems a questionable argument to me.
Which is exactly why these intrinsics are needed. They are a single operation, so there's nothing to optimize, but a hand-written loop does need to be optimized, yet can't be, as the writes are volatile.
A volatile memcpy is a single operation in the IR, but it is still going to lower to N distinct reads and writes, which is the exact same thing a hand-coded performant volatile memcpy implementation would lower to.
We discussed this in the lang meeting today. The consensus was that since this is currently possible to be written using existing stable volatile things (https://github.com/rust-lang/rfcs/pull/2728/files#r305730848), that there's no lang impact here. As such, untagging lang and leaving to libs to decide whether this is worth doing.
I hope what I said over there back then is correct, given that nobody responded there.^^
You are talking about https://doc.rust-lang.org/stable/std/ptr/fn.write_volatile.html (stable since 1.9.0), aren't you?
Yeah you can write memcopy_volatile using read_volatile and write_volatile. This is not a necessary intrinsic.
That said, most of the standard library isn't necessary, but we have it there anyway, and we should have this too.
@Lokathor absolutely, I didn't mean that as an argument against adding these functions. I just want to be sure that we all agree that the new functions can be implemented in terms of already existing ones.
In fact I think it would be good to explicitly say in the docs how they could be implemented in terms of write_volatile. Something like "this function is equivalent to: [...]".
(Oops, i was more saying that to MikailBag)
Hi all, what's the status here? It seems like there were some outstanding points of feedback on the text of the RFC that could be addressed and some questions about whether it was appropriate for the standard library. Is that accurate or were there other concerns with offering this kind of functionality? For context: this has come up several times in implementing shared memory code for an OS, including most recently in a library which manages guest virtual machine memory.
Re: appropriateness, I've prepared a small example showing the differing output for intrinsic vs. user-written version. Of note is that LLVM lowers the volatile intrinsic to a memcpy (presumably with the guarantee that it won't be elided if inlined) which can take full advantage of wider registers while AFAICT the manually implemented loop can't be similarly optimized. I'm not sure I understand how a user library would get these optimizations without losing LLVM's non-elision guarantees for volatile operations, so I'm hoping that I'm either missing something there or that there's a path to offering stable intrinsic wrappers in std.
As far as I am concerned (for whatever that is worth), I think the main open question here is the general direction for our volatile operations. For most but not all use-cases, just saying "the access will not be eliminated" is insufficient; one also needs to know exactly how many machine accesses of which size the operation will compile to -- a 96-byte read_volatile will definitely not be a single load, but how exactly will it tear? Currently we don't specify this. We don't even specify any guarantees for when a volatile access will definitely not tear, even though this is crucial for the MMIO use-case.
When you talk about automatically using wider types, that sounds like it would change how large accesses tear, so it does touch on this question.
In the past there was a proposal to replace the type-generic read_volatile/write_volatile by operations that work more like our atomic intrinsics, which do guarantee not to introduce tearing. However, this proposal was never pushed to completion. There might also be other approaches to solve this problem. But I do think that our current API is rather inadequate for the task it is designed to help with -- at the very least we need better documentation on when tearing can and cannot happen. (And I am not at all an expert on those rules; I can just note their glaring absence in the current docs.)
I'm a bit worried about extending our volatile story while such basic questions remain unanswered.
Any sort of atomic-like system for volatile accesses would maybe ensure that there's no tearing, and definitely ruin the times of embedded developers who are currently using repr(transparent) to simply declare volatile accesses of more structured types than raw integers, and it would still be exactly as unsafe and device specific as what we have now. Overall, a big step backwards.
Separately, if we should have these new intrinsics, I don't have a strong opinion.
Ralf is certainly not wrong that the access width must be specified for these to be broadly useful. As one example, I work with a device where part of the mmio must be accessed with u8, and another part must not be accessed with u8, and then the rest doesn't care much. In the absence of info about what will happen with these intrinsics I'd just have to write my own loops anyway.
volatile accesses of more structured types than raw integers
the access width must be specified for these to be broadly useful
How do you reconcile these two statements? Do developers using "structured types" for volatile access expect that a single access width is used for the whole structure? Or are they expecting that individual fields of the structure will be written with separate access widths?
If the former, then I don't really understand why switching to a system like we have for atomics would be a step backwards, since you can implement the "generic volatile copy" on top of eg. volatile 32-bit accesses, and get the same guarantees? If the latter, then isn't that code just wrong, since those guarantees are not provided?
repr(transparent) wrapper structs are used over top of a raw integer type, and then the appropriate methods are provided for altering the bits within that integer. The volatile locations are then declared to directly be of these various wrapper types.
The same would work on top of VolatileI32 etc. types though, wouldn't it?
I suppose I don't know what the exact API for that type would be, but if you're saying that someone should repr(transparent)-wrap that VolatileI32 value in some new wrapper type that offers the correct bit-manipulation and also has some sort of read and write methods that pass through to read and write on the VolatileI32, that's not a good plan. Often you have the same data at multiple locations (eg: controlling a series of identical peripherals, or something like that). If more than one location stores the same data type you don't want the manipulation of that data type to be linked to a particular location. So to have a good experience there needs to be a separation between the data type that you're manipulating a value of within the CPU and a particular address which stores one of these data values.
I think the fundamental thing here is that solving "the tearing issue" is like the least important part of the entire MMIO story for Rust. I won't say that it's not a problem at all, but I will say that no actual MMIO API for a device, that's like peer reviewed for quality and such, and trusted by the community, will ever end up with declarations that are forced to tear, if they care about tearing at all. Because let's also be clear: you might not care about tearing. You might be totally fine with a tear happening. It depends on the device, it depends on the situation. I spend more of my time being concerned that a careless call to Vec::with_capacity can abort the process by accident than I do worrying about volatile being used to make a torn read.
Back to the main topic: These bulk operations are not at all "unusable" if they're allowed to change access width. They're just slightly less usable. Either way, the docs just need to specify if a width change is allowed or not. There certainly are situations that I've encountered where I've wanted the bulk operations, and where I have not cared at all about the access width. I just needed a bulk transfer to happen, and I didn't want to think about the details; I wanted LLVM to do the thinking. In fact, I would have been entirely happy for LLVM to do 256-byte transfers rather than 4-byte transfers to complete the copy. Other times, I happen to know that I need it to emit precisely an STRB instruction with a given region of memory, and I wouldn't be happy with doing a bulk copy with any other width.
So just have the bulk ops say that they do or don't allow a change in access width and then we'll be fine.
I agree that trying to specify access width for larger ranges or composite types is fraught -- any decision is likely to block off optimizations using wider hardware in the future. For the use cases that have come up in Fuchsia we've had no need for specific access width in the bulk shared-memory operations I've seen, and it would be fine to leave the width deliberately unspecified/unstable. As an example, the current volatile_copy_nonoverlapping_memory intrinsic lowers to a non-elided memcpy which AFAIK makes no guarantees about access width.
IMO, if doing MMIO or another operation that relies on specific (non-)tearing behavior, it's better to use the existing intrinsics with explicit-width-and-alignment pointers. Also IMO, I suspect that Rust should not attempt to define a type like Volatile<T>, given all of the spec ambiguities already cited here that AIUI currently affect C++. All of the use cases I've seen would be satisfied by abstractions built atop raw pointer access, using bulk volatile access for variable-length regions.
One possible point of confusion that came up from an offline discussion: is it intended that the new functions would be generic, or that they would take *mut u8? I had assumed the latter and was confused about the need to specify volatile behavior for complex types, but I think that might have been an incorrect assumption.
I'm sorry, this totally slipped off my radar. The main motivation when I originally wrote this was to have an unelideable memcpy available from a stable compiler by leveraging the intrinsic. Issues of tearing and so forth seem mostly orthogonal; if that matters for whatever one is trying to apply this to, perhaps one is better served with a hand-rolled loop or sequence of statements.
It made sense to me that this could be generic over some type T in the same way that std::ptr::copy is generic over T.
I think the fundamental thing here is that solving "the tearing issue" is like the least important part of the entire MMIO story for Rust. I won't say that it's not a problem at all, but I will say that no actual MMIO API for a device, that's like peer reviewed for quality and such, and trusted by the community, will ever end up with declarations that are forced to tear, if they care about tearing at all.
But this is just because what Rust happens to do wrt tearing is sensible, right?
Right now, Rust could implement write_volatile and read_volatile as a sequence of individual 1-byte accesses (maximal tearing) and this would not violate any part of the spec or docs. I would be extremely surprised if this change would actually be okay for those MMIO users where tearing matters. In other words, right now all those MMIO users depend on unspecified behavior. That's hardly a good situation to be in.
Or am I missing something?
Issues of tearing and so forth seem mostly orthogonal; if that matters for whatever one is trying to apply this to, perhaps one is better served with a hand-rolled loop or sequence of statements.
The problem is that there currently is no way to hand-roll that loop. We simply don't offer the APIs for that.
You can use read_volatile::<i32> and hope that this will be a single, untorn 4-byte access. But hoping is the best you can do here with current Rust.
Right now, Rust could implement write_volatile and read_volatile as a sequence of individual 1-byte accesses (maximal tearing) and this would not violate any part of the spec or docs.
Wellllll, so people are relying on the fact that when rustc doesn't say / care then LLVM rules generally take over. And LLVM has the rule:
the backend should never split or merge target-legal volatile load/store instructions.
https://llvm.org/docs/LangRef.html#volatile-memory-accesses (close to the end of the section).
So people are relying on rustc keeping to that, and not splitting or merging any target-legal access, such as an i32 access on most any target.
So I would advocate that Rust officially adopt at least that much of the rules to apply to all backends.
Or the backend could just error out if volatile is used and the backend doesn't support it, I guess? I mean a backend without volatile support can still build plenty of other programs, so maybe that's optional for alternative backends, whatever, that's for T-compiler to care about I think.
Right now, Rust could implement write_volatile and read_volatile as a sequence of individual 1-byte accesses (maximal tearing) and this would not violate any part of the spec or docs.
I think this would be bad for MMIO use cases of those functions, although I agree with @Lokathor that in practice most users of volatile in Rust are relying on the LLVM spec as well given our underspecification (it's what I did to convince myself the copy_nonoverlapping version was what we needed). Extending ours to reflect that somehow seems reasonable to me but I could be missing something.
In my case, I'm not doing MMIO; I just want to ensure that LLVM never decides to eliminate a "dead store" to memory that it can't see used (the unelidable memcpy @dancrossnyc refers to, IIUC). For that purpose it would probably be an unacceptable performance regression to split the operations like that, but I wouldn't rely on the language specification there -- I'd rely on LLVM's desire to maintain performance and the lack of access-width guarantees for the copy_nonoverlapping_volatile variant. (If I'm honest, that's the only function I've needed.)
the backend should never split or merge target-legal volatile load/store instructions.
"should" seems awfully weak for something that code relies on. And "target-legal" also seems underdefined? This means when I write Rust code that uses volatile accesses I need to know the target exactly to reason about tearing -- this is completely against how Rust usually handles target differences; e.g. for Atomic* we make sure that either the target supports what the program needs or the build fails. The current volatile API does not even let the program express what it needs, though.
But I don't actually have much of a stake in the volatile game, so if the people relying on volatile are happy with such a guarantee, that's fine for me. However, we should put this guarantee into the Rust docs, since "falling back to what LLVM guarantees" will easily give wrong results. (And I think this is T-lang territory, since it is about the guarantees the language provides.)
I agree that "should" and "target-legal" are vague enough that I'd prefer them to be more specified if possible. However, as you've said yourself before, we can't tell people we're going to give them something that LLVM won't implement.
Maybe the LLVM devs would be willing to make it more strongly specified? I've never asked them.
The current volatile API does not even let the program express what it needs, though.
I'm not quite clear on what you mean here. The type of the pointer determines the type of the access that you want, so it seems expressible to me.
for Atomic* we make sure that either the target supports what the program needs or the build fails.
While that's true, I think that's just part of the story. To me the full story is that by offering specific AtomicFoo types we can give them a fully safe API. Creation is safe, accessing is safe. The safety isn't sensitive to any particular address being used for the atomic value. Any aligned address of the correct span is just fine, on any device that supports the atomic ops at all. We can know the minimum alignment and memory span statically without knowing anything about the device that the code will run on just by having the reference in the first place. It all lines up nicely.
Contrast with something like a VolatileU16 type. Can we make this have a fully safe API? No. We can make creation or access be safe, but we can't make both of them safe. Fine, I have to write a little unsafe code, so what are the safety rules? You have to go consult the hardware data sheets. Does knowing the target triple help me? No, the data sheets are specific to an exact device, not a general target triple. Compiling for the thumbv4t-none-eabi target does not mean that 0x0400_0000 is necessarily an MMIO address for anything at all. The code is required to be more specific than a rustc target profile ever really gets.
Which means that my summary would be:
- If we use pointers and read_volatile/write_volatile: we have to write device-specific code, and the target triple doesn't help us much.
- If we use a restrictive set of VolatileFoo types similar to the AtomicFoo types: we have to write device-specific code, and the target triple doesn't help us much.