&str and &[u8] have the same layout
Currently, str and [u8] are promised to have the same layout, but &str and &[u8] are not promised to have the same layout. The std currently assumes that they are promised to have the same layout (https://doc.rust-lang.org/src/core/str/converts.rs.html#172), so this change would have no impact beyond codifying what is already in practice. This PR defines &str and &[u8] to have the same layout, though what that layout is continues to be unspecified.
There are some further steps here that I didn't take:
- Every rule about slices should probably also apply to str. I have added str in several places in the reference where it otherwise referred to slices, but likely the definition of a slice should also simply include str. This is a bigger conversation and frankly unimportant if...
- Some version of https://github.com/rust-lang/rust/pull/107939 ever gets stabilized. In that case, all of this doesn't matter and str would be removed from the reference. This seems to me to be obviously the better choice.
In any case, this PR represents a fairly incrementalist approach.
Thanks for the insight of those on the Zulip thread here
@rustbot label: +I-lang-nominated +T-lang
@rustbot label: +I-lang-easy-decision
Unknown labels: I-lang-easy-decision
FTR, the standard library has every right to make assumptions about the implementation of the language beyond what the language guarantees, because it is intrinsically tied to rustc. Not necessarily a point against making a decision here, but I don't think it's a strong point in favour of stabilizing the equivalence either.
FTR, the standard library has every right to make assumptions about the implementation of the language beyond what the language guarantees, because it is intrinsically tied to rustc. Not necessarily a point against making a decision here, but I don't think it's a strong point in favour of stabilizing the equivalence either.
Agreed -- I am actually going to reword this to make it clear that I mean this isn't a change for rustc, only a codification of existing decisions.
The std currently assumes that they are promised to have the same layout (https://doc.rust-lang.org/src/core/str/converts.rs.html#172)
The layout for transmute doesn't matter. I guess the safety comment was about str and [u8] having the same layout. For the transmute what matters is:
- The size of &str and &[u8] (that's part of the layout, but alignment doesn't matter as explained in the documentation of transmute).
- The validity invariant of &[u8] must imply the validity invariant of &str (i.e. all valid values of type &[u8] must be valid values of type &str), this is where the layout of str and [u8] matters (among other things).
That's only for safety. For correctness, we also need valid values of &[u8] to have the same representation at &str for that same value. In other words the representation relation of &[u8] must be included in the one of &str (it's not enough for their domains to be included, they must map to at least the same values). In practice they are equal, but transmute only needs one direction (the one of the transmute).
So I'm not sure guaranteeing that &[u8] and &str have the same layout is a correct answer to making the std code look like code that users can write. That said, such a guarantee could be useful for other purposes, I'm just saying that the motivation in OP doesn't seem to justify the change.
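As a rough illustration of the two requirements listed above (equal size, and the same value being represented on both sides), here is a minimal sketch. The helper names `same_size_and_align` and `bytes_as_str` are hypothetical and exist only for this example. The size and alignment checks reflect what rustc does today, not something the Reference guaranteed before this PR.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical helper: true if `&str` and `&[u8]` agree on size and
/// alignment on the compiler building this program.
fn same_size_and_align() -> bool {
    size_of::<&str>() == size_of::<&[u8]>() && align_of::<&str>() == align_of::<&[u8]>()
}

/// Hypothetical helper: converts bytes to a string slice, checking the extra
/// invariant that `str` places on its contents (valid UTF-8).
fn bytes_as_str(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

fn main() {
    assert!(same_size_and_align());

    let bytes: &[u8] = b"hello";
    let s = bytes_as_str(bytes).expect("ASCII is valid UTF-8");

    // The conversion preserves the value: same address, same length.
    assert_eq!(s.as_ptr(), bytes.as_ptr());
    assert_eq!(s.len(), bytes.len());

    // SAFETY: `bytes` was validated as UTF-8 just above.
    let unchecked = unsafe { std::str::from_utf8_unchecked(bytes) };
    assert_eq!(unchecked, s);

    // Not every byte sequence is acceptable as the contents of a `str`.
    assert!(bytes_as_str(&[0xff]).is_none());
}
```

The checked path goes through `std::str::from_utf8`, so the sketch itself never depends on the layout of wide pointers. It only observes that the pointer and length survive the conversion.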
@ia0 What would you suggest for how the documented guarantees should be strengthened to match what the PR author obviously wants?
What would you suggest for how the documented guarantees should be strengthened to match what the PR author obviously wants?
I'm going to assume "what the PR author obviously wants" is "guarantees that transmuting between &str and &[u8] is safe". In that case the title of the PR should be "&str and &[u8] have the same representation relation".
The problem is that the Reference doesn't yet have this concept. Ralf asked for such a concept in https://github.com/rust-lang/reference/pull/1752#discussion_r1989485881:
Conceptually, what we eventually need is for every type a description of which byte sequences are valid for this type, and which value is represented by each valid byte sequence.
This was lost in the middle of a PR review so I guess it didn't get the attention it could have. But I'm assuming this request has gone through other channels given how critical it is for users writing unsafe code and relying solely on the Reference.
The only type AFAICT that has a representation relation defined at this time is bool under [type.bool.repr] (along with [type.bool.layout] to know it's a single byte, although it's somehow implicit already in [type.bool.repr]).
Today, one can somehow get close to the domain of the representation relation (i.e. which byte sequences are valid, but not which value they represent, which matters for unsafe code where some amount of correctness matters) by combining information about the type layout (its size, relative offset of fields, and discriminant representation, but not its alignment) and the bit validity. However the bit validity is currently often underspecified. For example I didn't find the bit validity of u8. I would expect a sentence "all initialized bytes are valid at u8" under [type.numeric.validity]. We would also need something like that for pointers and metadata such that we can express "what the PR author obviously wants" using type layout and bit validity. And ideally, we would also map the representation to the value (unsigned integers would take endianness into account for example, and signed integers would specify two's complement).
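To make "which byte sequences are valid, and which value each one represents" a bit more concrete, here is a small sketch for types where the answer is already well known. `decode_bool` is a hypothetical helper written for this example; the integer conversions are the standard `from_le_bytes`, `from_be_bytes`, and `from_ne_bytes` methods.

```rust
/// Hypothetical decoder spelling out the representation relation of `bool`:
/// 0x00 represents `false`, 0x01 represents `true`, and no other byte is valid.
fn decode_bool(byte: u8) -> Option<bool> {
    match byte {
        0x00 => Some(false),
        0x01 => Some(true),
        _ => None,
    }
}

fn main() {
    assert_eq!(decode_bool(0x00), Some(false));
    assert_eq!(decode_bool(0x01), Some(true));
    assert_eq!(decode_bool(0x02), None);

    // For unsigned integers every initialized byte sequence is valid, but the
    // value it represents depends on endianness.
    let bytes = [0x01, 0x00, 0x00, 0x00];
    assert_eq!(u32::from_le_bytes(bytes), 1);
    assert_eq!(u32::from_be_bytes(bytes), 0x0100_0000);

    // Signed integers use two's complement: the byte 0xff represents -1.
    assert_eq!(i8::from_ne_bytes([0xff]), -1);
}
```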
In my opinion, it would be better to wait until we have a notion of representation relation, such that all such guarantees for unsafe users can be specified in a uniform way. In the meantime, unsafe users should refer to the Unsafe Code Guidelines and other documentation like MiniRust, but this is somehow in opposition with https://github.com/rust-lang/unsafe-code-guidelines/pull/566#issuecomment-2849275158:
Frankly, my $.02 is that we deprecate the UCG as a whole and transition to making guarantees in the Reference (and other documents like the minirust spec for a more programmatic definition).
So maybe adding the representation relation to the Reference should be prioritized?
I think, though that's a valid goal, your point is really to make a more rigorous definition of "layout", particularly using the term "representation relation" instead with some well defined meaning. A change like that, as you note, would require some large work in the spec (at the minimum, looking at all the usages of "layout") and so I think should come in as a separate PR/RFC. Particularly, I'm comfortable saying "whatever 'layout' means, &str and &[u8] have the same one." I can amend the PR to make that clearer.
How does that sound?
your point is really to make a more rigorous definition of "layout"
That would be an editorial question. I'm not saying that. I'm saying a new concept of "representation relation" (in addition to "layout") should be added. How that's implemented is up to editors of the Reference. There are at least 3 options:
- Define one or more common concepts shared by both "layout" and "representation relation" to factorize common aspects like size, field offsets, and discriminant. At its extreme, all those subconcepts would be their own concepts. Each type defines its size, its alignment, its field offsets, its ABI, etc independently (properties can be specified on those concepts, e.g. size is a multiple of alignment). Then higher-level concepts such as "layout" or "representation relation" may use those definitions, possibly only packaging them without additional processing like "layout". (If I were to choose an option, it would be this extreme option.)
- Create a new independent concept of "representation relation". This will repeat some stuff with "layout", but there's already some form of repeating in the Reference (most of it unavoidable like the bool example).
- Try to butcher one concept into the other. Not sure how this would work, but I don't think it would be better than any of the 2 other options above.
I'm comfortable saying "whatever 'layout' means, &str and &[u8] have the same one."
But how would this solve the transmute problem? What transmute asks is that the sequence of bytes being transmuted is valid at both the source and destination type. The layout of those types is not what you need to satisfy this requirement. You need the validity invariant (aka the domain of the representation relation). The layout is neither sufficient (it needs bit validity) nor necessary (it talks about alignment).
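A small sketch of why equal layout is neither sufficient nor necessary, using types other than the ones under discussion. The helper names `bool_to_u8` and `bytes_to_u32` are hypothetical. `bool` and `u8` share size and alignment yet only one direction of transmute is sound, while `[u8; 4]` and `u32` can differ in alignment and the by-value transmute is still fine.

```rust
use std::mem::{align_of, size_of, transmute};

/// Hypothetical helper: sound, because every valid `bool` byte (0x00 or 0x01)
/// is also a valid `u8`.
fn bool_to_u8(b: bool) -> u8 {
    // SAFETY: same size, and the validity invariant of `bool` implies that of `u8`.
    unsafe { transmute::<bool, u8>(b) }
}

/// Hypothetical helper: sound even though `[u8; 4]` and `u32` may have
/// different alignment, because the transmute is by value.
fn bytes_to_u32(bytes: [u8; 4]) -> u32 {
    // SAFETY: same size, and every initialized 4-byte sequence is a valid `u32`.
    unsafe { transmute::<[u8; 4], u32>(bytes) }
}

fn main() {
    // Same size and alignment, but the reverse transmute (`u8` to `bool`)
    // would be undefined behavior for any byte other than 0 or 1.
    assert_eq!(size_of::<bool>(), size_of::<u8>());
    assert_eq!(align_of::<bool>(), align_of::<u8>());
    assert_eq!(bool_to_u8(true), true as u8);

    // Alignment of the source type plays no role here.
    assert_eq!(align_of::<[u8; 4]>(), 1);
    let bytes = [0x12, 0x34, 0x56, 0x78];
    assert_eq!(bytes_to_u32(bytes), u32::from_ne_bytes(bytes));
}
```

In real code the safe equivalents (`b as u8` and `u32::from_ne_bytes`) are preferable; the transmutes appear here only to show which property each one relies on.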
To be clear, I'm not against this change (it seems reasonable to me), I'm just arguing that we should not believe that it will solve the transmute problem. And thus we should document the motivation for this change (assuming changes to the Reference need to be motivated).
When I say "layout" (and I suspect when the std says the same), I am including field offsets in that -- that is part of the validity of the transmute. I should be more careful in the text about that, but since "layout" is still used vaguely in the reference, I think it would be best to wait on that.
So the transmute I referenced is saying "the offsets where the pointer and the length are stored are the same in &[u8] and &str". The std can currently say that is true as a point of fact and this RFC is to make that true as a point of reference.
I might not be following you perfectly though -- you seem much more versed in programming language theory -- so let me know if I'm missing your point!
When I say "layout" (and I suspect when the std says the same), I am including field offsets in that
Yes, layout is "size + align + field offset + discriminant" from layout.intro. I think the problem is on what follows.
that is part of the validity of the transmute
Indeed, &str and &[u8] having the same layout can be used as part of the argument (but doesn't need to).
saying "the offsets where the pointer and the length are stored are the same in &[u8] and &str"
This is not implied by &[u8] and &str having the same layout. The reference doesn't say anything about a possible pointer and metadata field for wide pointers. I would expect to see this under layout.pointer.unsized but it only talks about size and alignment (only giving a precise definition in a note). And if it did, it would also need to talk more precisely about the validity invariant of those fields.
In other words, while the sentence "&[u8] and &str have the same layout" could be used to prove a transmute between those types, it is neither necessary nor sufficient in theory, and can't be used in practice with today's Reference. On the contrary, the sentence "&[u8] and &str have the same validity invariant" is exactly what's needed to prove a transmute between those types. The Reference doesn't have this notion yet.
So you could see this PR as a step towards proving the transmute with the Reference, but it's not a step perfectly aligned with that goal, because it also guarantees something about alignment which is not needed (and currently not guaranteed although true in practice now and most probably always).
@ia0 Sorry, I do not think you are contributing anything further here. It seems to be an obviously preexisting problem. Please open a PR against the reference to address the concerns you have.
Given the existing text that
String slices are a UTF-8 representation of characters that have the same layout as slices of type [u8].
then I think any concerns I'd have about what "layout" means exactly would also apply there, so overall I think this guarantee makes sense.
That said, doing it just for &str was surprising to me. Why specifically & but not &mut nor *const nor *mut? Or is that leaning on some other statement that those are already necessarily the same, so doing it for & implicitly does the others?
I'll also cc https://github.com/rust-lang/rfcs/pull/3775, which I think if it lands will necessarily make this guarantee as well.
@scottmcm :
That said, doing it just for &str was surprising to me. Why specifically & but not &mut nor *const nor *mut? Or is that leaning on some other statement that those are already necessarily the same, so doing it for & implicitly does the others?
Exactly -- that mirrors what @kpreid referenced above too. Because all pointers have the same layout:
Pointers and references have the same layout. Mutability of the pointer or reference does not change the layout.
I think we don't need, therefore, to restate that in this section. I think, as @kpreid said, it would be best if we had a term like "all primitive pointers to str have the same layout as all primitive pointers to [u8]" and then a link to that subsection.
However, since we don't currently have any terminology for "primitive pointer", I think that can be saved for another RFC which could then be backported to this subsection. In the meantime, I wouldn't mind adding a link to https://doc.rust-lang.org/stable/reference/type-layout.html#r-layout.pointer.intro so readers understand that &str having the same layout as &[u8] implies that &mut, *const, and *mut str have the same layout as their [u8] counterparts.
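For readers who want to see this on their own toolchain, here is a minimal sketch that compares the sizes of all four pointer kinds to str and to [u8]. The function names are hypothetical. The "twice `usize`" check reflects current rustc behavior, which the Reference mentions only in a note and does not promise.

```rust
use std::mem::size_of;

/// Hypothetical helper: sizes of the four primitive pointer kinds to `str`.
fn pointer_sizes_to_str() -> [usize; 4] {
    [
        size_of::<&str>(),
        size_of::<&mut str>(),
        size_of::<*const str>(),
        size_of::<*mut str>(),
    ]
}

/// Hypothetical helper: sizes of the four primitive pointer kinds to `[u8]`.
fn pointer_sizes_to_bytes() -> [usize; 4] {
    [
        size_of::<&[u8]>(),
        size_of::<&mut [u8]>(),
        size_of::<*const [u8]>(),
        size_of::<*mut [u8]>(),
    ]
}

fn main() {
    // Mutability and reference-vs-raw do not change the size, and the `str`
    // pointers match the `[u8]` pointers.
    assert_eq!(pointer_sizes_to_str(), pointer_sizes_to_bytes());

    // Assumption: on current rustc a wide pointer is a data pointer plus a length.
    let wide = 2 * size_of::<usize>();
    assert!(pointer_sizes_to_str().iter().all(|&size| size == wide));
}
```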
I'll also cc https://github.com/rust-lang/rfcs/pull/3775, which I think if it lands will necessarily make this guarantee as well.
Yes, I think if we're ready to pull the trigger on that RFC, then this PR is naturally included with that. However, assuming accepting that RFC is less imminent, accepting this PR first will simplify that RFC -- it will only have to talk directly about &/&mut/*const/*mut [T] and could simply include a short note that since &/&mut/*const/*mut str has the same layout as the corresponding pointers to [u8], that RFC also applies to str. Currently, that RFC both declares that &str and &[u8] have the same layout and describes that layout -- it would be nice for it to only have to do one thing at a time.
It seems to be an obviously preexisting problem. Please open a PR against the reference to address the concerns you have.
The PR already exists, it's https://github.com/rust-lang/reference/pull/1664. Not sure how I can help move it forward.
And my point is actually part of that PR, in particular https://github.com/rust-lang/reference/pull/1664#discussion_r2004623860.
But I think I now understand "what the PR author obviously wants". It is not to provide guarantees that would make a transmute from &[u8] to &str safe. It is simply to guarantee a statement written in a safety comment of the standard library. This motivation perfectly matches the PR. I initially misinterpreted the motivation thinking it was related to unsafe code because of the link to a safety comment.
We discussed this in today's @rust-lang/lang meeting. We're happy to start an FCP to approve this:
@rfcbot merge
Please note that this change doesn't necessarily make it ABI-compatible (e.g. casting between function pointers where a parameter type changes between &str and &[u8]). That would require a separate proposal, and some careful evaluation of present and future targets (including wasm targets).
Team member @joshtriplett has proposed to merge this. The next step is review by the rest of the tagged team members:
- [x] @joshtriplett
- [ ] @nikomatsakis
- [x] @scottmcm
- [x] @tmandry
- [x] @traviscross
No concerns currently listed.
Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!
cc @rust-lang/lang-advisors: FCP proposed for lang, please feel free to register concerns. See this document for info about what commands tagged team members can give me.
@rfcbot reviewed
cc @rust-lang/opsem @RalfJung
:bell: This is now entering its final comment period, as per the review above. :bell:
But I think I now understand "what the PR author obviously wants". It is not to provide guarantees that would make a transmute from &[u8] to &str safe. It is simply to guarantee a statement written in a safety comment of the standard library.
I would like to do both. I see what you are saying, that the terminology is not very rigorously defined in the reference (and I would like to improve that), and that an arbitrary transmute is not fully justified by only the layout guarantee. But I also agree that there's no reason not to add said guarantee to the reference.
Maybe the aforementioned RFC can help us provide more of that justification.
@rfcbot reviewed
The final comment period, with a disposition to merge, as per the review above, is now complete.
As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.
This will be merged soon.