encoding_rs
encoding_rs copied to clipboard
set_len on a Vec<u8> of uninit is UB
encoding_rs currently has UB in the form of creating uninitialized u8's via set_len Here are 2 examples where the UB is crystal clear:
https://github.com/hsivonen/encoding_rs/blob/dd9d99bb185f93d4fe5071291cdc54278e193955/src/mem.rs#L2007-L2010
https://github.com/hsivonen/encoding_rs/blob/dd9d99bb185f93d4fe5071291cdc54278e193955/src/mem.rs#L2044-L2047
set_len is also used in 7 functions in lib.rs, but I haven't looked at them very closely.
The docs for set_len explicitly say https://doc.rust-lang.org/std/vec/struct.Vec.html#method.set_len :
The elements at old_len..new_len must be initialized.
Some relevant discussion can be found here https://github.com/rust-lang/unsafe-code-guidelines/issues/71
rustc itself has a lint specifically for this kind of thing: https://github.com/rust-lang/rust/issues/75968
Using MaybeUninit::uninit().assume_init() is instant UB unless the target type is itself composed entirely of MaybeUninit
My understanding is this is currently considered UB, but this rule may be relaxed in the future to allow types where all bit patterns are valid to store uninitalized if they are not read from.
Hmm.. Having a look at this, it looks like functions like convert_latin1_to_utf8_partial will need to have a maybe_uninit variant added
/// # Safety
/// You may not deinitialise any T here.
unsafe fn mucast<T>(x: &mut [T]) -> &mut [MaybeUninit<T>] {
&mut *(x as *mut [T] as *mut [MaybeUninit<T>])
}
pub fn convert_latin1_to_utf8_partial(src: &[u8], dst: &mut [u8]) -> (usize, usize) {
convert_latin1_to_utf8_partial_uninit(src, mucast(dst))
}
pub fn convert_latin1_to_utf8_partial_uninit(src: &[u8], dst: &mut [MaybeUninit<u8>]) -> (usize, usize);
It would need to promise that no previously init bytes are made uninit (in order to make the cast from &mut [u8] to &mut [MaybeUninit<u8>] sound, but that seems reasonable enough as a promise to make
This would be a useful change to make to consumers, as it would let them use the mem functions to stack allocated buffers soundly, without needing to zero init them.
edit: These functions wouldn't have to be made public, of course, but they would need to exist in some form, and I think it might be useful for consumers.
Per @Gankra's tweet, &mut [u8] to unintialized integers isn't UB if only writing, citing @RalfJung. My reading is that it applies here. Do I understand correctly?
If at all possible, I'd like to avoid complicating the outward API (including the FFI part in encoding_c) with MaybeUninit.
I think there's 2 questions here.
- Is it UB to have a
&mut [T]to uninitialized T - Is it UB to call
set_lenon a Vec to cause uninitalised data to be counted as being initialised
The answer to 1 seems to be "probably not, but it's not definitely not UB?" (see: https://github.com/rust-lang/unsafe-code-guidelines/issues/77)
2 seems to be yes, going off the safety requirement on Vec::set_len.
Thought that requirement could be closely tied to the answer to 1, so Vec could weaken that safety requirement. It would still be UB to observe an uninit u8 by value, but we don't seem to be doing that here (because miri passes).
If 1. is not UB, then I don't think the public API would need to change at all, you'd just take a &mut [u8] and then convert it into a &mut [MaybeUninit<u8>] internally if it's an API that allows uninit bytes to be passed. (And doing that conversion isn't mandatory, but it means you work with a type that's a lie only as much as you have to.
Is it UB to have a &mut [T] to uninitialized T
This is currently deep in no-mans-land. The reference clearly makes it UB, as does the documentation of MaybeUninit::assume_init_mut. However I think it should be allowed if T is inhabited, and in particular should be allowed for u8, and MiniRust accordingly allows it. But that is not necessarily the consensus in UCG or T-Lang, so the final rules might end up making this UB.
So what Gankra writes is correct, in the sense that I think we should change the rules to allow this. But the current rules clearly don't allow it and I don't decide these rules by myself.
I think there's 2 questions here.
- Is it UB to have a
&mut [T]to uninitialized T- Is it UB to call
set_lenon a Vec to cause uninitalised data to be counted as being initialised
...
2 seems to be yes, going off the safety requirement on
Vec::set_len.Thought that requirement could be closely tied to the answer to 1, so Vec could weaken that safety requirement.
AFAICT, the UB concern related to 2 is precisely about 1, and, therefore, if 1 isn't UB when T is a primitive integer, it seems to me that encoding_rs's use of set_len isn't UB, either. (Why would set_len even be exposed as a public API if it didn't have precisely the use case that encoding_rs uses it for as non-UB according to the 2015 understanding of UB?)
(Why would
set_leneven be exposed as a public API if it didn't have precisely the use case that encoding_rs uses it for as non-UB according to the 2015 understanding of UB?)
Well, I guess the example given is distinct, since it has C code doing the writing to the uninitialized u8s.
The safety comment of set_len currently reads
The elements at old_len..new_len must be initialized.
That is a library API comment. I do not know when it was added (it has not always been there), but what it means seems fairly clear.
In terms of language UB, it would be possible to allow set_len to be used before elements are initialized, but then the question is how does one go about initializing them? vec[idx] = val creates an &mut T to the uninit element so that would be potentially problematic. vec.as_mut_ptr().add(idx).write(val) would work though. But it is up to t-libs-api to decide if they want to bless this pattern or not.
how does one go about initializing them?
spare_capacity_mut look perfect for this use case. While I agree nothing really bad should happen I don't see why user shouldn't assume it's UB, cause if code change or whatever, one day you do it for a Dropable item and you forget to do the set_len properly. I don't see any situation where you can't do the init before the set_len.