
zeroize: replace `atomic_fence` with `optimization_barrier`

Open newpavlov opened this issue 3 weeks ago • 35 comments

newpavlov avatar Dec 15 '25 13:12 newpavlov

I really don't understand the motivation for this or why you're making it a public function. Something like this seems like a potential replacement for the existing atomic_fence function.

Can you start with a demonstration of a problem in the existing implementation, then show what this solves?

Making it public seems like an implementation detail leaking out of the API.

tarcieri avatar Dec 15 '25 13:12 tarcieri

You can see a motivating example in the function docs. We cannot "zeroize" types like NonZeroU32, but they still may contain sensitive data. It could also work as a safer alternative to zeroize_flat_type. Another use case is a type which is defined in a different crate, does not provide direct access to its internal fields (but could be reset using, for example, Default::default()), and does not support zeroize.

I encountered a need for this function when working on the block buffer for rand_core. It explicitly does not want to depend on zeroize, but we would like to erase data from the block buffer defined in it. Right now it's done using an (IMO) ugly hack (see https://github.com/RustCrypto/stream-ciphers/pull/491).

Something like this seems like a potential replacement for the existing atomic_fence function.

Yes. I plan to do it in a separate PR.

newpavlov avatar Dec 15 '25 14:12 newpavlov

I'm not sure I understand the issue in https://github.com/RustCrypto/stream-ciphers/pull/491 or what it's even trying to do... zeroize the entire keystream it generates?

tarcieri avatar Dec 15 '25 14:12 tarcieri

The current master version of rand_core implements a BlockRng wrapper which handles buffering of generated RNG blocks. We want to zeroize the results field in chacha20 AND we do not want rand_core to depend on zeroize, so it's resolved with the Generator::drop hack. This method gets called in the Drop impl of BlockRng with a reference to the results field.

With the observe function we could write:

// in rand_core
impl<R: Generator> BlockRng<R> {
    pub fn zeroize(&mut self) {
        self.results = Default::default();
    }
}

// in chacha20
struct ChachaRng(BlockRng<ChaChaBlockRng>);

impl Drop for ChachaRng {
    fn drop(&mut self) {
        self.0.zeroize();
        zeroize::observe(self);
    }
}

newpavlov avatar Dec 15 '25 14:12 newpavlov

I'm still not sure I follow... what calls the drop method of Generator, and what's the problem with that?

Where exactly is the "hack"?

tarcieri avatar Dec 15 '25 14:12 tarcieri

How is...

impl Drop for ChachaRng {
    fn drop(&mut self) {
        self.0.zeroize();
        zeroize::observe(self);
    }
}

...any different from...

impl Drop for ChachaRng {
    fn drop(&mut self) {
        self.0.zeroize();
    }
}

tarcieri avatar Dec 15 '25 14:12 tarcieri

what calls the drop method of Generator, and what's the problem with that?

The Drop impl of BlockRng. The Generator::drop method has nothing to do with the declared functionality of the trait, i.e. generation of block RNG data. Its sole purpose is to enable the results zeroization in BlockRng without breaking its invariant (it stores the cursor position in the first word, similarly to our block-buffer).

...any different from...

Note that zeroize in the snippets is defined in the rand_core crate and does not use the zeroize crate (I should've named it differently to make it less confusing). So the compiler is allowed to remove self.0.zeroize(); in your second snippet.

newpavlov avatar Dec 15 '25 14:12 newpavlov

So the problem is that rand_core is shipping a half-assed version of zeroize internally, and you want a band-aid to make it more robust?

tarcieri avatar Dec 15 '25 14:12 tarcieri

In a certain (uncharitable) sense, yes.

But while rand_core is indeed my primary motivation, I believe that the observe function would be useful even without it.

newpavlov avatar Dec 15 '25 14:12 newpavlov

Perhaps rand_core could just implement volatile writes? I'm not sure this proposed approach is any less of a hack than what exists currently.

I do still think something like this, internally within zeroize, would be a useful replacement for atomic_fence though (where the latter has questionable value to begin with: #988)

tarcieri avatar Dec 15 '25 14:12 tarcieri

Perhaps rand_core could just implement volatile writes?

Not everyone needs RNG buffer zeroization. Introducing a crate feature for it is also not desirable. I think that zeroization should be handled on the chacha20 side.

I don't understand why you are against exposing the function. It's a simple, safe function which cannot be misused (as opposed to zeroize_flat_type) and can reduce the amount of unsafe in downstream crates while making the resulting code more efficient (e.g. by replacing inefficient scalar volatile writes with SIMD-based writes). Its only problem is that it operates in a somewhat grey zone without any guarantees from the compiler, but the same arguably applies to atomic_fence as well.
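For context, the "scalar volatile writes" being criticized here look roughly like the following. This is a sketch of the general technique, not the exact zeroize internals: each write is guaranteed to happen, but because it is volatile the compiler may not merge, reorder, or vectorize the stores.

```rust
use core::ptr;
use core::sync::atomic::{compiler_fence, Ordering};

/// Zeroes a buffer one byte at a time with volatile writes (sketch of the
/// scalar-volatile approach; real zeroize is more careful about types).
fn zeroize_bytes(buf: &mut [u8]) {
    for byte in buf.iter_mut() {
        // Safety: `byte` is a valid, aligned, exclusive reference.
        unsafe { ptr::write_volatile(byte, 0) };
    }
    // Prevent subsequent code from being reordered before the writes.
    compiler_fence(Ordering::SeqCst);
}

fn main() {
    let mut key = [0x42u8; 64];
    zeroize_bytes(&mut key);
    assert!(key.iter().all(|&b| b == 0));
}
```

An observe-style barrier would instead let the compiler emit ordinary (mergeable, SIMD-able) stores and only pin them down at the end.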

newpavlov avatar Dec 15 '25 15:12 newpavlov

I'm a bit worried about the notion that there's a user pulling in one crate which is expected to do some otherwise insecure zeroing, then separately pulling in zeroize to observe the writes the other crate is supposedly doing. Ensuring the entire operation is secure relies on implicit coupling between those two crates which the user is orchestrating, and the user has little way to tell if e.g. the zeroing operation in one crate has been removed and the observe is useless.

Perhaps for a start you could use this technique as an internal replacement for atomic_fence, and as a followup we can talk about exposing it?

tarcieri avatar Dec 15 '25 15:12 tarcieri

Ensuring the entire operation is secure relies on implicit coupling between those two crates which the user is orchestrating

Yes, it's unfortunate, but sometimes there is just no other way. Potentially unreliable erasure is better than no erasure at all. And it's much better than risking UB with zeroize_flat_type.

Also note my point about efficiency. observe-based code results in a better codegen. For example:

struct Foo {
    a: [u8; 32],
    b: [u64; 16],
    c: u64,
}

impl Drop for Foo {
    fn drop(&mut self) {
        // This impl is more efficient
        self.a = [0; 32];
        self.b = [0u64; 16];
        self.c = 0;
        zeroize::observe(self);

        // than this impl
        self.a.zeroize();
        self.b.zeroize();
        self.c.zeroize();
    }
}

See https://rust.godbolt.org/z/K7hxoePKE

Perhaps for a start you could use this technique as an internal replacement for atomic_fence, and as a followup we can talk about exposing it?

Personally, I don't see much point in it, but sure.

newpavlov avatar Dec 15 '25 15:12 newpavlov

Also note my point about efficiency. observe-based code results in a better codegen.

The existing performance problems are already noted in #743.

If you replaced atomic_fence with this style of optimization barrier (which would also address #988), we could consider getting rid of the volatile writes, at least on platforms that support asm! where we have guarantees.

tarcieri avatar Dec 15 '25 16:12 tarcieri

at least on platforms that support asm! where we have guarantees.

Well, IIUC even asm! does not provide 100% guarantees here, i.e. a "sufficiently smart" compiler may in theory analyze the asm! block and decide that it can be safely eliminated. I would love to have a written guarantee that asm! is a black box which can not be messed with by the compiler (unless something like options(pure) is explicitly provided), but IIRC there is nothing like this right now. And, as you may remember, there are even surprising interactions between asm! blocks and target features.

cc @RalfJung

newpavlov avatar Dec 15 '25 16:12 newpavlov

Well, IIUC even asm! does not provide 100% guarantees here, i.e. a "sufficiently smart" compiler may in theory analyze the asm! block and decide that it can be safely eliminated.

Do you have more information about that? I wasn't aware compilers would change codegen based on an analysis of the ASM.

tarcieri avatar Dec 15 '25 17:12 tarcieri

I did not mean that existing compilers do it, but that AFAIK it's not explicitly forbidden. For example, an advanced LTO pass could in theory apply optimizations directly to generated assembly, thus removing "useless" writes to stack frame which immediately gets released.

I tagged Ralf in the hope that he would clarify this point in case I am mistaken.

newpavlov avatar Dec 15 '25 17:12 newpavlov

The details around asm! could easily fill a book, so without spending a lot of time (that I don't have) on digging into what exactly you are trying to achieve here, I can't give any solid advice. So just some notes:

  • Here's a comment I wrote where I sketched out a systematic way to reason about asm!, unfortunately without going into examples and details.
  • The compiler is not allowed to analyze the actual instructions in the asm block. (We haven't yet gotten LLVM to enshrine this promise in their docs, but we're hoping to get there.)
  • The concept of a "compiler barrier" doesn't really exist, and there's no sound reasoning principle that I am aware of that is based on "compiler barriers". They can be a useful guide for intuition, but actually reliable reasoning needs more solid foundations. (This is related to how the actual specification of atomic fences has nothing at all to do with preventing the reordering of memory accesses; it is entirely about establishing happens-before relationships between certain memory accesses, which impose constraints on observable program behavior, which must be preserved by compilation. Arguing about the order of accesses in the final generated program is like arguing about which instruction the compiler uses to perform division -- there's no stable guarantee of any sort, other than how it impacts observable program behavior. And here, "observable" does not include memory accesses, except if they are volatile.)
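The happens-before point can be made concrete with fence-to-fence synchronization, which is what atomic fences actually specify (an illustrative example, not code from any crate involved here):

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU32, Ordering};
use std::thread;

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let t = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        fence(Ordering::Release); // pairs with the Acquire fence below
        READY.store(true, Ordering::Relaxed);
    });
    while !READY.load(Ordering::Relaxed) {}
    // The Release fence before the store and the Acquire fence after the
    // load establish happens-before, so reading 42 here is guaranteed.
    fence(Ordering::Acquire);
    assert_eq!(DATA.load(Ordering::Relaxed), 42);
    t.join().unwrap();
}
```

The fences constrain which values the loads may observe; they say nothing about keeping "dead" non-volatile writes in the generated code, which is the property zeroization actually needs.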

At the same time, the entire concept of zeroize is to try to do something that can't reliably be done in the Rust AM. From a formal opsem perspective, reliable zeroizing is impossible. (I wish it were different, but sadly that's where we are. Even if we wanted we couldn't do much about that in rustc until LLVM has proper support for this.) Best-effort zeroizing is a lot more about what compilers and optimizers actually do than about their specifications, and I'm the wrong person to ask about that.

RalfJung avatar Dec 15 '25 19:12 RalfJung

The concept of a "compiler barrier" doesn't really exist, and there's no sound reasoning principle that I am aware of that is based on "compiler barriers"

@RalfJung what term would you use for this sort of tactical defense against code elimination that, at some future date, could be subject to another round of cat-and-mouse when the compiler outsmarts it?

tarcieri avatar Dec 15 '25 20:12 tarcieri

I still think the best thing we could do with asm! is use it to actually implement an optimized memset/bzero routine, where we have guarantees it wouldn't be eliminated, unlike any Rust code we write (which would also address #743)
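A minimal sketch of what that might look like on x86_64 (hypothetical illustration of the idea, not zeroize code; a portable volatile fallback is included for other targets):

```rust
/// `asm!`-based bzero for x86_64: the stores happen inside the asm
/// block, which the compiler is not allowed to analyze or eliminate.
#[cfg(target_arch = "x86_64")]
unsafe fn asm_bzero(ptr: *mut u8, len: usize) {
    core::arch::asm!(
        "rep stosb", // store AL into [RDI], RCX times
        inout("rdi") ptr => _,
        inout("rcx") len => _,
        in("al") 0u8,
        options(nostack, preserves_flags),
    );
}

/// Portable fallback for the sketch: scalar volatile writes.
#[cfg(not(target_arch = "x86_64"))]
unsafe fn asm_bzero(ptr: *mut u8, len: usize) {
    for i in 0..len {
        core::ptr::write_volatile(ptr.add(i), 0);
    }
}

fn main() {
    let mut buf = [0xFFu8; 128];
    unsafe { asm_bzero(buf.as_mut_ptr(), buf.len()) };
    assert!(buf.iter().all(|&b| b == 0));
}
```

The trade-off discussed below is that each architecture needs its own implementation, versus a single empty observation asm! shared by all of them.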

tarcieri avatar Dec 15 '25 20:12 tarcieri

If we are to trust the compiler to not mess with asm!, then I think the empty "observation" asm! should be sufficient. Using separate asm!s for every type or one general memset-like function with separate asm!-based impls for every supported arch will be less efficient and more error-prone.

newpavlov avatar Dec 15 '25 21:12 newpavlov

what term would you use for this sort of tactical defense against code elimination that, at some future date, could be subject to another round of cat-and-mouse when the compiler outsmarts it?

The paragraph you quoted was talking about reliable / sound reasoning principles. If by "compiler barrier" you mean "a best-effort heuristic to steer the compiler away from things it shouldn't do", then I have no objections to the term (but I think that's not how many people use it). I also can't confidently give advice on that kind of a thing; you know a lot more about it than me.

RalfJung avatar Dec 15 '25 21:12 RalfJung

I think we only need one implementation of memset/bzero per architecture.

The implementation of zeroize itself is already abstracted such that there are two places to actually invoke it from: volatile_write and volatile_set (the latter being the main place that would benefit)
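Assumed shapes of those two hooks, for readers unfamiliar with the internals (hypothetical signatures for illustration; the real zeroize code may differ):

```rust
use core::ptr;

/// Write a single value with a volatile store (assumed shape of the
/// `volatile_write` hook mentioned above).
#[inline]
unsafe fn volatile_write<T: Copy>(dst: *mut T, src: T) {
    ptr::write_volatile(dst, src);
}

/// Fill `count` elements with `src` using volatile stores (assumed shape
/// of the `volatile_set` hook; this loop is what an asm!-based memset
/// would replace).
#[inline]
unsafe fn volatile_set<T: Copy>(dst: *mut T, src: T, count: usize) {
    for i in 0..count {
        volatile_write(dst.add(i), src);
    }
}

fn main() {
    let mut words = [0xDEAD_BEEFu32; 8];
    unsafe { volatile_set(words.as_mut_ptr(), 0u32, words.len()) };
    assert!(words.iter().all(|&w| w == 0));
}
```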

tarcieri avatar Dec 15 '25 21:12 tarcieri

@RalfJung Is the LTO optimization I described above technically legal? Or am I being too paranoid?

newpavlov avatar Dec 15 '25 21:12 newpavlov

If by "compiler barrier" you mean "a best-effort heuristic to steer the compiler away from things it shouldn't do", then I have no objections to the term (but I think that's not how many people use it)

I’ve heard “optimization barrier” used for this by several people, though if that sounds like it has a property this isn’t providing, perhaps something like “optimization impediment” is clearer?

tarcieri avatar Dec 15 '25 21:12 tarcieri

@tarcieri

I think we only need one implementation of memset/bzero per architecture.

I only see disadvantages when compared to the observation asm!.

We don't win anything in language guarantees, while by making it general we prevent the compiler from applying optimizations which we want, such as unrolling loops, reusing zeroed registers, or SIMD-ifying code where appropriate. And it's obviously much harder to write correctly than slapping the same empty asm! onto all arches.

newpavlov avatar Dec 15 '25 21:12 newpavlov

@RalfJung Is the LTO optimization I described above technically legal? Or am I being too paranoid?

That sounds like it may be in conflict with self-modifying code which needs the asm blocks to be unchanged. OTOH something like BOLT could do all sorts of "fun" things and we'd probably say it's fine, it's just incompatible with self-modifying code. But you're asking about things way outside of what I can confidently talk about. I suggest you bring this to the t-opsem Zulip; there are other people there that know more about these low-level things.

I’ve heard “optimization barrier” used for this by several people, though if that sounds like it has a property this isn’t providing, perhaps something like “optimization impediment” is clearer?

I always interpreted "barrier" as something that actually guarantees certain reorderings don't happen. So yeah I think a term that makes the best-effort nature more clear would be preferable.

RalfJung avatar Dec 15 '25 21:12 RalfJung

I only see disadvantages when compared to the observation asm!.

By using asm! to write zeros rather than Rust code, you get actual guarantees those writes will be performed as opposed to a "compiler impediment" which may possibly be outsmarted by a sufficiently smart future compiler, as you yourself have argued (although to be fair, the latter is an exceedingly common way of implementing this kind of primitive).

We get somewhat similar guarantees out of ptr::write_volatile at the cost of performance, so if those writes weren't removed, we'd have a somewhat nice belt-and-suspenders setup were the optimization barrier/impediment to be added.

If we remove the volatile writes to improve performance, I don't think we really have any guarantees at that point, just something future compiler versions are unlikely to see through. Maybe that's fine, as I said earlier it seems fine for many e.g. libc implementations and their memset_s-style functions.

tarcieri avatar Dec 15 '25 21:12 tarcieri

Last time I asked about LTO I was told LTO would not touch asm!

tarcieri avatar Dec 15 '25 21:12 tarcieri

may possibly be outsmarted by a sufficiently smart future compiler

My point was that it also fully applies to asm!-based implementations of memset. For the above-mentioned BOLT there may be zero difference between a mov we got from a plain write, write_volatile, or asm!. Such a tool would just see that we perform "useless" writes into memory which gets immediately released.

e.g. libc implementations and their memset_s-style functions.

IIUC they rely on dynamic linking to act as an optimization "impediment" (in other words, on keeping the functions "external" and thus out of the optimizer reach). We probably do not want to rely on it by default.

newpavlov avatar Dec 15 '25 21:12 newpavlov