stream-ciphers icon indicating copy to clipboard operation
stream-ciphers copied to clipboard

Returning rand_core feature for chacha20 for use in rand_chacha

Open nstilt1 opened this issue 2 years ago • 48 comments

I borrowed some code from version 0.8.1 of chacha20, as well as rand_chacha and was able to get it to compile and pass tests. The main issues in my code that I identified were:

  • usage of [u8; 12] instead of u128 for set/get_stream(). It used to have a u64 when it had a 64 bit counter
    • This has been fixed, set_stream() takes the lower 96 bits of a u128 input. It now has support for using a [u8; 12] with set/get_stream_bytes()
  • usage of ParBlock<LesserBlock> as a temporary output buffer prior to copying it into results in fn generate(). It will probably leave a copy of itself in memory after use. It might be possible to change the BlockRngResults type to ParBlock<LesserBlock>
    • This has been fixed, although it uses a little bit of "unsafe" code to treat the 2D array as a 1D array in
      • AsRef<[u32]> for BlockRngResults and
      • AsMut<[u32]> for BlockRngResults
  • when calling set_word_pos() with an input that wasn't divisible by 16, get_word_pos() would round the input down to the nearest multiple of 16.
    • This has been fixed. It looks like I missed something from rand_chacha.

I've also added ZeroizeOnDrop for the BlockRngResults.

https://github.com/rust-random/rand/issues/934

nstilt1 avatar Oct 25 '23 18:10 nstilt1

Alright, I think I've added just about all I can think of. The #[cfg_attr... in lib.rs and some impls in rng.rs are there because rand/src/std.rs has a

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct StdRng(ChaCha12Rng);

There are also some functions in rng.rs that don't use #[inline]. I'm not sure if they need it or not. Do you have any questions, suggestions, or feedback? Thanks for your time.

nstilt1 avatar Oct 26 '23 20:10 nstilt1

I'd suggest less unsafe code and more of a KISS implementation in an initial PR, and then circling back on some of it in a followup so it can be considered separately.

tarcieri avatar Oct 27 '23 14:10 tarcieri

Alright. Switching the safety on

nstilt1 avatar Oct 27 '23 21:10 nstilt1

I was able to use generics to ensure that the method inputs impld Into<{WrapperName}>, and then called .into() internally. Now the inputs should zeroize on drop when the zeroize feature is enabled, and end users will not need to use .into() on the input parameters. I was also able to remove all of the *_bytes() methods I had implemented and added impls for From<[u8; x]>.

The only final issue I ran into was for a way to convert a [u8; 12] to a [u8; 16] (or a u128 to a [u8; 12]) without producing an un-zeroizeable copy, so I added an impl for From<&mut [u8; 12]>. It could alternatively be done with an additional wrapper for [u8; 12]. Realistically, this impl is unnecessary, but I was using the set_stream_bytes() method in a test and it was somewhat more convenient for the test to use a [u8; 12]. The From<&mut [u8; 12]> is both unncecessary and less efficient and could be removed if you want.

Are there any additional changes that need to be made?

nstilt1 avatar Nov 03 '23 23:11 nstilt1

I think that is all I want to change. There is some stuff in rng.rs, particularly the Zeroizing structs that look a little bit... polluting. I could move them into a different file if needed, or it might be possible to use a generic struct or two. I probably won't be making any more changes until I receive feedback... although, I might add a test to make sure that the inputs actually get zeroized on drop. I'm also not sure if the methods with alternate inputs really need a comment about the performance—the difference is probably minimal, and it won't even affect the rate of the output.

Thanks for your time, for this wickedly fast ChaCha20 cipher implementation, as well as the other rand and c2-chacha authors and contributors.

nstilt1 avatar Nov 05 '23 02:11 nstilt1

I was able to fix the zeroize test, but the user would need to pass an &mut [u8; 32] (or worse, (&mut [u8; 32]).into()) for the original to be zeroized. It appears that we can only zeroize the copies we receive :/, but I feel like the way it is called now (which is the way it was before) is acceptable. Eg:

let mut seed: [u8; 32] = /* arbitrary value */;
let mut rng = ChaCha20Rng::from_seed(seed);
seed.zeroize();

Some questions that may need consideration:

  1. Do we want to or need to re-export rand_core in lib.rs?
  2. Previously, this crate was working in a way where rand_core was the only dependency for the random number generator (https://github.com/rust-random/rand/issues/872#issuecomment-575242181). Do we still want to reduce the dependencies for the RNG and/or bypass the write_keystream_blocks() method used in fn generate(), since the underlying code is almost entirely within chacha20?
  3. How does the code need to be tidied up? Should I move the input parameter wrappers and their impls to another file, or is it acceptable how it is?
  4. Is there any unnecessary code? Perhaps the AlteredState trait, or the from_seed method that I modified from new() in lib.rs, or some of the zeroizing code?
  5. Is there a better way to impl ZeroizeOnDrop for inputs rather than with wrappers and duplicated spaghetti code to determine if zeroize is enabled? It would be somewhat convenient if it could be done with something like this, within the method
impl SeedableRng for $ChaChaXRng {
    type Seed = [u8; 32];

    #[inline]
    fn from_seed(seed: Self::Seed) -> Self {
        #[cfg(feature = "zeroize")]
        seed.zeroize_on_drop();
        let core = $ChaChaXCore::from_seed(seed);
        Self {
            rng: BlockRng::new(core),
        }
    }
}

It would just be a little neat if it could be done with a method call that made no wrappers, no copies, didn't change the value's type, and didn't require the variable to be mutable. It seems kind of impossible without some kind of change to Rust though. Might be easier to just make SeedableRng allow a Seed of type &[u8; 32] rather than doing a bunch of under-the-hood zeroizing to make sure that it zeroizes.

nstilt1 avatar Nov 07 '23 02:11 nstilt1

I'll go ahead and supply the rationale for AlteredState. If we were to use any unsafe code to interpret [u8] as a [u32], it could mostly (if not completely; depends on the implementation) be contained within AlteredState. That is pretty much the only benefit, and it honestly doesn't have much of a purpose when there is only safe code. Since the methods of AlteredState are only used once and there is only safe code, it wouldn't hurt to just copy over the methods to where they are called.

nstilt1 avatar Nov 09 '23 03:11 nstilt1

@nstilt1 you seem to have checked in a .DS_Store file.

(You can configure an excludesfile in ~/.gitconfig to avoid this in the future)

tarcieri avatar Nov 11 '23 20:11 tarcieri

255b481 seems to have added back .DS_Store

tarcieri avatar Nov 12 '23 22:11 tarcieri

Now that we have the fill_bytes method at our fingertips, I will look at .apply_keystream() and see if I can reduce some of the overhead that fill_bytes has.

I am working on it in a separate branch. Will write some tests using the output of the current implementation.

nstilt1 avatar Nov 13 '23 05:11 nstilt1

Regarding the last commit, I noticed an "inconsistency" with the index being used by set_word_pos() being 4 bits when the full index is 6 bits. I attempted to "fix" that, but then I realized that it was being done that way for alignment purposes. Will try to align it again with minimal changes.

nstilt1 avatar Nov 19 '23 22:11 nstilt1

Something I probably should've mentioned much earlier, but it would be helpful if the implementation didn't use the cipher API so that cipher could be made an optional dependency (again, as it was in chacha20 v0.8)

That would greatly reduce the number of crates needed for the rand_core/rand_chacha-replacement use case.

tarcieri avatar Nov 19 '23 22:11 tarcieri

Something I probably should've mentioned much earlier, but it would be helpful if the implementation didn't use the cipher API so that cipher could be made an optional dependency (again, as it was in chacha20 v0.8)

That would greatly reduce the number of crates needed for the rand_core/rand_chacha-replacement use case.

I can give that a try, and I'll take a gander at v0.8. I'll let you know if I run into any problems.

nstilt1 avatar Nov 21 '23 03:11 nstilt1

@tarcieri Do you want the code for the ChaCha cipher to be pretty much exactly the same as it was in v0.8.1, including the code in src/backend/ and src/backend.rs? I just don't want to un-fix anything that may have been fixed in the current version, and if you don't want it completely reverted, would we still need to add src/backends/autodetect.rs and to have the ChaCha Core struct in there? Alternatively, the process_with_backend() method might be able to be converted to accept inputs that don't depend on cipher, or possibly with no inputs and it could just be a generate method that calls gen_par_ks_blocks() where possible, and it might not be desirable if there are use-cases where people wouldn't want it to generate the blocks in parallel. Also, as for the 32 vs 64 bit counter, it would be easier to have that functionality by reverting instead of reworking process_with_backend() since the previous version already supports it.

In the meantime, I will be transforming process_with_backend() to a method called generate() that overwrites the 256-byte ChaChaCore buffer using the backend, then moving some things around based on v0.8.1 to separate the dependent code. Let me know if there is something else you would prefer.

Here's a link to v0.8.1 in case you don't have the link handy: https://github.com/RustCrypto/stream-ciphers/blob/338c078d731692fba3b8256e45de2c3e334d46d8/chacha20/src/backend/autodetect.rs

nstilt1 avatar Nov 23 '23 20:11 nstilt1

Okay... so I was able to get the RNG to work using generate(), and I have 2 different versions in different branches. The optional_cipher branch fills a [u32; 64], and the unsafe_fill_bytes branch passes a pointer to generate() so it can directly write the results to the input slices of fill_bytes(). Both of these have about the same performance at about 0.994-1.001 cpb, so either one could be used for the RNG. The primary benefit of the pointer is that it works on a [u8; 256] as well as a [u32; 64] and it works on a &[u8]. And on paper, it should be faster for fill_bytes()... but it doesn't appear to be.

The drawback of both of these generate methods is that without the process_with_backend() method, all of the StreamCipher, StreamCipherCoreWrapper, StreamCipherSeek, etc methods would need to be reimplemented using a macro and the original performance of the cipher on avx2 was 0.93-0.96 cpb, while the current performance is about 1.01-1.2 cpb.

I think a valid solution for the cipher's performance would be to conditionally compile the process_with_backend() and generate() methods, along with the backend methods it uses so that it will hopefully be the same as before. Will give it a try, but it might look a bit messier than just having one method for both implementations to generate values. Will attempt to use macros so that each method is only written once in src/backends/

Edit: It does not seem to be possible to write a macro that can make a function with varying arguments, as well as with a varying block of code in the function. Will just conditionally compile the inner function and their associated methods for now.

nstilt1 avatar Dec 02 '23 17:12 nstilt1

Besides the currently failing checks, the main issues remaining are:

  • 64-bit counter for Legacy, ~and maybe XChaCha since it uses an 8-byte Nonce internally~
  • ~Potential issue for 32-bit machines: StreamCipherCore's remaining_blocks() method returns an Option<usize>. usize is 4 bytes on 32-bit machines and with a 64-bit counter, this method could fail to produce an accurate value, or it could throw an error, or maybe it just won't compile. I don't knowingly use a 32-bit computer, so I don't know exactly what would happen.~
    • stream_core.rs simply states to return None when the value is too large to fit in a usize
  • The organization of lib.rs regarding the cipher-dependent code. I think generate and process_with_backend should probably remain in lib.rs due to the use of the cpuid types. I'm not entirely sure how to use those outside of lib.rs where they seem to be defined. The Rounds structs/trait could also be moved into its own file like it was previously.

Something that could be done regarding alignment on 64-bit machines:

  • ~it probably shouldn't be too complicated to have a [u32; 64] buffer for 32-bit machines and a [u64; 32] buffer for 64-bit machines for the RNG. I don't know if that would help with performance any, especially if we were to use a pointer for generate(). But I feel like this might require minimal changes. I can take a look once all of the aforementioned issues are addressed or accepted.~
    • I thought that I saw somebody mention something like this in one of the issues that I read, but I could not find the original comment. It probably is unnecessary given that the current performance is decent, and the only way for a minimal solution to work would be if pointers were used, which would kind of make the type of buffer used unimportant since fill_bytes() skips the buffer for most of the fill.

~Something that could be interesting (although not necessary), would be a way for users to define their own ChaCha variants using exported macros, such as ChaCha22Rng, or perhaps using combinations with Legacy and IETF in the macro call. It wouldn't take much, although there would need to be a unique name input for the Rounds struct when it is defined since it does not appear to be possible to concatenate names for structs in a macro. We could also gate the macros behind a feature.~

nstilt1 avatar Dec 06 '23 03:12 nstilt1

XChaCha since it uses an 8-byte Nonce internally

XChaCha is based on the IETF construction internally, unless you're referring to a different XChaCha.

Something that could be interesting (although not necessary), would be a way for users to define their own ChaCha variants using exported macros, such as ChaCha22Rng

There are already far too many variants of ChaCha. We don't need any more.

ChaCha20 already used too much diffusion and is needlessly slow because of it. It would've been better standardized as ChaCha12. We certainly don't need any variants that run slower while providing no additional practical security value.

tarcieri avatar Dec 06 '23 14:12 tarcieri

XChaCha since it uses an 8-byte Nonce internally

XChaCha is based on the IETF construction internally, unless you're referring to a different XChaCha.

In variants.rs, I have a Variant trait that specifies the starting index of the nonce that is fed into ChaChaCore<R,V>::state: [u32; 16]. Currently, the starting index of the nonce is 14 for XChaChaVariant, while Ietf uses 13 as the start index. XChaCha passes the original tests with the start index at 14, fails tests at index 13, and the Ietf variant passes tests using 13 as the start index. There could be something wrong with the tests, and I also noticed that one of the test vector links for XChaCha returns a 404, and it is not indexed by the Wayback Machine. Some further evidence involves the XChaCha initialization code which uses XChaChaNonce[16..] of a 24-byte array for the nonce that it passes to ChaChaCore<R,V>.

https://github.com/RustCrypto/stream-ciphers/blob/3a53d4172a4aa6cf05cfbe455ea851d8072f05d9/chacha20/tests/mod.rs#L97C9-L97C87 The direct link in question: https://tools.ietf.org/id/draft-arciszewski-xchacha-03.html#rfc.appendix.A.3.2

I just think that if XChaCha were to use the IETF construction, it would pass a 12-byte nonce to ChaChaCore<R,V>::state[13..16] instead of an 8-byte nonce to ...[14..16]. And this isn't exactly a problem, but it is a minor discrepancy.

I will double check this other link for the paper, but I briefly looked at it and it does specify that it uses the IETF variant with a 96-bit nonce rather than the 64-bit nonce. It should have some test vectors to compare with. https://datatracker.ietf.org/doc/html/draft-arciszewski-xchacha-03

nstilt1 avatar Dec 06 '23 15:12 nstilt1

Here's a working link:

https://datatracker.ietf.org/doc/html/draft-arciszewski-xchacha-03

Note that we're building upon uses the IETF's ChaCha20 (96-bit nonce), not Bernstein's ChaCha20 (64-bit nonce).

tarcieri avatar Dec 06 '23 15:12 tarcieri

I found the documentation that causes this. I'm sure you were already aware of this lol, but I'm not entirely sure why it doesn't use the full 12 bytes by extending the XChaCha nonce to 28 bytes (or something). Perhaps it came from XSalsa, and maybe to be compatible with a 64-bit counter... but I don't know why it uses a "96-bit nonce" if 32 of those bits are 0s. I don't know. It seems to be fairly secure the way it is, but those bits could be prime bits of real estate for extending the nonce just a little bit more.

[2](https://datatracker.ietf.org/doc/html/draft-arciszewski-xchacha-03#section-2).  AEAD_XChaCha20_Poly1305

   XChaCha20-Poly1305 is a variant of the ChaCha20-Poly1305 AEAD
   construction as defined in [[RFC7539](https://datatracker.ietf.org/doc/html/rfc7539)] that uses a 192-bit nonce
   instead of a 96-bit nonce.

   The algorithm for XChaCha20-Poly1305 is as follows:

   1.  Calculate a subkey from the first 16 bytes of the nonce and the
       key, using HChaCha20 ([Section 2.2](https://datatracker.ietf.org/doc/html/draft-arciszewski-xchacha-03#section-2.2)).

   2.  Use the subkey and remaining 8 bytes of the nonce (prefixed with
       4 NUL bytes) with AEAD_CHACHA20_POLY1305 from [[RFC7539](https://datatracker.ietf.org/doc/html/rfc7539)] as
       normal.  The definition for XChaCha20 is given in [Section 2.3](https://datatracker.ietf.org/doc/html/draft-arciszewski-xchacha-03#section-2.3).

I'll see about getting the legacy counter back up to 64 bits, and I'll make a pull request for cipher afterwards to change the remaining_blocks() return type to an Option<u64> if that is acceptable. I don't know if any stream ciphers could have more than 2^64 remaining blocks, but that is the current 64-bit limit of usize. There are 131 reverse dependencies for cipher, so if changing that return type isn't acceptable, it might also be possible to conditionally compile the counter type based on 32-bit or 64-bit target_pointer_width. Having a 32-bit counter on 32-bit systems would still behave differently though, which would probably be unacceptable if we wanted to offer consistency/compatibility across different platforms.

nstilt1 avatar Dec 07 '23 00:12 nstilt1

Regarding the last commit... I wasn't trying to be salty or anything... I'm a little OCD and my code was technically incorrect. I don't mind which version of XChaChaVariant that the library uses, but if my own code (with the nonce start index at 14) confused me, it could confuse the next new contributor. I'm okay with either version, but I suppose a comment could have sufficed.

Also, I recant the 32-bit usize issue. Didn't see the comment in stream_core.rs about it.

I've begun working on the counter in my counter_support branch, and I'm having some issues. I don't have much experience with SIMD, but I have absolutely 0 experience with NEON and I don't see any tooltips in VS Code for it. If I'm still stuck on this after another week or so, then I might need some help. Or if the 64-bit counter isn't a priority, that is okay with me as well.

nstilt1 avatar Dec 11 '23 03:12 nstilt1

It would probably be good to split the changes to return the rand_core feature and the changes to support a 64-bit counter into separate PRs.

(I should also note the 64-bit counter is something of a niche use case where it would be nice to avoid undue complexity elsewhere)

tarcieri avatar Dec 14 '23 00:12 tarcieri

Okay. Some of the checks that are failing seem to have started failing after I split up the library into the various features, and I don't know what to do about them, or if the current features are the ones that you want present. I am unsure how to proceed with those failing checks, particularly cargo build --target thumbv7em-none-eabi

Also, I've still got to replace the impl_zeroize_on_drop and impl_zeroize_from macros from rng.rs since they are only used once.

nstilt1 avatar Dec 14 '23 01:12 nstilt1

Looks like a doctest failure. Surprised minimal-versions is the only thing catching it

tarcieri avatar Dec 14 '23 02:12 tarcieri

That last "fix" got it to pass on my end, but should those docs just go in chacha.rs? I am unfamiliar with the inner workings of rust docs.

nstilt1 avatar Dec 14 '23 02:12 nstilt1

3f82000 is fine though we usually use a slightly different pattern for that: https://github.com/RustCrypto/formats/blob/961e756/pem-rfc7468/src/lib.rs#L23-L24

tarcieri avatar Dec 14 '23 02:12 tarcieri

I believe it is mostly ready. Some possible things that are missing:

  • ChaCha20LegacyCore
    • could be added with pub type ChaCha20LegacyCore = ChaChaCore<R20, Legacy>;
  • XChaChaCore
    • It might need a wrapper, or we could make XChaCha8Core, XChaCha12Core, XChaCha20Core. Or we could expose the Rounds structs as well as the variants.rs structs
  • Benches
    • Will be adding some benches, and I'll try to organize them a little bit better. I noticed that criterion-cycles-per-byte is supposed to work on aarch64 with Linux, so I can boot up my Raspberry Pi 4 and try to bench it later today. Since the benchmarks will be different on aarch64-linux and aarch64-macOS, I'll see about putting them in
benches/
  bench_results/
    aarch64-linux/
    aarch64-macOS/
    x86_64-linux/

Let me know if any other changes need to be made.

nstilt1 avatar Dec 14 '23 19:12 nstilt1

That should be all of the instructions for getting cpb counts on a Raspberry Pi 4b.

I still don't know why fill_bytes() is slower than apply_keystream() on x86_64. It uses basically the same underlying code, where apply_keystream() does some extra steps. I've tried using a [u32; 64] instead of *mut u8, but the performance was roughly the same at about 0.99 cpb. I also benched its branch again just now to be sure, and it currently runs at 1.05 cpb.

nstilt1 avatar Dec 15 '23 03:12 nstilt1

I've just removed the duplicated gen_par_ks_blocks() in avx2.rs/neon.rs/sse2.rs so that those methods now just pass a pointer to the rng's pointer version. Somehow, it was a hair faster on macOS. I figure it would be easier to maintain if there was as little duplicated code as possible. The only remaining duplicated code in those files would be the inner and rng_inner methods. But they seem small enough to be okay.

nstilt1 avatar Dec 15 '23 17:12 nstilt1

Classic big endian vs little endian error :(

Not sure how the rng tests passed, unless they need to be in chacha20/tests

nstilt1 avatar Dec 16 '23 04:12 nstilt1