password-hashes icon indicating copy to clipboard operation
password-hashes copied to clipboard

Argon2 refactor

Open Pjottos opened this issue 4 years ago • 5 comments

Practically a rewrite of the argon2 crate.

Summary of changes:

  • Parallel implementation is safer by not creating mutable aliases of self, but instead separating lanes into a mutable slice (current segment) and 2 immutable slices (before and after the segment).
  • The Instance struct has been merged with the Argon2 struct since it was basically just a copy of it.
  • The Argon2 struct has a new method, fill_memory, which omits the calculation of a hash. This is used in e.g. RandomX.
  • Methods on ParamsBuilder don't return Results anymore and take self by reference to make it more ergonomic.

Pjottos avatar Sep 30 '21 11:09 Pjottos

Looks promising at first glance. I can review in depth whenever you're finished.

tarcieri avatar Sep 30 '21 14:09 tarcieri

Yeah it's ready.

Pjottos avatar Oct 01 '21 09:10 Pjottos

Regarding the use of unsafe code and mutable aliasing, I'd suggest taking a look at this comment I left when the parallel implementation was originally added:

https://github.com/RustCrypto/password-hashes/pull/149#issuecomment-808910694

I think there's a safe strategy possible, even in a multithreaded context, and I would really like to see something like MemoryView use that. Requiring unsafe all the time, even when having differing implementations of MemoryView depending on whether the parallel feature is enabled, seems like a backstep.

My suggestion would be borrowing the Argon2 memory mutably, then mutably borrow splitting it into Argon2 "slices" (as described in the paper) using either chunks_mut/chunks_exact_mut or split_at_mut. Once it's been split into 4 mutable Argon2 "slices", 3 of the slices can be reborrowed as immutable and shared among threads. The remaining mutable slice needs to be further split into Argon2 "segments" and given to each worker thread, along with the other 3 Argon2 "slices".

tarcieri avatar Oct 02 '21 22:10 tarcieri

I had considered splitting the memory that way but the issue i ran into is that an Argon2 slice consists of a number of Rust slices, one for each lane, and I didn't want to use an allocation to keep track of the references when it could be avoided with unsafe code.

I suppose the problem lies in the memory layout since Argon2 slices divide the memory into columns and lanes divide it into rows, while it would make more sense for that to be swapped. Changing the memory layout might be the key to a safe implementation, but it would mean the memory has to be copied into the standard layout for the fill_memory method, and the block indexing logic would have to be changed.

Pjottos avatar Oct 03 '21 12:10 Pjottos

I would prefer not to have two implementations, one parallel and one sequential, which take &self and &mut self respectively and where the former parallel relies on atomics in the course of doing what is effectively pointer arithmetic.

Instead, I think it would make more sense to have a MemoryView type which knows at the level of an individual worker thread what memory is allowed to be viewed or mutated, and for those types to have their own internal bookkeeping which can be implemented with &mut self instead of atomics.

It's okay for such a type to use unsafe if necessary, but as things stand there are two different implementations and a lot of unsafe code to view. I would prefer it be kept to an absolute minimum, or completely eliminated if possible.

tarcieri avatar Oct 14 '21 21:10 tarcieri

@Pjottos saw you pushed some semi-recent changes to this branch. I looked through them and it's really looking great!

I'm fine to merge this pretty much as-is, but I had a couple notes:

  • It looks like this removes the parallel implementation. That's fine actually! It means a PRs with a proposed parallel implementation will be easier to review because they're self-contained. But perhaps can you go all the way and completely remove the parallel feature and rayon dependency and re-add them in a self-contained PR with a parallel implementation (that's something I've wanted to explore, but haven't had time).
  • I'd prefer not to add a dependency on byte-slice-cast, especially as the functionality it's providing seems trivial (<10 LOC or so I'd say).

Otherwise great job!

tarcieri avatar Aug 27 '22 21:08 tarcieri

That sounds good! I actually found what seems like a functional parallel implementation as uncommitted changes in my working directory, I could take another look at it if this gets merged and open a PR.

Pjottos avatar Aug 28 '22 02:08 Pjottos

Thank you! I'm going to submit a followup with a few small fixes

tarcieri avatar Aug 28 '22 18:08 tarcieri