snarkVM
from_bytes_le_mod_order is the source of a lot of allocations in snarkOS
While performing memory profiling of snarkOS, I noticed that a very large share of allocations (>73% when running a client node) is caused by PrimeField::from_bytes_le_mod_order.
The method is only used in a few places, and I was able to apply the following change with just a few adjustments at the call sites:
diff --git a/fields/src/traits/prime_field.rs b/fields/src/traits/prime_field.rs
index 0bb21856d..d1b4c83c1 100644
--- a/fields/src/traits/prime_field.rs
+++ b/fields/src/traits/prime_field.rs
@@ -78,15 +78,15 @@ pub trait PrimeField:
/// Reads bytes in big-endian, and converts them to a field element.
/// If the bytes are larger than the modulus, it will reduce them.
- fn from_bytes_be_mod_order(bytes: &[u8]) -> Self {
+ fn from_bytes_be_mod_order(bytes: &mut [u8]) -> Self {
let num_modulus_bytes = ((Self::Parameters::MODULUS_BITS + 7) / 8) as usize;
let num_bytes_to_directly_convert = min(num_modulus_bytes - 1, bytes.len());
- let (leading_bytes, remaining_bytes) = bytes.split_at(num_bytes_to_directly_convert);
+ let (leading_bytes, remaining_bytes) = bytes.split_at_mut(num_bytes_to_directly_convert);
// Copy the leading big-endian bytes directly into a field element.
// The number of bytes directly converted must be less than the
// number of bytes needed to represent the modulus, as we must begin
// modular reduction once the data is of the same number of bytes as the modulus.
- let mut bytes_to_directly_convert = leading_bytes.to_vec();
+ let bytes_to_directly_convert = leading_bytes;
bytes_to_directly_convert.reverse();
// Guaranteed to not be None, as the input is less than the modulus size.
let mut res = Self::from_random_bytes(&bytes_to_directly_convert).unwrap();
@@ -103,9 +103,8 @@ pub trait PrimeField:
/// Reads bytes in little-endian, and converts them to a field element.
/// If the bytes are larger than the modulus, it will reduce them.
- fn from_bytes_le_mod_order(bytes: &[u8]) -> Self {
- let mut bytes_copy = bytes.to_vec();
- bytes_copy.reverse();
- Self::from_bytes_be_mod_order(&bytes_copy)
+ fn from_bytes_le_mod_order(bytes: &mut [u8]) -> Self {
+ bytes.reverse();
+ Self::from_bytes_be_mod_order(bytes)
}
}
While I'm not sure if this is ok API-wise, such a change would greatly reduce the number of allocations in snarkOS.
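Isolated from the trait, the core of the proposed change can be sketched as follows (the function names are illustrative, not snarkVM API):

```rust
// Illustrative sketch of the two strategies, outside the PrimeField trait.

// Current approach: copy the input so it can be reversed (one heap
// allocation per call).
fn le_to_be_copying(bytes: &[u8]) -> Vec<u8> {
    let mut copy = bytes.to_vec(); // allocates
    copy.reverse();
    copy
}

// Proposed approach: reverse the caller's buffer in place (no allocation,
// but the buffer is visibly mutated at the call site).
fn le_to_be_in_place(bytes: &mut [u8]) {
    bytes.reverse();
}

fn main() {
    let data = [0x01u8, 0x02, 0x03];
    let copied = le_to_be_copying(&data);

    let mut buf = data;
    le_to_be_in_place(&mut buf);

    // Both strategies produce the same byte order.
    assert_eq!(copied.as_slice(), &buf[..]);
    assert_eq!(buf, [0x03, 0x02, 0x01]);
}
```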
Does it make sense to keep the existing API, and use something like a SmallVec instead?
That could work, provided that bytes is guaranteed to be shorter than some N; however, this would need to be tested, since we're talking about a lot of potentially concurrent calls, and we'd need to ensure that the stack can take it.
If you can give me the N, I'd be happy to test it.
The N would be at most 48, I think.
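For context, that 48 presumably comes from the modulus-byte formula in the code: with a 377-bit modulus (as in the BLS12-377 base field, an assumption about which field is in play here), the byte length works out to 48:

```rust
// Byte length of the modulus, per the formula used in from_bytes_be_mod_order.
// The concrete bit lengths below assume BLS12-377 (377-bit base field,
// 253-bit scalar field), which is not stated explicitly in the thread.
fn num_modulus_bytes(modulus_bits: u32) -> usize {
    ((modulus_bits + 7) / 8) as usize
}

fn main() {
    assert_eq!(num_modulus_bytes(377), 48); // base field: 48 bytes
    assert_eq!(num_modulus_bytes(253), 32); // scalar field: 32 bytes
}
```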
Ok, so I ran some tests using SmallVec, and made the following observations:
- the N for from_bytes_le_mod_order is actually 64
- the N for from_bytes_be_mod_order is 31
- while these would remove the allocations, it still comes at a cost in CPU use (measured in cycles via perf record): the SmallVec version sits at 9.72%, while the current version uses 13.47%, and the &mut version only needs 2.27%
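For illustration, a dependency-free approximation of the SmallVec idea (a plain stack array instead of the smallvec crate; 64 matches the measured N, and the None branch is where a real version would fall back to the heap):

```rust
// Sketch of the stack-buffer idea: reverse into a fixed 64-byte buffer,
// avoiding the heap entirely. Returns None when the input exceeds the
// buffer (a real implementation would spill to a heap allocation there).
fn reversed_on_stack(bytes: &[u8]) -> Option<([u8; 64], usize)> {
    if bytes.len() > 64 {
        return None;
    }
    let mut buf = [0u8; 64];
    for (i, &b) in bytes.iter().enumerate() {
        buf[bytes.len() - 1 - i] = b; // write in reversed order
    }
    Some((buf, bytes.len()))
}

fn main() {
    let (buf, len) = reversed_on_stack(&[1, 2, 3]).unwrap();
    assert_eq!(&buf[..len], &[3, 2, 1]);
    assert!(reversed_on_stack(&[0u8; 65]).is_none());
}
```

The stack cost is a fixed 64 bytes per call, which is what the earlier concern about many concurrent calls boils down to.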
What about taking bytes by value (i.e. mut bytes instead of the current &bytes)? That would allow it to be mutated and there wouldn't be any concern about it accidentally being used after having been mutated at the callsite.
Thanks for the comprehensive benches! I think passing bytes by value is a good idea, but the usage of these methods varies quite a bit: sometimes the input is a slice (e.g. here), sometimes it's an array (here), and sometimes it's a Vec (here). I'm not sure of a good way to unify these.
Another strategy would be to rewrite from_bytes_le_mod_order so that it avoids allocating a vec (just to reverse it) and immediately passing a reference to that vec to from_bytes_be_mod_order. The simplest way to do this would be to add a helper function that takes its argument as &mut.
What about this? Including all the required changes in this diff, so all the adjustments are visible:
diff --git a/console/types/field/src/lib.rs b/console/types/field/src/lib.rs
index 5973b89fb..1b305aadb 100644
--- a/console/types/field/src/lib.rs
+++ b/console/types/field/src/lib.rs
@@ -57,7 +57,7 @@ impl<E: Environment> Field<E> {
/// Initializes a new field as a domain separator.
pub fn new_domain_separator(domain: &str) -> Self {
- Self::new(E::Field::from_bytes_le_mod_order(domain.as_bytes()))
+ Self::new(E::Field::from_bytes_le_mod_order(domain.to_owned().into_bytes()))
}
/// Initializes a new field from a `u8`.
diff --git a/fields/src/traits/poseidon_grain_lfsr.rs b/fields/src/traits/poseidon_grain_lfsr.rs
index 42d54d204..69d8a6111 100644
--- a/fields/src/traits/poseidon_grain_lfsr.rs
+++ b/fields/src/traits/poseidon_grain_lfsr.rs
@@ -150,7 +150,7 @@ impl PoseidonGrainLFSR {
.rev()
.collect::<Vec<u8>>();
- output.push(F::from_bytes_be_mod_order(&bytes));
+ output.push(F::from_bytes_be_mod_order(bytes));
}
Ok(output)
}
diff --git a/fields/src/traits/prime_field.rs b/fields/src/traits/prime_field.rs
index 0bb21856d..3f9838727 100644
--- a/fields/src/traits/prime_field.rs
+++ b/fields/src/traits/prime_field.rs
@@ -78,15 +78,15 @@ pub trait PrimeField:
/// Reads bytes in big-endian, and converts them to a field element.
/// If the bytes are larger than the modulus, it will reduce them.
- fn from_bytes_be_mod_order(bytes: &[u8]) -> Self {
+ fn from_bytes_be_mod_order<T: AsMut<[u8]>>(mut bytes: T) -> Self {
let num_modulus_bytes = ((Self::Parameters::MODULUS_BITS + 7) / 8) as usize;
- let num_bytes_to_directly_convert = min(num_modulus_bytes - 1, bytes.len());
- let (leading_bytes, remaining_bytes) = bytes.split_at(num_bytes_to_directly_convert);
+ let num_bytes_to_directly_convert = min(num_modulus_bytes - 1, bytes.as_mut().len());
+ let (leading_bytes, remaining_bytes) = bytes.as_mut().split_at_mut(num_bytes_to_directly_convert);
// Copy the leading big-endian bytes directly into a field element.
// The number of bytes directly converted must be less than the
// number of bytes needed to represent the modulus, as we must begin
// modular reduction once the data is of the same number of bytes as the modulus.
- let mut bytes_to_directly_convert = leading_bytes.to_vec();
+ let bytes_to_directly_convert = leading_bytes;
bytes_to_directly_convert.reverse();
// Guaranteed to not be None, as the input is less than the modulus size.
let mut res = Self::from_random_bytes(&bytes_to_directly_convert).unwrap();
@@ -103,9 +103,8 @@ pub trait PrimeField:
/// Reads bytes in little-endian, and converts them to a field element.
/// If the bytes are larger than the modulus, it will reduce them.
- fn from_bytes_le_mod_order(bytes: &[u8]) -> Self {
- let mut bytes_copy = bytes.to_vec();
- bytes_copy.reverse();
- Self::from_bytes_be_mod_order(&bytes_copy)
+ fn from_bytes_le_mod_order<T: AsMut<[u8]>>(mut bytes: T) -> Self {
+ bytes.as_mut().reverse();
+ Self::from_bytes_be_mod_order(bytes)
}
}
diff --git a/synthesizer/src/coinbase_puzzle/hash.rs b/synthesizer/src/coinbase_puzzle/hash.rs
index b6b93807a..a057737e5 100644
--- a/synthesizer/src/coinbase_puzzle/hash.rs
+++ b/synthesizer/src/coinbase_puzzle/hash.rs
@@ -34,7 +34,7 @@ pub fn hash_to_coefficients<F: PrimeField>(input: &[u8], num_coefficients: u32)
let mut input_with_counter = [0u8; 36];
input_with_counter[..32].copy_from_slice(&hash);
input_with_counter[32..].copy_from_slice(&counter.to_le_bytes());
- F::from_bytes_le_mod_order(&blake2::Blake2b512::digest(input_with_counter))
+ F::from_bytes_le_mod_order(blake2::Blake2b512::digest(input_with_counter))
})
.collect()
}
@@ -53,7 +53,7 @@ pub fn hash_commitment<E: PairingEngine>(commitment: &KZGCommitment<E>) -> Resul
ensure!(bytes.len() == 96, "Invalid commitment byte length for hashing");
// Return the hash of the commitment.
- Ok(E::Fr::from_bytes_le_mod_order(&blake2::Blake2b512::digest(&bytes)))
+ Ok(E::Fr::from_bytes_le_mod_order(blake2::Blake2b512::digest(&bytes)))
}
pub fn hash_commitments<E: PairingEngine>(
This way the API is clear about needing to consume the input, and anything that can be treated like &mut [u8] is compatible. Only Field::new_domain_separator would require an allocation, but it's still a great reduction in allocations overall.
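As a standalone illustration of why the AsMut<[u8]> bound unifies those call sites (the function name here is hypothetical), the same generic signature accepts arrays, Vecs, and mutable slices alike:

```rust
// Hypothetical stand-in for the proposed signature: anything that can be
// viewed as &mut [u8] is accepted, and the input is consumed/mutated.
fn reverse_input<T: AsMut<[u8]>>(mut bytes: T) -> T {
    bytes.as_mut().reverse();
    bytes
}

fn main() {
    // An array, by value.
    assert_eq!(reverse_input([1u8, 2, 3]), [3, 2, 1]);
    // A Vec, by value.
    assert_eq!(reverse_input(vec![1u8, 2, 3]), vec![3, 2, 1]);
    // A mutable slice reference (AsMut<[u8]> has a blanket impl for &mut T).
    let mut buf = [1u8, 2, 3];
    reverse_input(&mut buf[..]);
    assert_eq!(buf, [3, 2, 1]);
}
```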
Alternatively, from_bytes_le_mod_order can be written in a way that doesn't require any mutation whatsoever (it would need to be tested, but I basically just changed the split point and swapped the use of the slice parts, avoiding the double reversal that currently happens when going from from_bytes_le_mod_order to from_bytes_be_mod_order):
fn from_bytes_le_mod_order(bytes: &[u8]) -> Self {
    let num_modulus_bytes = ((Self::Parameters::MODULUS_BITS + 7) / 8) as usize;
    let num_bytes_to_directly_convert = min(num_modulus_bytes - 1, bytes.len());
    let (leading_bytes, remaining_bytes) = bytes.split_at(bytes.len() - num_bytes_to_directly_convert);
    // Copy the trailing (most significant) little-endian bytes directly into a field element.
    // The number of bytes directly converted must be less than the
    // number of bytes needed to represent the modulus, as we must begin
    // modular reduction once the data is of the same number of bytes as the modulus.
    let bytes_to_directly_convert = remaining_bytes;
    // Guaranteed to not be None, as the input is less than the modulus size.
    let mut res = Self::from_random_bytes(bytes_to_directly_convert).unwrap();
    // Update the result with the lower bytes, from the most significant down.
    // We go through existing field arithmetic, which handles the reduction.
    let window_size = Self::from(256u64);
    for byte in leading_bytes.iter().rev() {
        res *= window_size;
        res += Self::from(*byte);
    }
    res
}
While I was doing some more profiling, from_bytes_le_mod_order popped up again; I had forgotten that I'd already proposed the above adjustment, and quickly tried out a functionally identical one (omitting comments):
fn from_bytes_le_mod_order(bytes: &[u8]) -> Self {
    let num_modulus_bytes = ((Self::Parameters::MODULUS_BITS + 7) / 8) as usize;
    let num_bytes_to_directly_convert = bytes.len() - min(num_modulus_bytes - 1, bytes.len());
    let (remaining_bytes, leading_bytes) = bytes.split_at(num_bytes_to_directly_convert);
    let mut res = Self::from_random_bytes(leading_bytes).unwrap();
    let window_size = Self::from(256u64);
    for byte in remaining_bytes.iter().rev() {
        res *= window_size;
        res += Self::from(*byte);
    }
    res
}
and it reduced the CPU use coming from from_bytes_le_mod_order by around 1-1.5 percentage points (from ~6% to ~4.5%).
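To sanity-check the split-and-fold logic independently of the field types, here's a self-contained model using u64 arithmetic modulo a small prime in place of the field (the prime and function names are illustrative, not snarkVM code):

```rust
// Model: a 16-bit "field" modulo P, so num_modulus_bytes = 2 and at most
// one byte can be converted without reduction (mirroring the MODULUS_BITS
// arithmetic in the trait).
const P: u64 = 65_521; // largest prime below 2^16

// Reference: interpret little-endian bytes as an integer, reduced mod P.
fn naive_le_mod_order(bytes: &[u8]) -> u64 {
    bytes.iter().rev().fold(0, |acc, &b| (acc * 256 + b as u64) % P)
}

// Allocation-free variant mirroring the proposed from_bytes_le_mod_order:
// split off the most significant bytes (which fit without reduction), then
// fold the lower bytes in from most to least significant.
fn le_mod_order_no_alloc(bytes: &[u8]) -> u64 {
    let num_modulus_bytes = (16 + 7) / 8; // from MODULUS_BITS = 16
    let directly = std::cmp::min(num_modulus_bytes - 1, bytes.len());
    let (remaining, leading) = bytes.split_at(bytes.len() - directly);
    // "leading" holds the most significant LE bytes; read them directly.
    let mut res = leading.iter().rev().fold(0, |acc, &b| acc * 256 + b as u64);
    for &b in remaining.iter().rev() {
        res = (res * 256 + b as u64) % P; // field mul + add, with reduction
    }
    res
}

fn main() {
    // 0x010001 little-endian = 1 + 65536 = 65537; 65537 mod 65521 = 16.
    assert_eq!(naive_le_mod_order(&[0x01, 0x00, 0x01]), 16);
    assert_eq!(le_mod_order_no_alloc(&[0x01, 0x00, 0x01]), 16);
    // Both agree on a spread of inputs, including empty and longer ones.
    for input in [&[][..], &[0xff][..], &[0xde, 0xad, 0xbe, 0xef, 0x01][..]] {
        assert_eq!(naive_le_mod_order(input), le_mod_order_no_alloc(input));
    }
}
```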
@ljedrz is this issue still relevant? Feel free to propose PRs if there are immediate opportunities to improve memory usage (without impacting runtime performance).
I don't see it in my current memory profiles, but it's likely to pop up again under more organic workloads; in any case, the implementation hasn't changed, so I expect it to still be relevant if it's called often enough.