snarkVM
from_bytes_le_mod_order is the source of a lot of allocations in snarkOS
While performing memory profiling of snarkOS, I noticed that a very large share of allocations (>73% when running a client node) is caused by PrimeField::from_bytes_le_mod_order.
The method is only used in a few places, and I was able to apply the following change with just a few adjustments at the call sites:
diff --git a/fields/src/traits/prime_field.rs b/fields/src/traits/prime_field.rs
index 0bb21856d..d1b4c83c1 100644
--- a/fields/src/traits/prime_field.rs
+++ b/fields/src/traits/prime_field.rs
@@ -78,15 +78,15 @@ pub trait PrimeField:
/// Reads bytes in big-endian, and converts them to a field element.
/// If the bytes are larger than the modulus, it will reduce them.
- fn from_bytes_be_mod_order(bytes: &[u8]) -> Self {
+ fn from_bytes_be_mod_order(bytes: &mut [u8]) -> Self {
let num_modulus_bytes = ((Self::Parameters::MODULUS_BITS + 7) / 8) as usize;
let num_bytes_to_directly_convert = min(num_modulus_bytes - 1, bytes.len());
- let (leading_bytes, remaining_bytes) = bytes.split_at(num_bytes_to_directly_convert);
+ let (leading_bytes, remaining_bytes) = bytes.split_at_mut(num_bytes_to_directly_convert);
// Copy the leading big-endian bytes directly into a field element.
// The number of bytes directly converted must be less than the
// number of bytes needed to represent the modulus, as we must begin
// modular reduction once the data is of the same number of bytes as the modulus.
- let mut bytes_to_directly_convert = leading_bytes.to_vec();
+ let bytes_to_directly_convert = leading_bytes;
bytes_to_directly_convert.reverse();
// Guaranteed to not be None, as the input is less than the modulus size.
let mut res = Self::from_random_bytes(&bytes_to_directly_convert).unwrap();
@@ -103,9 +103,8 @@ pub trait PrimeField:
/// Reads bytes in little-endian, and converts them to a field element.
/// If the bytes are larger than the modulus, it will reduce them.
- fn from_bytes_le_mod_order(bytes: &[u8]) -> Self {
- let mut bytes_copy = bytes.to_vec();
- bytes_copy.reverse();
- Self::from_bytes_be_mod_order(&bytes_copy)
+ fn from_bytes_le_mod_order(bytes: &mut [u8]) -> Self {
+ bytes.reverse();
+ Self::from_bytes_be_mod_order(bytes)
}
}
While I'm not sure if this is ok API-wise, such a change would greatly reduce the number of allocations in snarkOS.
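Isolated from the trait, the core of the proposed change can be sketched as follows (the function names are illustrative, not snarkVM API):

```rust
// Illustrative sketch of the two strategies, outside the PrimeField trait.

// Current approach: copy the input so it can be reversed (one heap
// allocation per call).
fn le_to_be_copying(bytes: &[u8]) -> Vec<u8> {
    let mut copy = bytes.to_vec(); // allocates
    copy.reverse();
    copy
}

// Proposed approach: reverse the caller's buffer in place (no allocation,
// but the buffer is visibly mutated at the call site).
fn le_to_be_in_place(bytes: &mut [u8]) {
    bytes.reverse();
}

fn main() {
    let data = [0x01u8, 0x02, 0x03];
    let copied = le_to_be_copying(&data);

    let mut buf = data;
    le_to_be_in_place(&mut buf);

    // Both strategies produce the same byte order.
    assert_eq!(copied.as_slice(), &buf[..]);
    assert_eq!(buf, [0x03, 0x02, 0x01]);
}
```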
Does it make sense to keep the existing API, and use something like a SmallVec instead?
That could work, provided that bytes is guaranteed to be shorter than some N; however, this would need to be tested, since we're talking about a lot of potentially concurrent calls, and we'd need to ensure that the stack can take it.
If you can give me the N, I'd be happy to test it.
The N would be at most 48, I think.
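For context, that 48 presumably comes from the modulus-byte formula in the code: with a 377-bit modulus (as in the BLS12-377 base field, an assumption about which field is in play here), the byte length works out to 48:

```rust
// Byte length of the modulus, per the formula used in from_bytes_be_mod_order.
// The concrete bit lengths below assume BLS12-377 (377-bit base field,
// 253-bit scalar field), which is not stated explicitly in the thread.
fn num_modulus_bytes(modulus_bits: u32) -> usize {
    ((modulus_bits + 7) / 8) as usize
}

fn main() {
    assert_eq!(num_modulus_bytes(377), 48); // base field: 48 bytes
    assert_eq!(num_modulus_bytes(253), 32); // scalar field: 32 bytes
}
```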
Ok, so I ran some tests using SmallVec, and made the following observations:
- the N for from_bytes_le_mod_order is actually 64
- the N for from_bytes_be_mod_order is 31
- while these would remove the allocations, it still comes at a cost in CPU use (measured in cycles via perf record): the SmallVec version sits at 9.72%, while the current version uses 13.47%, and the &mut version only needs 2.27%
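For illustration, a dependency-free approximation of the SmallVec idea (a plain stack array instead of the smallvec crate; 64 matches the measured N, and the None branch is where a real version would fall back to the heap):

```rust
// Sketch of the stack-buffer idea: reverse into a fixed 64-byte buffer,
// avoiding the heap entirely. Returns None when the input exceeds the
// buffer (a real implementation would spill to a heap allocation there).
fn reversed_on_stack(bytes: &[u8]) -> Option<([u8; 64], usize)> {
    if bytes.len() > 64 {
        return None;
    }
    let mut buf = [0u8; 64];
    for (i, &b) in bytes.iter().enumerate() {
        buf[bytes.len() - 1 - i] = b; // write in reversed order
    }
    Some((buf, bytes.len()))
}

fn main() {
    let (buf, len) = reversed_on_stack(&[1, 2, 3]).unwrap();
    assert_eq!(&buf[..len], &[3, 2, 1]);
    assert!(reversed_on_stack(&[0u8; 65]).is_none());
}
```

The stack cost is a fixed 64 bytes per call, which is what the earlier concern about many concurrent calls boils down to.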
What about taking bytes by value (i.e. mut bytes instead of the current &bytes)? That would allow it to be mutated and there wouldn't be any concern about it accidentally being used after having been mutated at the callsite.
Thanks for the comprehensive benches! I think passing bytes by value is a good idea, but the usage of these methods varies quite a bit: sometimes the input is a slice (e.g. here), sometimes it's an array (here), and sometimes it's a Vec (here). I'm not sure of a good way to unify these.
Another strategy would be to rewrite from_bytes_le_mod_order so that it avoids allocating a vec (just to reverse it) and immediately passing a reference to that vec to from_bytes_be_mod_order. The simplest way to do this would be to add a helper function that takes its argument as &mut.
What about this? Including all the required changes in this diff, so all the adjustments are visible:
diff --git a/console/types/field/src/lib.rs b/console/types/field/src/lib.rs
index 5973b89fb..1b305aadb 100644
--- a/console/types/field/src/lib.rs
+++ b/console/types/field/src/lib.rs
@@ -57,7 +57,7 @@ impl<E: Environment> Field<E> {
/// Initializes a new field as a domain separator.
pub fn new_domain_separator(domain: &str) -> Self {
- Self::new(E::Field::from_bytes_le_mod_order(domain.as_bytes()))
+ Self::new(E::Field::from_bytes_le_mod_order(domain.to_owned().into_bytes()))
}
/// Initializes a new field from a `u8`.
diff --git a/fields/src/traits/poseidon_grain_lfsr.rs b/fields/src/traits/poseidon_grain_lfsr.rs
index 42d54d204..69d8a6111 100644
--- a/fields/src/traits/poseidon_grain_lfsr.rs
+++ b/fields/src/traits/poseidon_grain_lfsr.rs
@@ -150,7 +150,7 @@ impl PoseidonGrainLFSR {
.rev()
.collect::<Vec<u8>>();
- output.push(F::from_bytes_be_mod_order(&bytes));
+ output.push(F::from_bytes_be_mod_order(bytes));
}
Ok(output)
}
diff --git a/fields/src/traits/prime_field.rs b/fields/src/traits/prime_field.rs
index 0bb21856d..3f9838727 100644
--- a/fields/src/traits/prime_field.rs
+++ b/fields/src/traits/prime_field.rs
@@ -78,15 +78,15 @@ pub trait PrimeField:
/// Reads bytes in big-endian, and converts them to a field element.
/// If the bytes are larger than the modulus, it will reduce them.
- fn from_bytes_be_mod_order(bytes: &[u8]) -> Self {
+ fn from_bytes_be_mod_order<T: AsMut<[u8]>>(mut bytes: T) -> Self {
let num_modulus_bytes = ((Self::Parameters::MODULUS_BITS + 7) / 8) as usize;
- let num_bytes_to_directly_convert = min(num_modulus_bytes - 1, bytes.len());
- let (leading_bytes, remaining_bytes) = bytes.split_at(num_bytes_to_directly_convert);
+ let num_bytes_to_directly_convert = min(num_modulus_bytes - 1, bytes.as_mut().len());
+ let (leading_bytes, remaining_bytes) = bytes.as_mut().split_at_mut(num_bytes_to_directly_convert);
// Copy the leading big-endian bytes directly into a field element.
// The number of bytes directly converted must be less than the
// number of bytes needed to represent the modulus, as we must begin
// modular reduction once the data is of the same number of bytes as the modulus.
- let mut bytes_to_directly_convert = leading_bytes.to_vec();
+ let bytes_to_directly_convert = leading_bytes;
bytes_to_directly_convert.reverse();
// Guaranteed to not be None, as the input is less than the modulus size.
let mut res = Self::from_random_bytes(&bytes_to_directly_convert).unwrap();
@@ -103,9 +103,8 @@ pub trait PrimeField:
/// Reads bytes in little-endian, and converts them to a field element.
/// If the bytes are larger than the modulus, it will reduce them.
- fn from_bytes_le_mod_order(bytes: &[u8]) -> Self {
- let mut bytes_copy = bytes.to_vec();
- bytes_copy.reverse();
- Self::from_bytes_be_mod_order(&bytes_copy)
+ fn from_bytes_le_mod_order<T: AsMut<[u8]>>(mut bytes: T) -> Self {
+ bytes.as_mut().reverse();
+ Self::from_bytes_be_mod_order(bytes)
}
}
diff --git a/synthesizer/src/coinbase_puzzle/hash.rs b/synthesizer/src/coinbase_puzzle/hash.rs
index b6b93807a..a057737e5 100644
--- a/synthesizer/src/coinbase_puzzle/hash.rs
+++ b/synthesizer/src/coinbase_puzzle/hash.rs
@@ -34,7 +34,7 @@ pub fn hash_to_coefficients<F: PrimeField>(input: &[u8], num_coefficients: u32)
let mut input_with_counter = [0u8; 36];
input_with_counter[..32].copy_from_slice(&hash);
input_with_counter[32..].copy_from_slice(&counter.to_le_bytes());
- F::from_bytes_le_mod_order(&blake2::Blake2b512::digest(input_with_counter))
+ F::from_bytes_le_mod_order(blake2::Blake2b512::digest(input_with_counter))
})
.collect()
}
@@ -53,7 +53,7 @@ pub fn hash_commitment<E: PairingEngine>(commitment: &KZGCommitment<E>) -> Resul
ensure!(bytes.len() == 96, "Invalid commitment byte length for hashing");
// Return the hash of the commitment.
- Ok(E::Fr::from_bytes_le_mod_order(&blake2::Blake2b512::digest(&bytes)))
+ Ok(E::Fr::from_bytes_le_mod_order(blake2::Blake2b512::digest(&bytes)))
}
pub fn hash_commitments<E: PairingEngine>(
This way the API is clear about needing to consume the input, and anything that can be treated like &mut [u8] is compatible. Only Field::new_domain_separator would require an allocation, but it's still a great reduction in allocations overall.
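As a standalone illustration of why the AsMut<[u8]> bound unifies those call sites (the function name here is hypothetical), the same generic signature accepts arrays, Vecs, and mutable slices alike:

```rust
// Hypothetical stand-in for the proposed signature: anything that can be
// viewed as &mut [u8] is accepted, and the input is consumed/mutated.
fn reverse_input<T: AsMut<[u8]>>(mut bytes: T) -> T {
    bytes.as_mut().reverse();
    bytes
}

fn main() {
    // An array, by value.
    assert_eq!(reverse_input([1u8, 2, 3]), [3, 2, 1]);
    // A Vec, by value.
    assert_eq!(reverse_input(vec![1u8, 2, 3]), vec![3, 2, 1]);
    // A mutable slice reference (AsMut<[u8]> has a blanket impl for &mut T).
    let mut buf = [1u8, 2, 3];
    reverse_input(&mut buf[..]);
    assert_eq!(buf, [3, 2, 1]);
}
```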
Alternatively, from_bytes_le_mod_order can be written in a way that doesn't require any mutation whatsoever (it would need to be tested, but I basically just changed the split point and swapped the use of the slice parts, avoiding the double reversal that currently happens when going from from_bytes_le_mod_order to from_bytes_be_mod_order):
fn from_bytes_le_mod_order(bytes: &[u8]) -> Self {
    let num_modulus_bytes = ((Self::Parameters::MODULUS_BITS + 7) / 8) as usize;
    let num_bytes_to_directly_convert = min(num_modulus_bytes - 1, bytes.len());
    let (leading_bytes, remaining_bytes) = bytes.split_at(bytes.len() - num_bytes_to_directly_convert);
    // Copy the trailing (most significant) little-endian bytes directly into a field element.
    // The number of bytes directly converted must be less than the
    // number of bytes needed to represent the modulus, as we must begin
    // modular reduction once the data is of the same number of bytes as the modulus.
    let bytes_to_directly_convert = remaining_bytes;
    // Guaranteed to not be None, as the input is less than the modulus size.
    let mut res = Self::from_random_bytes(bytes_to_directly_convert).unwrap();
    // Update the result with the lower bytes, from the most significant down.
    // We go through existing field arithmetic, which handles the reduction.
    let window_size = Self::from(256u64);
    for byte in leading_bytes.iter().rev() {
        res *= window_size;
        res += Self::from(*byte);
    }
    res
}
While I was doing some more profiling, from_bytes_le_mod_order popped up again; I had forgotten that I'd already proposed the above adjustment, and quickly tried out a functionally identical one (omitting comments):
fn from_bytes_le_mod_order(bytes: &[u8]) -> Self {
    let num_modulus_bytes = ((Self::Parameters::MODULUS_BITS + 7) / 8) as usize;
    let num_bytes_to_directly_convert = bytes.len() - min(num_modulus_bytes - 1, bytes.len());
    let (remaining_bytes, leading_bytes) = bytes.split_at(num_bytes_to_directly_convert);
    let mut res = Self::from_random_bytes(leading_bytes).unwrap();
    let window_size = Self::from(256u64);
    for byte in remaining_bytes.iter().rev() {
        res *= window_size;
        res += Self::from(*byte);
    }
    res
}
and it reduced the CPU use coming from from_bytes_le_mod_order by around 1-1.5 percentage points (from ~6% to ~4.5%).
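To sanity-check the split-and-fold logic independently of the field types, here's a self-contained model using u64 arithmetic modulo a small prime in place of the field (the prime and function names are illustrative, not snarkVM code):

```rust
// Model: a 16-bit "field" modulo P, so num_modulus_bytes = 2 and at most
// one byte can be converted without reduction (mirroring the MODULUS_BITS
// arithmetic in the trait).
const P: u64 = 65_521; // largest prime below 2^16

// Reference: interpret little-endian bytes as an integer, reduced mod P.
fn naive_le_mod_order(bytes: &[u8]) -> u64 {
    bytes.iter().rev().fold(0, |acc, &b| (acc * 256 + b as u64) % P)
}

// Allocation-free variant mirroring the proposed from_bytes_le_mod_order:
// split off the most significant bytes (which fit without reduction), then
// fold the lower bytes in from most to least significant.
fn le_mod_order_no_alloc(bytes: &[u8]) -> u64 {
    let num_modulus_bytes = (16 + 7) / 8; // from MODULUS_BITS = 16
    let directly = std::cmp::min(num_modulus_bytes - 1, bytes.len());
    let (remaining, leading) = bytes.split_at(bytes.len() - directly);
    // "leading" holds the most significant LE bytes; read them directly.
    let mut res = leading.iter().rev().fold(0, |acc, &b| acc * 256 + b as u64);
    for &b in remaining.iter().rev() {
        res = (res * 256 + b as u64) % P; // field mul + add, with reduction
    }
    res
}

fn main() {
    // 0x010001 little-endian = 1 + 65536 = 65537; 65537 mod 65521 = 16.
    assert_eq!(naive_le_mod_order(&[0x01, 0x00, 0x01]), 16);
    assert_eq!(le_mod_order_no_alloc(&[0x01, 0x00, 0x01]), 16);
    // Both agree on a spread of inputs, including empty and longer ones.
    for input in [&[][..], &[0xff][..], &[0xde, 0xad, 0xbe, 0xef, 0x01][..]] {
        assert_eq!(naive_le_mod_order(input), le_mod_order_no_alloc(input));
    }
}
```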
@ljedrz is this issue still relevant? Feel free to propose PRs if there are immediate opportunities to improve memory usage (without impacting runtime performance).
I don't see it in my current memory profiles, but it's likely to pop up again under more organic workloads; in any case, the implementation hasn't changed, so I expect it to still be relevant if it's called often enough.