
parallelize inline_all_lcs() API

Open brucechin opened this issue 3 years ago • 9 comments

Hello, recently I have been working on a project consisting of a huge number of constraints and linear combinations. Before this issue, I asked about improving memory consumption in https://github.com/arkworks-rs/snark/issues/324. However, the setup phase that generates the CRS is very slow, and I found that most of the time is spent in the inline_all_lcs() function.

Previously, removing unnecessary LCs greatly reduced memory consumption (by roughly 90%). However, this function still iterates over self.lc_map on a single thread. https://github.com/arkworks-rs/snark/blob/54a0b1a1267a258c87d118d9d5886d443a8e3c36/relations/src/r1cs/constraint_system.rs#L288

May I ask whether the developer team has any plans to parallelize this function? Performance could improve a lot just by carefully inserting and removing elements of the transformed_lc_map atomically from a thread pool. This feature should be straightforward for someone familiar with concurrent programming in Rust. Thanks!
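A minimal sketch of the kind of pattern being proposed, using rayon over simplified stand-in types (`LcIndex`, `Lc`, and `inline_one` are illustrative, not the real arkworks types); as the replies below explain, LCs form chains that reference other LCs, so a single parallel pass like this is not actually sufficient:

```rust
use rayon::prelude::*;
use std::collections::BTreeMap;

// Simplified stand-ins for the real arkworks types.
type LcIndex = usize;
type Lc = Vec<(i64, LcIndex)>; // (coefficient, variable/LC index) pairs

// One level of substitution: replace any term that refers to another
// symbolic LC with that LC's terms, scaled by the outer coefficient.
fn inline_one(lc_map: &BTreeMap<LcIndex, Lc>, lc: &Lc) -> Lc {
    lc.iter()
        .flat_map(|&(coeff, var)| match lc_map.get(&var) {
            Some(inner) => inner.iter().map(|&(c, v)| (coeff * c, v)).collect::<Vec<_>>(),
            None => vec![(coeff, var)],
        })
        .collect()
}

// Inline every LC in parallel. Note: this performs only one level of
// substitution; the real inline_all_lcs() must resolve chains of LCs
// referencing each other, which is where the sequential dependency
// discussed in the replies below comes from.
fn inline_all_lcs_parallel(lc_map: &BTreeMap<LcIndex, Lc>) -> BTreeMap<LcIndex, Lc> {
    lc_map
        .par_iter()
        .map(|(&idx, lc)| (idx, inline_one(lc_map, lc)))
        .collect()
}
```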

brucechin avatar Jan 03 '21 04:01 brucechin

Ah, it is a little bit hard to rewrite this function in a parallel way (it is somewhat inherently sequential).

One potential temporary solution is to avoid adding many FpVars one by one, and instead to take the sum as a witness variable and enforce a constraint over it. Currently, adding two FpVars creates a new symbolic LC.

(By the way, for matrix multiplication, you may want to take a look at Virgo [https://github.com/sunblaze-ucb/Virgo])

weikengchen avatar Jan 03 '21 05:01 weikengchen

AFAIU, the correctness guarantees of that function as written require it to be executed sequentially.

There are a couple of things I'd like to understand:

  1. How many constraints are there vs. how many symbolic LCs are there in this constraint system?
  2. (Presumably) how are we getting so many symbolic LCs, and are there ways we can avoid creating so many? (If it's all coming from one gadget / usage pattern, can we make that gadget introduce fewer symbolic LCs for now, in the absence of a general solution?)
  3. What is taking all the time in inline_all_lcs? Is it mallocs? If so, maybe we can reduce the number of distinct malloc calls (pooling / allocating more at once). We can definitely reduce the amount of data #122. Is it BTree search calls? If so, how would that compare against alternate data structures?

An approach specific to improving this with parallelism: could we build / finalize multiple "sub-constraint-systems" in parallel and have some combining procedure? (With different handling for shared and non-shared variables in the combining process.)

ValarDragon avatar Jan 03 '21 05:01 ValarDragon

> Ah, it is a little bit hard to rewrite this function in a parallel way (it is somewhat inherently sequential).
>
> One potential temporary solution is to avoid adding many FpVars one by one, and instead to take the sum as a witness variable and enforce a constraint over it. Currently, adding two FpVars creates a new symbolic LC.
>
> (By the way, for matrix multiplication, you may want to take a look at Virgo [https://github.com/sunblaze-ucb/Virgo])

Consider, for example, the dot product of a vector A with a vector B, giving res. In snark, I need to create an FpVar for each element of A and B. But if the dot product is very long, inlining those linear combinations takes a lot of time.

brucechin avatar Jan 03 '21 06:01 brucechin

> AFAIU, the correctness guarantees of that function as written require it to be executed sequentially.
>
> There are a couple of things I'd like to understand:
>
>   1. How many constraints are there vs. how many symbolic LCs are there in this constraint system?
>   2. (Presumably) how are we getting so many symbolic LCs, and are there ways we can avoid creating so many? (If it's all coming from one gadget / usage pattern, can we make that gadget introduce fewer symbolic LCs for now, in the absence of a general solution?)
>   3. What is taking all the time in inline_all_lcs? Is it mallocs? If so, maybe we can reduce the number of distinct malloc calls (pooling / allocating more at once). We can definitely reduce the amount of data #122. Is it BTree search calls? If so, how would that compare against alternate data structures?
>
> An approach specific to improving this with parallelism: could we build / finalize multiple "sub-constraint-systems" in parallel and have some combining procedure? (With different handling for shared and non-shared variables in the combining process.)

  1. Because I use constant wire * witness multiplications in the vector dot product, the number of constraints is not large. But the number of linear combinations can be thousands of times larger.
  2. Because the matrix is so large, I have to create many constant/witness wires for the multiplications.
  3. I have not done detailed profiling yet, so I cannot provide a concrete profiling report for the inline_all_lcs() function here.

brucechin avatar Jan 03 '21 06:01 brucechin

> Ah, it is a little bit hard to rewrite this function in a parallel way (it is somewhat inherently sequential). One potential temporary solution is to avoid adding many FpVars one by one, and instead to take the sum as a witness variable and enforce a constraint over it. Currently, adding two FpVars creates a new symbolic LC. (By the way, for matrix multiplication, you may want to take a look at Virgo [https://github.com/sunblaze-ucb/Virgo])

Consider, for example, the dot product of a vector A with a vector B, giving res. In snark, I need to create an FpVar for each element of A and B:

```rust
// Assumes `FqVar` is an alias for `FpVar<Fq>` and the usual
// r1cs-core / r1cs-std / ark-ff imports are in scope.

// Multiply one u8 witness by a u8 constant. Each call allocates a new
// witness variable, and the constant multiplication creates a new
// symbolic LC rather than a constraint.
fn constant_mul_cs_helper_u8(cs: ConstraintSystemRef<Fq>, a: u8, constant: u8) -> FqVar {
    let aa: Fq = a.into();
    let cc: Fq = constant.into();
    let a_var = FpVar::<Fq>::new_witness(r1cs_core::ns!(cs, "a gadget"), || Ok(aa)).unwrap();
    let c_var = FpVar::Constant(cc);
    a_var.mul(c_var)
}

// Dot product of a witness vector with a constant weight vector.
// Every `+=` below creates another symbolic LC, so a length-n dot
// product produces O(n) symbolic LCs for inline_all_lcs() to process.
fn scala_cs_helper_u8(
    cs: ConstraintSystemRef<Fq>,
    input: &[u8],
    weight: &[u8],
) -> FqVar {
    let mut tmp1 =
        FpVar::<Fq>::new_witness(r1cs_core::ns!(cs, "q1*q2 gadget"), || Ok(Fq::zero())).unwrap();
    for i in 0..input.len() {
        tmp1 += constant_mul_cs_helper_u8(cs.clone(), input[i], weight[i]);
    }
    tmp1
}
```

The code above does not generate many constraints, because every multiplication is a constant wire times a witness. But if the dot product is very long, inlining the resulting linear combinations takes a lot of time.

Yes. The situation is that the current system generates many symbolic LCs for this. When an FpVar is multiplied by a constant, a new LC is created, and every += creates another symbolic LC. In fact, the sum is merely a linear combination of the inputs: it should require only one constraint and one LC.
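A minimal sketch of this one-constraint approach, dropping down to the constraint-system API (assuming ark-relations/r1cs-core-style `lc!`, `new_witness_variable`, and `enforce_constraint`; the helper name `dot_product_one_lc` is hypothetical):

```rust
use ark_ff::PrimeField;
use ark_relations::lc;
use ark_relations::r1cs::{ConstraintSystemRef, SynthesisError, Variable};

// Hypothetical helper: enforce `sum_i constants[i] * vars[i] == sum`
// with a single witness, a single LC, and a single constraint,
// instead of one symbolic LC per `+=`.
fn dot_product_one_lc<F: PrimeField>(
    cs: ConstraintSystemRef<F>,
    constants: &[F],
    vars: &[Variable],
    values: &[F],
) -> Result<Variable, SynthesisError> {
    // Compute the sum natively and allocate it as one witness variable.
    let sum: F = constants.iter().zip(values).map(|(c, v)| *c * v).sum();
    let sum_var = cs.new_witness_variable(|| Ok(sum))?;

    // Build a single linear combination over all the inputs at once.
    let mut acc = lc!();
    for (c, v) in constants.iter().zip(vars) {
        acc = acc + (*c, *v);
    }

    // Enforce acc * 1 = sum_var: one constraint, one LC, regardless
    // of the dot-product length.
    cs.enforce_constraint(acc, lc!() + Variable::One, lc!() + sum_var)?;
    Ok(sum_var)
}
```

With something like this, a length-n dot product contributes one constraint and one LC regardless of n, which is what the comment above suggests.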

weikengchen avatar Jan 03 '21 07:01 weikengchen

Yeah. And performance suffers a lot from such a huge number of symbolic LCs. I thought constant * FpVar was almost free, but it turns out not to be negligible :( For this specific case, do you have any suggestions to reduce the number of LCs? I think it would be incorrect to accumulate the dot product in a u64 and only turn the accumulated result into an FpVar.

brucechin avatar Jan 04 '21 04:01 brucechin

Some microbenchmarking results:

- constant vector * witness vector, length 10000: CRS generation time ~60 seconds, prove time ~60 seconds, which is far from almost free.
- witness vector * witness vector, length 10000: CRS generation time ~80 seconds, prove time ~70 seconds.

During the inline_all_lcs() function, ~90% of the execution time is spent on https://github.com/arkworks-rs/snark/blob/8d9055d5397b510716ad2951ce1f18675aebe7c8/relations/src/r1cs/constraint_system.rs#L343

brucechin avatar Jan 05 '21 08:01 brucechin

Hey @brucechin, could you provide a link to the benchmark? Those numbers seem terrible.

Pratyush avatar Jan 05 '21 15:01 Pratyush

This is the microbenchmark I wrote: https://github.com/brucechin/vector-dot-product-microbench/tree/master

brucechin avatar Jan 06 '21 00:01 brucechin