Streaming Prover, reduce memory overhead of prover
See:
- https://github.com/zcash/zcash/pull/2243/commits/5ddd8908407077028fa48e4d129b6c272dee3841
- https://github.com/zcash/zcash/issues/2679
- https://github.com/zcash/zcash/issues/2235
- https://github.com/zcash/zcash/pull/2243/commits/929b9dbb15fe5b58db2c783c851cc85b3d12de1a
There are several optimisations which can reduce the amount of in-memory data required for the proving key:
- Don't store the proving key in Jacobian coordinates while it's in memory: the Z coordinate can be dropped (affine form) and allocated on the stack when performing calculations.
- Use a directly accessible format which allows the proving key to be mmap'd, and let the kernel figure it out.
- Anything which is used independently and sequentially means the rest can be discarded: the proving key consists of five parts (PK_A, PK_B, PK_C, PK_K, PK_H) which are used independently and sequentially, so load each one at a time rather than all of them.
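The section-at-a-time idea could be realised with a small header of offsets and sizes, so each part can be loaded, used, and freed before the next is read. A minimal sketch (the file layout, struct, and function names here are assumptions for illustration, not libsnark's actual serialisation format):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical sectioned layout: a header of (offset, size) pairs for
// PK_A, PK_B, PK_C, PK_K, PK_H, followed by each section's raw bytes.
struct SectionEntry {
    uint64_t offset;
    uint64_t size;
};

// Load a single section into memory. The caller deserialises it and then
// drops the buffer before requesting the next section, so peak memory is
// one section's worth rather than the whole proving key.
std::vector<char> load_section(std::istream &in, const SectionEntry &e) {
    std::vector<char> buf(static_cast<std::size_t>(e.size));
    in.seekg(static_cast<std::streamoff>(e.offset));
    in.read(buf.data(), static_cast<std::streamsize>(e.size));
    return buf;
}
```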
Either way, we need a benchmark for prove which tracks peak memory usage.
Any changes should avoid modifying libsnark directly; we still want to use it as a submodule.
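For the benchmark, peak memory can be sampled with getrusage (a POSIX sketch; note ru_maxrss is in kilobytes on Linux but bytes on macOS):

```cpp
#include <cassert>
#include <sys/resource.h>

// Peak resident set size of this process so far. Sampling this after a
// prove() call gives the number the benchmark should track; this wraps
// the whole process, so no libsnark modifications are needed.
long peak_rss() {
    struct rusage usage;
    getrusage(RUSAGE_SELF, &usage);
    return usage.ru_maxrss;  // KB on Linux, bytes on macOS
}
```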
With the mmap approach, if the offsets & sizes of sections for A, B, C, K and H are known ahead of time, then madvise can be used depending on the access pattern. If it's purely sequential then MADV_SEQUENTIAL can be used, however if it's at random then doing a pre-emptive sequential read into memory will offer a performance improvement.
Either way - disk reads vs memory usage trade-off again.
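A sketch of the mmap route (POSIX-only; error handling pared down, names illustrative). Here the whole file is advised as sequential, since the per-section offsets aren't defined yet; with known offsets, madvise could be applied per section instead:

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a file read-only and hint the kernel that it will be read
// sequentially, so it can read ahead aggressively and drop pages behind
// us. Returns the mapping base (and its length), or nullptr on failure.
const char *map_sequential(const char *path, size_t &len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    size_t len = static_cast<size_t>(st.st_size);
    void *base = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping keeps its own reference to the file
    if (base == MAP_FAILED) return nullptr;
    // With known offsets for the A, B, C, K, H sections, a per-section
    // MADV_SEQUENTIAL or MADV_WILLNEED could be issued here instead.
    madvise(base, len, MADV_SEQUENTIAL);
    len_out = len;
    return static_cast<const char *>(base);
}
```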
After some testing it seems the disk load of the proving key is nearly instant; however, when using the standard ostream and istream serialisation, it takes a significant amount of time to read from the stringstream into the object.
Part of the problem is that the BINARY_OUTPUT flag in bigint.cc determines whether loading parses a string representation or reads a binary representation. However, this flag already seems to be enabled by default!
I think if this could be sped up then the prove / verify time would be reduced significantly, and, more importantly, so would its growth with the number of circuits. How can it take 1 minute (after loading all the data from disk) to parse 2.5m lines? With a larger number of circuits this will take even longer...
So:
- `alt_bn128_q_limbs` is `(alt_bn128_q_bitcount+GMP_NUMB_BITS-1)/GMP_NUMB_BITS`
- `bigint<n>` has `mp_limb_t data[n]`
- `Fp_model` has `bigint<n> mont_repr` where `n` is `alt_bn128_q_limbs` (for alt_bn128)
- `alt_bn128_Fq` is `Fp_model<alt_bn128_q_limbs, alt_bn128_modulus_q>`
- `alt_bn128_g1` has `X`, `Y` and `Z`, which are `alt_bn128_Fq` types.
None of the classes above have any other variables other than those noted.
So, assuming tight packing, an alt_bn128_g1 point should occupy sizeof(mp_limb_t) * n * 3 bytes, and alt_bn128_g2 should occupy sizeof(mp_limb_t) * n * 6. This means that a reinterpret_cast<alt_bn128_g1*> of an mp_limb_t[n*3] array should be usable as an alt_bn128_g1 object without problems.
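These size assumptions can be sanity-checked without libsnark, using stand-in structs that mirror the layout described above (the struct names, the `limb_t` typedef, and `LIMBS = 4` for a 64-bit machine are assumptions of this sketch, not the real libsnark types):

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins mirroring the layout described above: these are NOT the real
// libsnark classes, just structs with the same data members. mp_limb_t is
// modelled as unsigned long (its usual definition on 64-bit GMP builds).
typedef unsigned long limb_t;
static const std::size_t LIMBS = 4;  // alt_bn128_q_limbs on a 64-bit machine

struct Bigint { limb_t data[LIMBS]; };  // bigint<n>
struct Fq     { Bigint mont_repr; };    // Fp_model<n, modulus>
struct G1     { Fq X, Y, Z; };          // alt_bn128_g1: three Fq coordinates

// If these hold, a limb_t[LIMBS * 3] region can be reinterpreted as a G1.
static_assert(sizeof(Bigint) == sizeof(limb_t) * LIMBS, "padding in Bigint");
static_assert(sizeof(G1) == sizeof(limb_t) * LIMBS * 3, "padding in G1");
```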
If these assumptions are true, then it should be possible to dump the proving key in a format which can be mmap'd. There are two complications: the compiler may insert alignment padding (though, assuming mp_limb_t is the native pointer size, it shouldn't), and the {A,B,C}_query members are of Boost::sparse_vector type while {H,K}_query are of std::vector type, so they can't be mapped as flat arrays directly.
Need to investigate the density of the {A,B,C}_query sparse vectors
With the LongsightF gadget this seems like less of a priority, as the serialised circuit is quite small compared to the full SHA256 merkle tree proof gadget.