Streaming Prover, reduce memory overhead of prover
See:
- https://github.com/zcash/zcash/pull/2243/commits/5ddd8908407077028fa48e4d129b6c272dee3841
- https://github.com/zcash/zcash/issues/2679
- https://github.com/zcash/zcash/issues/2235
- https://github.com/zcash/zcash/pull/2243/commits/929b9dbb15fe5b58db2c783c851cc85b3d12de1a
There are several optimisations which can reduce the amount of in-memory data required for the proving key:
- Don't store the proving key in Jacobian coordinates while it's in memory: the Z coordinate can be dropped (affine form) and allocated on the stack when performing calculations.
- Use a directly accessible format which allows the proving key to be mmap'd, and let the kernel figure it out.
- Anything which is used independently and sequentially means the rest can be discarded: the proving key consists of five parts (PK_A, PK_B, PK_C, PK_K, PK_H) which are used independently and sequentially, so load each one at a time rather than all of them.
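The section-at-a-time idea could be realised with a small header of offsets and sizes, so each part can be loaded, used, and freed before the next is read. A minimal sketch (the file layout, struct, and function names here are assumptions for illustration, not libsnark's actual serialisation format):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical sectioned layout: a header of (offset, size) pairs for
// PK_A, PK_B, PK_C, PK_K, PK_H, followed by each section's raw bytes.
struct SectionEntry {
    uint64_t offset;
    uint64_t size;
};

// Load a single section into memory. The caller deserialises it and then
// drops the buffer before requesting the next section, so peak memory is
// one section's worth rather than the whole proving key.
std::vector<char> load_section(std::istream &in, const SectionEntry &e) {
    std::vector<char> buf(static_cast<std::size_t>(e.size));
    in.seekg(static_cast<std::streamoff>(e.offset));
    in.read(buf.data(), static_cast<std::streamsize>(e.size));
    return buf;
}
```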
Either way, we need a benchmark for prove which tracks peak memory usage.
Any changes should avoid modifying libsnark directly; we still want to use it as a submodule.
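For the benchmark, peak memory can be sampled with getrusage (a POSIX sketch; note ru_maxrss is in kilobytes on Linux but bytes on macOS):

```cpp
#include <cassert>
#include <sys/resource.h>

// Peak resident set size of this process so far. Sampling this after a
// prove() call gives the number the benchmark should track; this wraps
// the whole process, so no libsnark modifications are needed.
long peak_rss() {
    struct rusage usage;
    getrusage(RUSAGE_SELF, &usage);
    return usage.ru_maxrss;  // KB on Linux, bytes on macOS
}
```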
With the mmap approach, if the offsets & sizes of sections for A, B, C, K and H are known ahead of time, then madvise can be used depending on the access pattern. If it's purely sequential then MADV_SEQUENTIAL can be used, however if it's at random then doing a pre-emptive sequential read into memory will offer a performance improvement.
Either way - disk reads vs memory usage trade-off again.
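A sketch of the mmap route (POSIX-only; error handling pared down, names illustrative). Here the whole file is advised as sequential, since the per-section offsets aren't defined yet; with known offsets, madvise could be applied per section instead:

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a file read-only and hint the kernel that it will be read
// sequentially, so it can read ahead aggressively and drop pages behind
// us. Returns the mapping base (and its length), or nullptr on failure.
const char *map_sequential(const char *path, size_t &len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    size_t len = static_cast<size_t>(st.st_size);
    void *base = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping keeps its own reference to the file
    if (base == MAP_FAILED) return nullptr;
    // With known offsets for the A, B, C, K, H sections, a per-section
    // MADV_SEQUENTIAL or MADV_WILLNEED could be issued here instead.
    madvise(base, len, MADV_SEQUENTIAL);
    len_out = len;
    return static_cast<const char *>(base);
}
```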
After some testing it seems the disk load of the proving key is nearly instant; however, when using the standard ostream and istream serialisation, it takes a significant amount of time to read from the stringstream into the object.
Part of the problem is that the BINARY_OUTPUT flag in bigint.cc determines whether loading parses a string representation or reads a binary representation. However, this flag already seems to be enabled by default!
I think if this could be sped up then the prove / verify time would be reduced significantly, and, more importantly, so would its growth with the number of circuits. How can it take 1 minute (after loading all the data from disk) to parse 2.5m lines? With a larger number of circuits this will take even longer...
So:
- `alt_bn128_q_limbs` is `(alt_bn128_q_bitcount+GMP_NUMB_BITS-1)/GMP_NUMB_BITS`
- `bigint<n>` has `mp_limb_t data[n]`
- `Fp_model` has `bigint<n> mont_repr` where `n` is `alt_bn128_q_limbs` (for alt_bn128)
- `alt_bn128_Fq` is `Fp_model<alt_bn128_q_limbs, alt_bn128_modulus_q>`
- `alt_bn128_g1` has `X`, `Y` and `Z`, which are `alt_bn128_Fq` types.
None of the classes above have any other variables other than those noted.
So, assuming tight packing, an alt_bn128_g1 point should occupy sizeof(mp_limb_t) * n * 3 bytes, and alt_bn128_g2 should occupy sizeof(mp_limb_t) * n * 6. This means that a reinterpret_cast<alt_bn128_g1*> of an mp_limb_t[n*3] array should be usable as an alt_bn128_g1 object without problems.
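These size assumptions can be sanity-checked without libsnark, using stand-in structs that mirror the layout described above (the struct names, the `limb_t` typedef, and `LIMBS = 4` for a 64-bit machine are assumptions of this sketch, not the real libsnark types):

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins mirroring the layout described above: these are NOT the real
// libsnark classes, just structs with the same data members. mp_limb_t is
// modelled as unsigned long (its usual definition on 64-bit GMP builds).
typedef unsigned long limb_t;
static const std::size_t LIMBS = 4;  // alt_bn128_q_limbs on a 64-bit machine

struct Bigint { limb_t data[LIMBS]; };  // bigint<n>
struct Fq     { Bigint mont_repr; };    // Fp_model<n, modulus>
struct G1     { Fq X, Y, Z; };          // alt_bn128_g1: three Fq coordinates

// If these hold, a limb_t[LIMBS * 3] region can be reinterpreted as a G1.
static_assert(sizeof(Bigint) == sizeof(limb_t) * LIMBS, "padding in Bigint");
static_assert(sizeof(G1) == sizeof(limb_t) * LIMBS * 3, "padding in G1");
```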
If these assumptions are true, then it should be possible to dump the proving key in a format which can be mmap'd. There are two complications: the compiler may insert alignment padding (though, assuming mp_limb_t is the native pointer size, it shouldn't), and the {A,B,C}_query members are of Boost::sparse_vector type while {H,K}_query are of std::vector type, so they can't be mapped as flat arrays directly.
Need to investigate the density of the {A,B,C}_query sparse vectors
With the LongsightF gadget this seems like less of a priority, as the serialised circuit is quite small compared to the full SHA256 merkle tree proof gadget.