
Heap profiling for memory bottleneck discovery


Use ad-hoc heap allocation profiling to see how memory is being consumed and how we can optimize it.

See the dhat docs at https://docs.rs/dhat/latest/dhat/ for more info.

CPerezz opened this issue Dec 23 '22 11:12

I think @ed255 actually did this and found where memory spikes the most (when we hold both the evaluation and coefficient forms of all our polynomials).

And IIRC, I saw that we can't really avoid that (although we can revisit ofc).

@ed255 Can you maybe post a summary here so that we can trigger a discussion on how/whether it's worth solving this?

CPerezz commented Feb 02 '24 13:02

I used dhat to profile memory usage and make sure that the frontend-backend split didn't increase the memory consumption compared to the legacy halo2. My impressions of the profiler were:

  • Positive: It's very easy to use; you just import it, create the profiler variable, and that's it (see the sketch after this list).
  • Negative: Sometimes the report doesn't have the full stack trace of an allocation.
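
For reference, this is roughly what the setup looks like per the dhat docs; a minimal sketch where the profiled workload is just a placeholder comment, not actual halo2 prover code:

```rust
// Minimal dhat heap-profiling setup (see https://docs.rs/dhat/latest/dhat/).
// Replace the allocator, create the profiler, and run the code to measure.
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    let _profiler = dhat::Profiler::new_heap();

    // ... run the prover / code under measurement here ...

    // When `_profiler` is dropped at the end of main, dhat writes
    // dhat-heap.json, which can be inspected with the dhat viewer.
}
```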

I didn't analyze the profiling results to see which part of halo2 was using the most memory; I was only paying attention to the comparison between the frontend-backend split and legacy halo2. Nevertheless, on a different occasion I tried to reason about the biggest source of memory usage in halo2, and here are my conclusions.

Constants:

  • M: number of advice columns
    • NOTE: This number grows with the implicit polynomials required for copy constraints and lookups, so it's more accurate to say that M is the number of committed polynomials.
  • n: number of rows (2^k)

Once the circuit is big enough, this is the biggest source of memory usage:

  1. In proof creation we hold the witness matrix in memory. This is M * n * sizeof(Assigned<F>).
  2. We convert this to M * n * sizeof(F) in the batch_invert_assigned step (a back-of-the-envelope sketch follows this list).
    • This reduces the size, because Assigned<F> is ~ 2 * sizeof(F).
  3. Then these vectors are committed; to do so they are extended by a factor of 2^j (where j depends on the max expression degree of the circuit). I think the extended form is not kept in memory after commitment, but I'm not sure.
  4. After all commitments are done, we perform an evaluation, which needs all the data from point (2).
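
To make points (1) and (2) concrete, here is a hedged back-of-the-envelope sketch; the values of M, k, and the field size are made-up placeholders, not taken from any particular circuit:

```rust
// Rough witness-memory estimate for steps (1) and (2) above.
// Assumes a ~254-bit field whose elements take 32 bytes, and
// Assigned<F> ~ 2 * sizeof(F) as stated in the list.
fn main() {
    let m: u64 = 100;     // committed polynomials (placeholder)
    let k: u32 = 20;      // log2 of the number of rows (placeholder)
    let n: u64 = 1 << k;  // number of rows
    let size_f: u64 = 32; // sizeof(F) in bytes
    let size_assigned: u64 = 2 * size_f; // sizeof(Assigned<F>)

    let step1 = m * n * size_assigned; // witness held as Assigned<F>
    let step2 = m * n * size_f;        // after batch_invert_assigned

    let gib = (1u64 << 30) as f64;
    println!("step (1): {:.2} GiB", step1 as f64 / gib);
    println!("step (2): {:.2} GiB", step2 as f64 / gib);
}
```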

From a discussion with @han0110, he mentioned that Scroll did a trick to cap the max memory usage by "computing the evaluations on extended domain and quotient chunk by chunk"; but I don't fully understand this idea :sweat_smile:
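
My best guess at what that means, sketched below (a hedged sketch only; evaluate_chunk and commit_chunk are hypothetical helpers, not Scroll's actual code, and u64 stands in for a field element), is that each coset of the extended domain is evaluated and committed one at a time, so only one coset-sized chunk is live in memory:

```rust
// Hedged sketch of "chunk by chunk" quotient evaluation.
// Rather than materializing every polynomial on the full extended
// domain (n * 2^j points per polynomial) at once, each coset of size n
// is evaluated, committed, and dropped before the next one, keeping the
// extra memory at O(n) instead of O(n * 2^j).
fn commit_quotient_chunked<C>(
    num_chunks: usize,                          // 2^j cosets of the extended domain
    evaluate_chunk: impl Fn(usize) -> Vec<u64>, // hypothetical: gate expressions on coset i
    commit_chunk: impl Fn(&[u64]) -> C,         // hypothetical: commitment to one chunk
) -> Vec<C> {
    (0..num_chunks)
        .map(|i| {
            // Only this coset's evaluations live in memory at a time.
            let chunk = evaluate_chunk(i);
            // The chunk can be dropped as soon as it is committed.
            commit_chunk(&chunk)
        })
        .collect()
}
```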

ed255 commented Feb 05 '24 11:02

Nice write up!

I believe the optimization Han mentioned is https://github.com/scroll-tech/halo2/pull/28#issuecomment-1407968119. We have also observed that it greatly reduces memory usage.

jonathanpwang commented Feb 06 '24 04:02

Working on porting this here! Trying to get good benchmarks to make available with the PR.

CPerezz commented Feb 08 '24 10:02

Superseded by https://github.com/privacy-scaling-explorations/halo2/issues/291

davidnevadoc commented Jun 11 '24 15:06