Heap profiling for memory bottleneck discovery
Use ad-hoc heap allocation profiling to see how memory is being consumed and how we can optimize it.
See the https://docs.rs/dhat/latest/dhat/ docs for more info.
I think @ed255 actually did this and saw the place where memory spikes the most (when we hold the Evaluation & Coefficient forms of all our polynomials).
And IIRC, I saw that we can't really avoid that (although we can revisit, of course).
@ed255 Can you maybe post a summary here so that we can trigger a discussion on how/whether it's worth solving this?
I used dhat to profile memory usage and make sure that the frontend-backend split didn't increase the memory consumption compared to the legacy halo2. My impressions of the profiler were:
- Positive: It's very easy to use; you just import it, create the profiler variable, and that's it.
- Negative: Sometimes the report doesn't have the full stack trace of an allocation.
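For reference, the whole setup is just a few lines. Here is a minimal sketch (the only dependency is the `dhat` crate; `run_prover` is a placeholder for whatever code you want to profile):

```rust
// Cargo.toml: dhat = "0.3"

// Route all heap allocations through dhat's instrumented allocator.
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    // Profiling starts here; a dhat-heap.json report is written
    // when `_profiler` is dropped at the end of main.
    let _profiler = dhat::Profiler::new_heap();

    run_prover(); // placeholder: the code being profiled
}

fn run_prover() {
    // Stand-in workload so the example is self-contained.
    let v: Vec<u64> = (0..1_000_000).collect();
    std::hint::black_box(v);
}
```

The resulting `dhat-heap.json` can then be loaded in the DHAT viewer to inspect allocation sites.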
I didn't analyze the profiling results to see which part of halo2 was using the most memory; I was only paying attention to the comparison between the frontend-backend split and legacy halo2. Nevertheless, on a different occasion I tried to reason about the biggest sources of memory usage in halo2, and here are my conclusions.
Constants:
- `M`: number of advice columns.
  - NOTE: This number will be bigger due to implicit polynomials required for copy constraints and lookups, so it's more accurate to say `M` is the number of committed polynomials.
- `n`: number of rows (`2^k`).
Once the circuit is big enough, this is the biggest source of memory usage:
1. In proof creation we hold the witness matrix in memory. This is `M * n * sizeof(Assigned<F>)`.
2. We convert this to `M * n * sizeof(F)` in the `batch_invert_assigned` step.
   - This reduces the size, because `Assigned<F>` is ~ `2 * sizeof(F)`.
3. Then these vectors are committed; to do so they are extended `2^j` times (where `j` depends on the max expression degree of the circuit). I think the extended form is not kept in memory after commitment, but I'm not sure.
4. After all commitments are done, we perform an evaluation, which needs all the data from point (2).
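To get a feel for the magnitudes, here is a back-of-the-envelope calculation; the parameters (`m`, `k`, the field element size, `j`) are illustrative assumptions, not measurements from a real circuit:

```rust
// Rough memory estimate for the steps above, with made-up parameters.
fn main() {
    let m: u64 = 100; // assumed number of committed polynomials
    let k: u32 = 20; // assumed log2 of the number of rows
    let n: u64 = 1 << k; // number of rows
    let sizeof_f: u64 = 32; // e.g. a 256-bit field element
    let sizeof_assigned: u64 = 2 * sizeof_f; // Assigned<F> ~ 2 * sizeof(F)
    let j: u32 = 3; // assumed log2 of the extension factor

    let gib = |bytes: u64| bytes as f64 / (1u64 << 30) as f64;
    println!("1. witness as Assigned<F>:      {:.2} GiB", gib(m * n * sizeof_assigned));
    println!("2. after batch_invert_assigned: {:.2} GiB", gib(m * n * sizeof_f));
    println!("3. extended domain (2^j = 8):   {:.2} GiB", gib((m * n * sizeof_f) << j));
}
```

With these (made-up) numbers the extended-domain step dominates, which is consistent with the chunking trick below targeting exactly that step.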
From a discussion with @han0110, he mentioned that Scroll did a trick to cap the max memory usage by "computing the evaluations on extended domain and quotient chunk by chunk"; but I don't fully understand this idea :sweat_smile:
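In case it helps trigger the discussion, here is one possible reading of that trick. This is purely my guess at its shape (the PR comment linked below has the real implementation), and every name and type here is an illustrative placeholder:

```rust
use std::ops::Range;

/// Hypothetical sketch: instead of materializing the constraint evaluations
/// over the whole extended domain at once, process one chunk at a time and
/// drop each buffer after it has been folded into the commitment.
fn quotient_by_chunks(
    extended_len: usize,
    num_chunks: usize,
    eval_constraints_on: impl Fn(Range<usize>) -> Vec<u64>, // u64 stands in for F
    commit: impl Fn(&[u64]) -> [u8; 32],                    // stand-in commitment
) -> Vec<[u8; 32]> {
    let chunk_len = extended_len / num_chunks;
    let mut commitments = Vec::with_capacity(num_chunks);
    for c in 0..num_chunks {
        // Only this slice of the extended domain is alive in memory,
        // so peak usage is ~extended_len / num_chunks elements.
        let chunk = eval_constraints_on(c * chunk_len..(c + 1) * chunk_len);
        commitments.push(commit(&chunk));
        // `chunk` is dropped here before the next iteration allocates.
    }
    commitments
}

fn main() {
    let commitments = quotient_by_chunks(
        1 << 10,
        4,
        |range| range.map(|i| i as u64).collect(), // fake evaluations
        |chunk| {
            let mut c = [0u8; 32];
            c[0] = (chunk.len() & 0xff) as u8; // fake commitment
            c
        },
    );
    assert_eq!(commitments.len(), 4);
}
```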
Nice write up!
I believe the optimization Han mentioned is https://github.com/scroll-tech/halo2/pull/28#issuecomment-1407968119. We have also observed that it greatly reduces memory usage.
Working on porting this here! Trying to get good benchmarks to make available with the PR.
Superseded by https://github.com/privacy-scaling-explorations/halo2/issues/291