
Re-use buffers in high-throughput execution paths.

Open eeSeeGee opened this issue 5 years ago • 10 comments

This change attempts to avoid a large amount of wasted, single-use memory by reusing very commonly recreated buffers.

This issue was found with a test program that runs Ed25519::verify once per second, which Java Flight Recorder shows will leave ~120MiB of dead memory lying around after five minutes. This ramps up DRAMATICALLY with traffic.
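For reference, a minimal sketch of that kind of test program, using Tink's public Ed25519Sign/Ed25519Verify subtle classes (this is a reconstruction for illustration, not the exact harness used for the JFR measurement):

  import com.google.crypto.tink.subtle.Ed25519Sign;
  import com.google.crypto.tink.subtle.Ed25519Verify;

  public final class VerifyAllocationRepro {
    public static void main(String[] args) throws Exception {
      Ed25519Sign.KeyPair keyPair = Ed25519Sign.KeyPair.newKeyPair();
      Ed25519Sign signer = new Ed25519Sign(keyPair.getPrivateKey());
      Ed25519Verify verifier = new Ed25519Verify(keyPair.getPublicKey());

      byte[] message = new byte[64];
      byte[] signature = signer.sign(message);

      // One verification per second for five minutes; attach Java Flight Recorder
      // and look at the TLAB allocation events attributed to Ed25519/Field25519.
      for (int i = 0; i < 300; i++) {
        verifier.verify(signature, message);
        Thread.sleep(1000);
      }
    }
  }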

eeSeeGee avatar Oct 04 '19 13:10 eeSeeGee

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. In order to pass this check, please resolve this problem and then comment "@googlebot I fixed it." If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

googlebot avatar Oct 04 '19 13:10 googlebot

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

googlebot avatar Oct 04 '19 13:10 googlebot

Thanks for improving Tink! This looks good to me.

How much memory are we saving with this PR?

@bleichen or @tholenst, any opinions?

thaidn avatar Oct 16 '19 22:10 thaidn

I'm not happy that this might be needed. Of course, Aaron's point is that it is needed :/ However, at the moment I would still somewhat prefer not to merge this.

The problems I see are:

(1) In general, it increases code complexity, and hence the likelihood of errors.

(2) It makes things more brittle: if someone later decides it would be useful to call "add" in some uncommonly taken code path inside "mult", the shared buffer will be overwritten and things will break (see the sketch below).

(3) It works around a problem of the JVM that is present today; nothing says this problem will still be present at some unspecified point in the future. In fact, the more cores your machine has, the more memory this approach wastes.
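To make (2) concrete, here is a minimal sketch of the hazard, assuming the thread-local scratch-buffer pattern the PR appears to use (the names and sizes below are illustrative, not taken from the actual diff):

  final class SharedScratch {
    // One buffer per thread, reused across calls to avoid per-call allocation.
    private static final ThreadLocal<long[]> SCRATCH =
        ThreadLocal.withInitial(() -> new long[19]);  // e.g. 2 * LIMB_CNT - 1

    static void mult(long[] out, long[] a, long[] b) {
      long[] t = SCRATCH.get();
      // ... fill t with intermediate products, then reduce t into out ...
      // If a future refactoring calls another helper from here that also uses
      // SCRATCH, that nested call silently overwrites t and corrupts the result.
    }
  }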

Would an alternative solution be to allow the offending functions to take a "scratch space" buffer? The advantage would be that then the dependency would be explicit, and we wouldn't need Thread Locals.
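A hypothetical shape for that (this is not an existing Field25519 signature, just an illustration of making the dependency explicit):

  // Hypothetical: the caller owns and passes the scratch space, so there is no
  // hidden static or thread-local state inside the field arithmetic.
  static void mult(long[] out, long[] a, long[] b, long[] scratch) {
    // Callers guarantee scratch.length >= 2 * LIMB_CNT - 1.
    // ... schoolbook-multiply a and b into scratch, then reduce into out ...
  }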

Another solution might be to have a non-thread-safe executor object which includes a buffer. Then, instead of something like:

  Field25519.square(u, y);
  Field25519.mult(v, u, D);
  Field25519.sub(u, u, z); // u = y^2 - 1
  Field25519.sum(v, v, z); // v = dy^2 + 1

one would write:

  Field25519Calculator calculator = new Field25519Calculator();
  calculator.square(u, y);
  calculator.mult(v, u, D);
  calculator.sub(u, u, z); // u = y^2 - 1
  calculator.sum(v, v, z); // v = dy^2 + 1

I think this would also be more intuitive.
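Internally, such a calculator could own the temporaries itself, so reuse is scoped to one single-threaded computation rather than hidden in thread-locals. A rough sketch, assuming hypothetical scratch-accepting variants of the Field25519 helpers (they do not exist in the current code):

  // Not thread-safe by design: each sign/verify operation constructs its own
  // calculator (or borrows one from a pool), so nothing is shared across threads.
  final class Field25519Calculator {
    private final long[] scratch = new long[2 * Field25519.LIMB_CNT - 1];

    void square(long[] out, long[] a) {
      Field25519.square(out, a, scratch);   // hypothetical overload taking scratch
    }

    void mult(long[] out, long[] a, long[] b) {
      Field25519.mult(out, a, b, scratch);  // hypothetical overload taking scratch
    }

    void sub(long[] out, long[] a, long[] b) {
      Field25519.sub(out, a, b);
    }

    void sum(long[] out, long[] a, long[] b) {
      Field25519.sum(out, a, b);
    }
  }

The call sites would then read exactly like the second snippet above, while whoever creates the calculator owns its allocation behaviour.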

tholenst avatar Oct 17 '19 08:10 tholenst

I'm surprised that such changes are actually helpful and not in fact hurtful. The current code looks like a simple case for escape analysis, e.g. https://www.cc.gatech.edu/~harrold/6340/cs6340_fall2009/Readings/choi99escape.pdf. My impression is that Java compilers generally aim for "write clean code and let the compiler take care of the rest". I don't know whether the author of the EdDSA code was running any benchmarks. It would probably make sense to compare different platforms.


bleichen avatar Oct 17 '19 11:10 bleichen

How much memory are we saving with this PR?

In production this accounted for 42GiB of allocations every five minutes, which was about 35% of allocations in our very bloated monolith.

I also considered the executor-object idea; I just got into hyper-memory-efficiency mode, but that approach is probably perfectly fine.

eeSeeGee avatar Oct 17 '19 20:10 eeSeeGee

Hi.

The long[] t0 = new long[LIMB_CNT] style lines look like they should be covered by escape analysis / scalar replacement.

https://stackoverflow.com/questions/43002528/when-can-hotspot-allocate-objects-on-the-stack https://stackoverflow.com/questions/9032519/eligibility-for-escape-analysis-stack-allocation-with-java-7

Make sure that the methods are actually getting inlined and hence eligible for EA (check with -XX:+PrintCompilation).

Make sure that JFR is not actually disabling escape analysis. Enabling sampling profilers will disable it for sure. Not sure about JFR agent running in background.

https://bugs.openjdk.java.net/browse/JDK-8227745
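One way to check empirically whether escape analysis / scalar replacement is actually eliminating these allocations (a sketch, not from this thread; it assumes a HotSpot JVM, where the ManagementFactory bean can be cast to com.sun.management.ThreadMXBean): measure per-thread allocated bytes around a warmed-up verify loop, then rerun with -XX:-DoEscapeAnalysis and compare.

  import com.google.crypto.tink.subtle.Ed25519Sign;
  import com.google.crypto.tink.subtle.Ed25519Verify;
  import java.lang.management.ManagementFactory;

  public final class EscapeAnalysisCheck {
    public static void main(String[] args) throws Exception {
      Ed25519Sign.KeyPair keyPair = Ed25519Sign.KeyPair.newKeyPair();
      Ed25519Verify verifier = new Ed25519Verify(keyPair.getPublicKey());
      byte[] msg = new byte[64];
      byte[] sig = new Ed25519Sign(keyPair.getPrivateKey()).sign(msg);

      // Warm up so the hot methods are JIT-compiled (verify with -XX:+PrintCompilation).
      for (int i = 0; i < 20_000; i++) {
        verifier.verify(sig, msg);
      }

      com.sun.management.ThreadMXBean bean =
          (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
      long tid = Thread.currentThread().getId();
      long before = bean.getThreadAllocatedBytes(tid);
      for (int i = 0; i < 10_000; i++) {
        verifier.verify(sig, msg);
      }
      long after = bean.getThreadAllocatedBytes(tid);
      // Compare this number between a default run and one with -XX:-DoEscapeAnalysis;
      // if scalar replacement is working, the default run should be much lower.
      System.out.println("bytes per verify: " + (after - before) / 10_000.0);
    }
  }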

For the change from:

  for (; i >= 0; i--) {
    doubleXYZ(t, new XYZ(t));

to:

  XYZ buf = new XYZ();
  for (; i >= 0; i--) {
    XYZ.fromPartialXYZT(buf, t);
    doubleXYZ(t, buf);

new XYZ() might be eligible for scalar replacement because the default constructor has a static final length, so the arrays may be eligible, and it looks like the XYZ doesn't escape the function. Did you try it with just that optimization?

XYZ() {
  this(new long[LIMB_CNT], new long[LIMB_CNT], new long[LIMB_CNT]);
}

blipper avatar Dec 03 '19 22:12 blipper

We ended up switching to Bouncy Castle.

eeSeeGee avatar Dec 05 '19 02:12 eeSeeGee

I'm closing this, as I don't think eeSeeGee is interested in making the changes we wanted. Also, one should check whether escape analysis works, and keep the existing code as is if it does.

tholenst avatar Apr 03 '20 15:04 tholenst

Reopening this, as I heard from Square that they're also concerned about high memory usage in Tink's Ed25519 implementation.

Thomas, how can we check whether escape analysis works?

thaidn avatar May 21 '20 19:05 thaidn

Closing this again. It has not been looked at for 2.5 years.

juergw avatar Jan 26 '23 17:01 juergw