OpenFHE op timing seems off by a factor of 10
In https://github.com/google/heir/pull/2423 I added a per-op timing table for OpenFHE ops.
bazel test -c opt --test_output=all --//:openfhe_enable_timing=1 //tests/Examples/openfhe/ckks/halevi_shoup_matvec/halevi_shoup_matvec_interpreter_test
which logs:
--- Timing Results ---
Operation Total Time (s) Total Time (%) Count Average Latency (s)
MulPlain 0.045352 25.4033 16 0.0028345
GenRotKey 0.0372787 20.8812 1 0.0372787
MakeCKKSPackedPlaintext 0.0264277 14.8032 18 0.00146821
Encrypt 0.0189263 10.6013 1 0.0189263
Rot 0.0144668 8.1034 3 0.00482227
FastRotation 0.0141603 7.93169 3 0.00472009
GenContext 0.0138492 7.75746 1 0.0138492
Add 0.00462038 2.58804 15 0.000308025
FastRotationPrecompute 0.00230724 1.29237 1 0.00230724
AddPlain 0.00113905 0.638023 1 0.00113905
The Add operation reports an average latency of 0.000308025 seconds = 308 microseconds. In https://openfhe.discourse.group/t/single-threading-performs-faster-than-multi-threading/907 the authors of OpenFHE suggest 31.5 us is a more typical latency for Add, so we're off by roughly a factor of 10. I think the other ops have similarly suspicious runtimes (e.g., MulPlain is 2.8 ms and AddPlain is 1.1 ms).
I suspect there is a build misconfiguration in our bazel setup, since we haven't ever looked hard at the "right" way to configure OpenFHE (see also https://github.com/google/heir/issues/1741).
Experimenting with the OpenFHE benchmark suite (outside of HEIR)
I can get similar timing numbers with
cmake -DCMAKE_BUILD_TYPE=Release -DWITH_NTL=ON -DWITH_TCM=OFF -DMATHBACKEND=6 -DWITH_NATIVEOPT=ON -DNATIVE_SIZE=64 -DBUILD_BENCHMARKS=ON -DBUILD_UNITTESTS=OFF -DBUILD_EXAMPLES=OFF ..
./bin/benchmarks/lib-benchmark
...
CKKSrns_Add 23.1 us 23.1 us 31595
And then with a config trying to match our bazel build:
cmake -DCMAKE_BUILD_TYPE=Release -DWITH_NTL=OFF -DWITH_TCM=OFF -DMATHBACKEND=4 -DWITH_NATIVEOPT=OFF -DNATIVE_SIZE=64 -DBUILD_BENCHMARKS=ON -DBUILD_EXAMPLES=OFF -DBUILD_UNITTESTS=OFF ..
CKKSrns_Add 28.4 us 28.4 us 25275
I don't see a huge difference here.
In https://github.com/google/heir/pull/2425 I ported the OpenFHE benchmark script to compare it with the bazel build, and it surprisingly shows good runtimes:
bazel run -c opt //benchmark:openfhe_benchmark
...
CKKSrns_Add 28.0 us 28.0 us 20403
So it's NOT the bazel build (phew!)
The crypto config seems to be part of the problem here.
The benchmark script uses only 8 slots via the SetBatchSize config (not sure if this implies there is replication happening during MakeCKKSPackedPlaintext...), and sets non-default values for the scaling mod size and scaling technique:
static CryptoContext<DCRTPoly> GenerateCKKSContext(uint32_t mdepth = 1) {
  CCParams<CryptoContextCKKSRNS> parameters;
  parameters.SetScalingModSize(48);             // Default is 50
  parameters.SetBatchSize(8);                   // slot count??
  parameters.SetScalingTechnique(FIXEDMANUAL);  // Default is FLEXIBLEAUTOEXT
  parameters.SetMultiplicativeDepth(mdepth);
  auto cc = GenCryptoContext(parameters);
  cc->Enable(PKE);
  cc->Enable(KEYSWITCH);
  cc->Enable(LEVELEDSHE);
  return cc;
}
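To poke at the replication question, here is a minimal sketch (my own, not part of the benchmark script) that prints the ring dimension and the encoded plaintext length for this context; whether GetLength comes back as the input size, the batch size, or ringDim/2 would tell us what MakeCKKSPackedPlaintext does with the 8-slot batch:

#include <iostream>
#include <vector>
#include "openfhe.h"
using namespace lbcrypto;

// Uses the GenerateCKKSContext helper from the snippet above.
int main() {
  auto cc = GenerateCKKSContext();
  // The ring dimension is picked by OpenFHE to satisfy the security
  // standard; SetBatchSize does not change it.
  std::cout << "ring dim = " << cc->GetRingDimension() << std::endl;
  std::vector<double> x = {1.0, 2.0, 3.0};
  Plaintext pt = cc->MakeCKKSPackedPlaintext(x);
  // Print the encoded length to see whether the 3-element input was
  // padded or replicated up to the batch size.
  std::cout << "encoded length = " << pt->GetLength() << std::endl;
  return 0;
}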
Meanwhile, the HEIR test in question sets only the mul depth to 1:
func.func @matvec__generate_crypto_context() -> !cc {
  %params = openfhe.gen_params {mulDepth = 1 : i64, plainMod = 0 : i64} : () -> !params
  %cc = openfhe.gen_context %params {supportFHE = false} : (!params) -> !cc
  return %cc : !cc
}

func.func @matvec__configure_crypto_context(%cc: !cc, %sk: !sk) -> !cc {
  openfhe.gen_rotkey %cc, %sk {indices = array<i64: 1, 2, 3, 4, 8, 12>} : (!cc, !sk) -> ()
  return %cc : !cc
}
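In OpenFHE terms this amounts to roughly the following (a sketch inferred from the IR above, assuming only the multiplicative depth gets set and everything else stays at the defaults linked just below):

// Sketch of the C++ analogue of gen_params/gen_context above. Everything
// except the multiplicative depth is left at the OpenFHE default, notably
// ScalingModSize = 50 and ScalingTechnique = FLEXIBLEAUTOEXT.
CCParams<CryptoContextCKKSRNS> parameters;
parameters.SetMultiplicativeDepth(1);
auto cc = GenCryptoContext(parameters);
cc->Enable(PKE);
cc->Enable(KEYSWITCH);
cc->Enable(LEVELEDSHE);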
The parameter defaults are listed at https://github.com/openfheorg/openfhe-development/blob/main/src/pke/include/scheme/gen-cryptocontext-params-defaults.h#L62
With the benchmark options, add takes 30-40 us. Without those options set (not setting batch size at all), I get 109 us, which is still not the 300 us we see in HEIR, but closer to the core of the problem here...
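For reference, this measurement can be reproduced without Google Benchmark; here is a minimal chrono-based timing loop (my own sketch, assuming the GenerateCKKSContext helper from the benchmark snippet above):

#include <chrono>
#include <iostream>
#include <vector>
#include "openfhe.h"
using namespace lbcrypto;

int main() {
  auto cc = GenerateCKKSContext();  // helper from the benchmark snippet
  auto keys = cc->KeyGen();
  std::vector<double> x(8, 1.0);
  Plaintext pt = cc->MakeCKKSPackedPlaintext(x);
  auto ct1 = cc->Encrypt(keys.publicKey, pt);
  auto ct2 = cc->Encrypt(keys.publicKey, pt);

  // Time a batch of adds and report the average latency.
  constexpr int kIters = 1000;
  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < kIters; ++i) {
    auto sum = cc->EvalAdd(ct1, ct2);
  }
  auto end = std::chrono::high_resolution_clock::now();
  auto totalUs =
      std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
  std::cout << (totalUs / kIters) << " us per EvalAdd" << std::endl;
  return 0;
}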
CC @ZenithalHourlyRate as I think you may have some insights here, at least as to the correct values to set for OpenFHE performance. I think my next step is to dig into how the params are being set for this test.
So I think I have figured out the core of the issue here:
By default OpenFHE uses FLEXIBLEAUTOEXT scaling, which at this depth requires a minimum ring dimension of 16384, whereas these benchmarks use FIXEDMANUAL scaling, which silently makes the underlying ring dimension 8192.
For ring dim = 16384, FIXEDMANUAL gives ~60-70 us per add, while FLEXIBLEAUTOEXT gives ~110-120 us.
For ring dim = 8192 (you have to disable the security standard in OpenFHE to make this work), FIXEDMANUAL gives ~30-40 us per add, while FLEXIBLEAUTOEXT gives ~50-60 us.
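For concreteness, here is a sketch of how to set this up (MakeContext is my own hypothetical helper; SetSecurityLevel(HEStd_NotSet) plus SetRingDim is what "disable the security standard" means above):

// Sketch: build a depth-1 CKKS context with a given scaling technique,
// optionally forcing the ring dimension past the security standard.
CryptoContext<DCRTPoly> MakeContext(ScalingTechnique tech, uint32_t forceN = 0) {
  CCParams<CryptoContextCKKSRNS> parameters;
  parameters.SetMultiplicativeDepth(1);
  parameters.SetScalingTechnique(tech);
  if (forceN != 0) {
    parameters.SetSecurityLevel(HEStd_NotSet);  // disable the security standard
    parameters.SetRingDim(forceN);
  }
  auto cc = GenCryptoContext(parameters);
  cc->Enable(PKE);
  cc->Enable(KEYSWITCH);
  cc->Enable(LEVELEDSHE);
  return cc;
}

// Observed above:
//   MakeContext(FIXEDMANUAL)->GetRingDimension()           == 8192
//   MakeContext(FLEXIBLEAUTOEXT)->GetRingDimension()        == 16384
//   MakeContext(FLEXIBLEAUTOEXT, 8192)->GetRingDimension()  == 8192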
So the doubled ring dimension doubles the runtime (as expected), and the scaling method seems to also roughly double the runtime. The latter is the part that doesn't make sense to me (why would the scaling method affect addition performance so drastically?), but I did find this, which suggests there is something happening with regard to a "larger" scaling factor: https://github.com/openfheorg/openfhe-development/blob/aa391988d354d4360f390f223a90e0d1b98839d7/src/pke/lib/scheme/ckksrns/ckksrns-leveledshe.cpp#L273
Worth pointing out a stray TODO that was not logged as part of #1145: https://github.com/google/heir/blob/f7c321d866dc9a6909b673b6da8d77decb373e89/lib/Pipelines/ArithmeticPipelineRegistration.cpp#L449
It basically seems that HEIR currently has no way to pick the scaling technique from the IR, and leaves it as the default.
In FLEXIBLEAUTOEXT there is one more modulus than in FIXEDMANUAL, so the ring dimension indeed might need to be larger. https://eprint.iacr.org/2022/915 has some discussion, but not in detail.
The case of mulDepth = 1 is quite fragile. Maybe you should choose a benchmark with a larger mulDepth so that ringDim is fixed at N = 2^16, which is also the value papers usually benchmark against.
It is weird: going back to the interpreter test that started this, with all else fixed (and 4096 slots), the performance of mul_plain between FLEXIBLEAUTOEXT and FIXEDMANUAL differs by 20x (2.5 ms vs 124 us, respectively)!
I will try a higher mul depth, but it just seems so strange.
In the case of mul_plain, there is a procedure called AdjustCiphertextForAdd/Mul that takes some extra time.
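If that is right, the user-visible difference would look something like this (a sketch of the two usage patterns, not of OpenFHE internals):

// With FIXEDMANUAL, the caller rescales explicitly, so the rescale cost
// shows up as its own operation rather than inside the timed mul:
auto prod = cc->EvalMult(ct, pt);
prod = cc->Rescale(prod);  // explicit, timed separately

// With FLEXIBLEAUTO(EXT), OpenFHE adjusts levels and scaling factors
// inside EvalAdd/EvalMult as needed, so that adjustment cost lands
// inside the timed op:
auto prod2 = cc->EvalMult(ct, pt);  // may rescale/adjust internally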
In FLEXIBLEAUTOEXT, if it is the first mul_plain, there is a rescaling right before the mul, which might also be recorded in the benchmark timing.
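That should be easy to check by timing the first mul_plain on a fresh ciphertext separately from a subsequent one, e.g. with a fragment like this (a sketch assuming a FLEXIBLEAUTOEXT context cc with mulDepth >= 2, keys, a fresh ciphertext ct1, and a plaintext pt; TimeUs is a hypothetical helper):

// Sketch: compare the first mul_plain on a fresh ciphertext against a
// subsequent one; under FLEXIBLEAUTOEXT the first call should absorb
// the extra rescale. Requires <chrono> and <iostream>.
auto TimeUs = [](auto&& f) {
  auto t0 = std::chrono::high_resolution_clock::now();
  f();
  auto t1 = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
};
Ciphertext<DCRTPoly> ct2;
std::cout << TimeUs([&] { ct2 = cc->EvalMult(ct1, pt); }) << " us (first mul_plain)\n";
std::cout << TimeUs([&] { ct2 = cc->EvalMult(ct2, pt); }) << " us (subsequent mul_plain)\n";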