OpenFHE op timing seems off by a factor of 10
In https://github.com/google/heir/pull/2423 I added a per-op timing table for OpenFHE ops.
bazel test -c opt --test_output=all --//:openfhe_enable_timing=1 //tests/Examples/openfhe/ckks/halevi_shoup_matvec/halevi_shoup_matvec_interpreter_test
which logs:
--- Timing Results ---
Operation Total Time (s) Total Time (%) Count Average Latency (s)
MulPlain 0.045352 25.4033 16 0.0028345
GenRotKey 0.0372787 20.8812 1 0.0372787
MakeCKKSPackedPlaintext 0.0264277 14.8032 18 0.00146821
Encrypt 0.0189263 10.6013 1 0.0189263
Rot 0.0144668 8.1034 3 0.00482227
FastRotation 0.0141603 7.93169 3 0.00472009
GenContext 0.0138492 7.75746 1 0.0138492
Add 0.00462038 2.58804 15 0.000308025
FastRotationPrecompute 0.00230724 1.29237 1 0.00230724
AddPlain 0.00113905 0.638023 1 0.00113905
The Add operation reports an average latency of 0.000308025 seconds = 308 microseconds. In https://openfhe.discourse.group/t/single-threading-performs-faster-than-multi-threading/907 the authors of OpenFHE suggest 31.5 us is a more typical latency for Add, so we're off by roughly a factor of 10. I think the other ops have similarly suspicious runtimes (e.g., MulPlain is 2.8 ms and AddPlain is 1.1 ms).
I suspect there is a build misconfiguration in our bazel setup, since we haven't ever looked hard at the "right" way to configure OpenFHE (see also https://github.com/google/heir/issues/1741).
Experimenting with the OpenFHE benchmark suite (outside of HEIR)
I can get similar timing numbers with
cmake -DCMAKE_BUILD_TYPE=Release -DWITH_NTL=ON -DWITH_TCM=OFF -DMATHBACKEND=6 -DWITH_NATIVEOPT=ON -DNATIVE_SIZE=64 -DBUILD_BENCHMARKS=ON -DBUILD_UNITTESTS=OFF -DBUILD_EXAMPLES=OFF ..
./bin/benchmarks/lib-benchmark
...
CKKSrns_Add 23.1 us 23.1 us 31595
And then with a config trying to match our bazel build:
cmake -DCMAKE_BUILD_TYPE=Release -DWITH_NTL=OFF -DWITH_TCM=OFF -DMATHBACKEND=4 -DWITH_NATIVEOPT=OFF -DNATIVE_SIZE=64 -DBUILD_BENCHMARKS=ON -DBUILD_EXAMPLES=OFF -DBUILD_UNITTESTS=OFF ..
CKKSrns_Add 28.4 us 28.4 us 25275
I don't see a huge difference here.
In https://github.com/google/heir/pull/2425 I ported the OpenFHE benchmark script to compare it with the bazel build, and it surprisingly shows good runtimes:
bazel run -c opt //benchmark:openfhe_benchmark
...
CKKSrns_Add 28.0 us 28.0 us 20403
So it's NOT the bazel build (phew!)
The crypto config seems to be part of the problem here.
The benchmark script uses only 8 slots via the SetBatchSize config (not sure if this implies there is replication happening during MakeCKKSPackedPlaintext...), and sets non-default values for the scaling mod size and scaling technique:
static CryptoContext<DCRTPoly> GenerateCKKSContext(uint32_t mdepth = 1) {
  CCParams<CryptoContextCKKSRNS> parameters;
  parameters.SetScalingModSize(48);             // Default is 50
  parameters.SetBatchSize(8);                   // slot count??
  parameters.SetScalingTechnique(FIXEDMANUAL);  // Default is FLEXIBLEAUTOEXT
  parameters.SetMultiplicativeDepth(mdepth);
  auto cc = GenCryptoContext(parameters);
  cc->Enable(PKE);
  cc->Enable(KEYSWITCH);
  cc->Enable(LEVELEDSHE);
  return cc;
}
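To poke at the replication question, here is a minimal sketch (my own, not part of the benchmark script) that prints the ring dimension and the encoded plaintext length for this context; whether GetLength comes back as the input size, the batch size, or ringDim/2 would tell us what MakeCKKSPackedPlaintext does with the 8-slot batch:

#include <iostream>
#include <vector>
#include "openfhe.h"
using namespace lbcrypto;

// Uses the GenerateCKKSContext helper from the snippet above.
int main() {
  auto cc = GenerateCKKSContext();
  // The ring dimension is picked by OpenFHE to satisfy the security
  // standard; SetBatchSize does not change it.
  std::cout << "ring dim = " << cc->GetRingDimension() << std::endl;
  std::vector<double> x = {1.0, 2.0, 3.0};
  Plaintext pt = cc->MakeCKKSPackedPlaintext(x);
  // Print the encoded length to see whether the 3-element input was
  // padded or replicated up to the batch size.
  std::cout << "encoded length = " << pt->GetLength() << std::endl;
  return 0;
}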
Meanwhile, the HEIR test in question sets only the mul depth to 1:
func.func @matvec__generate_crypto_context() -> !cc {
  %params = openfhe.gen_params {mulDepth = 1 : i64, plainMod = 0 : i64} : () -> !params
  %cc = openfhe.gen_context %params {supportFHE = false} : (!params) -> !cc
  return %cc : !cc
}

func.func @matvec__configure_crypto_context(%cc: !cc, %sk: !sk) -> !cc {
  openfhe.gen_rotkey %cc, %sk {indices = array<i64: 1, 2, 3, 4, 8, 12>} : (!cc, !sk) -> ()
  return %cc : !cc
}
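In OpenFHE terms this amounts to roughly the following (a sketch inferred from the IR above, assuming only the multiplicative depth gets set and everything else stays at the defaults linked just below):

// Sketch of the C++ analogue of gen_params/gen_context above. Everything
// except the multiplicative depth is left at the OpenFHE default, notably
// ScalingModSize = 50 and ScalingTechnique = FLEXIBLEAUTOEXT.
CCParams<CryptoContextCKKSRNS> parameters;
parameters.SetMultiplicativeDepth(1);
auto cc = GenCryptoContext(parameters);
cc->Enable(PKE);
cc->Enable(KEYSWITCH);
cc->Enable(LEVELEDSHE);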
The parameter defaults are listed at https://github.com/openfheorg/openfhe-development/blob/main/src/pke/include/scheme/gen-cryptocontext-params-defaults.h#L62
With the benchmark options, add takes 30-40 us. Without those options set (not setting batch size at all), I get 109 us, which is still not the 300 us we see in HEIR, but closer to the core of the problem here...
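For reference, this measurement can be reproduced without Google Benchmark; here is a minimal chrono-based timing loop (my own sketch, assuming the GenerateCKKSContext helper from the benchmark snippet above):

#include <chrono>
#include <iostream>
#include <vector>
#include "openfhe.h"
using namespace lbcrypto;

int main() {
  auto cc = GenerateCKKSContext();  // helper from the benchmark snippet
  auto keys = cc->KeyGen();
  std::vector<double> x(8, 1.0);
  Plaintext pt = cc->MakeCKKSPackedPlaintext(x);
  auto ct1 = cc->Encrypt(keys.publicKey, pt);
  auto ct2 = cc->Encrypt(keys.publicKey, pt);

  // Time a batch of adds and report the average latency.
  constexpr int kIters = 1000;
  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < kIters; ++i) {
    auto sum = cc->EvalAdd(ct1, ct2);
  }
  auto end = std::chrono::high_resolution_clock::now();
  auto totalUs =
      std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
  std::cout << (totalUs / kIters) << " us per EvalAdd" << std::endl;
  return 0;
}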
CC @ZenithalHourlyRate as I think you may have some insights here, at least as to the correct values to set for OpenFHE performance. I think my next step is to dig into how the params are being set for this test.
So I think I have figured out the core of the issue here:
By default OpenFHE uses FLEXIBLEAUTOEXT scaling, which at this depth requires a minimum ring dimension of 16384, whereas these benchmarks use FIXEDMANUAL scaling, which silently makes the underlying ring dimension 8192.
For ring dim = 16384, FIXEDMANUAL gives ~60-70 us per add, while FLEXIBLEAUTOEXT gives ~110-120 us.
For ring dim = 8192 (you have to disable the security standard in OpenFHE to make this work), FIXEDMANUAL gives ~30-40 us per add, while FLEXIBLEAUTOEXT gives ~50-60 us.
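For concreteness, here is a sketch of how to set this up (MakeContext is my own hypothetical helper; SetSecurityLevel(HEStd_NotSet) plus SetRingDim is what "disable the security standard" means above):

// Sketch: build a depth-1 CKKS context with a given scaling technique,
// optionally forcing the ring dimension past the security standard.
CryptoContext<DCRTPoly> MakeContext(ScalingTechnique tech, uint32_t forceN = 0) {
  CCParams<CryptoContextCKKSRNS> parameters;
  parameters.SetMultiplicativeDepth(1);
  parameters.SetScalingTechnique(tech);
  if (forceN != 0) {
    parameters.SetSecurityLevel(HEStd_NotSet);  // disable the security standard
    parameters.SetRingDim(forceN);
  }
  auto cc = GenCryptoContext(parameters);
  cc->Enable(PKE);
  cc->Enable(KEYSWITCH);
  cc->Enable(LEVELEDSHE);
  return cc;
}

// Observed above:
//   MakeContext(FIXEDMANUAL)->GetRingDimension()           == 8192
//   MakeContext(FLEXIBLEAUTOEXT)->GetRingDimension()        == 16384
//   MakeContext(FLEXIBLEAUTOEXT, 8192)->GetRingDimension()  == 8192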
So the doubled ring dimension doubles the runtime (as expected), and the scaling method seems to also roughly double the runtime. The latter is the part that doesn't make sense to me (why would the scaling method affect addition performance so drastically?), but I did find this, which suggests there is something happening with regard to a "larger" scaling factor: https://github.com/openfheorg/openfhe-development/blob/aa391988d354d4360f390f223a90e0d1b98839d7/src/pke/lib/scheme/ckksrns/ckksrns-leveledshe.cpp#L273
Worth pointing out a stray TODO that was not logged as part of #1145: https://github.com/google/heir/blob/f7c321d866dc9a6909b673b6da8d77decb373e89/lib/Pipelines/ArithmeticPipelineRegistration.cpp#L449
It basically seems that HEIR currently has no way to pick the scaling technique from the IR, and leaves it as the default.
In FLEXIBLEAUTOEXT there is one more modulus than in FIXEDMANUAL, so the ring dimension indeed might need to be larger. https://eprint.iacr.org/2022/915 has some discussion, but not in detail.
The case of mulDepth = 1 is quite fragile. Maybe you should choose a benchmark with a larger mulDepth so that ringDim is fixed at N = 2^16, which is also the value papers usually benchmark against.
It is weird: going back to the interpreter test that started this, with all else fixed (and 4096 slots), the performance of mul_plain between FLEXIBLEAUTOEXT and FIXEDMANUAL differs by 20x (2.5 ms vs 124 us, respectively)!
I will try a higher mul depth, but it just seems so strange.
In the case of mul_plain, there is a procedure called AdjustCiphertextForAdd/Mul that takes some extra time.
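If that is right, the user-visible difference would look something like this (a sketch of the two usage patterns, not of OpenFHE internals):

// With FIXEDMANUAL, the caller rescales explicitly, so the rescale cost
// shows up as its own operation rather than inside the timed mul:
auto prod = cc->EvalMult(ct, pt);
prod = cc->Rescale(prod);  // explicit, timed separately

// With FLEXIBLEAUTO(EXT), OpenFHE adjusts levels and scaling factors
// inside EvalAdd/EvalMult as needed, so that adjustment cost lands
// inside the timed op:
auto prod2 = cc->EvalMult(ct, pt);  // may rescale/adjust internally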
In FLEXIBLEAUTOEXT, if it is the first mul_plain, there is a rescaling right before the mul, which might also be recorded in the benchmark timing.
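That should be easy to check by timing the first mul_plain on a fresh ciphertext separately from a subsequent one, e.g. with a fragment like this (a sketch assuming a FLEXIBLEAUTOEXT context cc with mulDepth >= 2, keys, a fresh ciphertext ct1, and a plaintext pt; TimeUs is a hypothetical helper):

// Sketch: compare the first mul_plain on a fresh ciphertext against a
// subsequent one; under FLEXIBLEAUTOEXT the first call should absorb
// the extra rescale. Requires <chrono> and <iostream>.
auto TimeUs = [](auto&& f) {
  auto t0 = std::chrono::high_resolution_clock::now();
  f();
  auto t1 = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
};
Ciphertext<DCRTPoly> ct2;
std::cout << TimeUs([&] { ct2 = cc->EvalMult(ct1, pt); }) << " us (first mul_plain)\n";
std::cout << TimeUs([&] { ct2 = cc->EvalMult(ct2, pt); }) << " us (subsequent mul_plain)\n";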