Intel HEXL Support
Hi, thanks for the library. This is a neat project. It's the only Python wrapper for SEAL I'm aware of that keeps up to date with the latest SEAL releases.
SEAL v3.6.3 adds support for Intel HEXL (https://github.com/intel/hexl), an AVX512 acceleration library. I'm wondering if you've had a chance to try SEAL's HEXL support? See the Intel HE Toolkit whitepaper at https://software.intel.com/content/www/us/en/develop/tools/homomorphic-encryption.html?wapkw=homomorphic%20encryption for an idea of the performance improvement. I'm happy to take any feedback on HEXL as well (I'm one of the developers).
Hi @fboemer
Thank you for the wonderful contributions to Intel HEXL. I saw from SEAL's issue tracker that it adds an impressive speedup.
The latest TenSEAL release includes SEAL 3.6.3, as you mentioned, and it also enables the SEAL_USE_INTEL_HEXL flag.
Unfortunately, we don't have hardware that supports AVX512 to measure the real improvement. We can see a speedup in the tests/benchmarks both locally and on GitHub runners, but to be sure, we want to test in an isolated container and on proper hardware. We will run the benchmarks in the next few days and upload them here.
Thank you!
Hello @fboemer
We did the benchmarks. Since the SEAL benchmarks are pretty clear, we mostly focused on ML operations.
We ran the tests on a couple of AWS instances, comparing the old version vs. the HEXL version, and older hardware vs. hardware with AVX512 support.
TenSEAL 0.3.0 is the older version, with SEAL 3.6.2. TenSEAL 0.3.1 is the newer version, with SEAL 3.6.3 and HEXL enabled.
Compiler: Clang 10. OS: Ubuntu 20.04. Benchmarks were done with pytest-benchmark, with the median over 5 iterations reported here. The code for the benchmarks is here.
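As a rough illustration, here is a minimal sketch of what one of these pytest-benchmark tests can look like, assuming TenSEAL's public CKKS API; the encryption parameters are illustrative and may differ from those in the linked benchmark code.

```python
# Minimal pytest-benchmark sketch for one CKKS operation (illustrative
# parameters; not necessarily those used in the reported benchmarks).
import pytest
import tenseal as ts


@pytest.fixture
def context():
    # CKKS context; poly_modulus_degree=16384 gives 8192 slots,
    # matching the 8192-element tensors in the tables below.
    ctx = ts.context(
        ts.SCHEME_TYPE.CKKS,
        poly_modulus_degree=16384,
        coeff_mod_bit_sizes=[60, 40, 40, 60],
    )
    ctx.global_scale = 2**40
    ctx.generate_galois_keys()
    return ctx


def test_ckks_multiply(benchmark, context):
    v1 = ts.ckks_vector(context, [0.1] * 8192)
    v2 = ts.ckks_vector(context, [0.2] * 8192)
    # pytest-benchmark runs the callable repeatedly and reports
    # statistics; the tables below use the median over 5 iterations.
    benchmark(lambda: v1 * v2)
```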
Bottom line
The results are impressive on AVX512-compatible hardware: the MNIST full evaluation is 26%-34% faster, depending on the parallelism. There does seem to be some impact on older CPUs, though: the MNIST full evaluation is about 20% slower there.
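For reference, these percentages follow directly from the "MNIST eval full" rows of the tables below:

```python
# Relative change computed from the "MNIST eval full" medians (ms).
print(1 - 799.0 / 1229.2)    # ~0.35 faster on c5.2xlarge (AVX512, 8 CPUs)
print(1 - 526.74 / 712.69)   # ~0.26 faster on c5.4xlarge (AVX512, 16 CPUs)
print(1770.5 / 1460.13 - 1)  # ~0.21 slower on c4.2xlarge (no AVX512)
```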
If you have any feedback on the benchmarks, please let us know. One thing I noticed that is not clear to me (sorry for the noob question): compiling the library on older hardware seems to affect performance on newer hardware. I only got the performance improvement after compiling the library on AVX512-compatible hardware. Is this expected?
Results
I will break the benchmarks down by hardware.
c4.2xlarge
Specs: Intel Xeon E5-2666 v3, 8 CPUs, no AVX512 support
We notice that on unsupported hardware, there is some impact on performance.
Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
---|---|---|
CKKS convolution. Image shape 8x8 | 58.1 | 74.63 |
CKKS convolution. Image shape 16x16 | 58.15 | 74.55 |
CKKS convolution. Image shape 28x28 | 58.43 | 74.61 |
Generate keys | 949.97 | 847.33 |
mnist_prepare_input | 9.7 | 10.52 |
MNIST eval conv | 236.32 | 291.69 |
MNIST eval square1 | 8.24 | 10.63 |
MNIST eval fc1 | 1094.3 | 1324.82 |
MNIST eval square2 | 4.18 | 5.52 |
MNIST eval fc2 | 116.83 | 138.31 |
MNIST eval full | 1460.13 | 1770.5 |
Tensor length | Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
---|---|---|---|
8192 | CKKS add | 0.15 | 0.15 |
8192 | CKKS multiply | 8.63 | 11.34 |
8192 | CKKS negate | 0.13 | 0.14 |
8192 | CKKS square | 8.36 | 10.94 |
8192 | CKKS sub | 0.15 | 0.15 |
8192 | CKKS dot | 54.79 | 71.03 |
8192 | CKKS polyval | 20.47 | 24.23 |
16384 | CKKS add | 0.29 | 0.29 |
16384 | CKKS multiply | 17.24 | 22.69 |
16384 | CKKS negate | 0.25 | 0.28 |
16384 | CKKS square | 16.69 | 21.87 |
16384 | CKKS sub | 0.28 | 0.3 |
16384 | CKKS dot | 110.28 | 142.59 |
16384 | CKKS polyval | 41.04 | 48.87 |
c4.4xlarge
Specs: Intel Xeon E5-2666 v3, 16 CPUs, no AVX512 support.
We redid the test with more CPUs to confirm the impact.
Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
---|---|---|
Generate keys | 918.42 | 848.57 |
mnist_prepare_input | 9.72 | 10.56 |
MNIST eval conv | 234.87 | 292.02 |
MNIST eval square1 | 8.28 | 10.66 |
MNIST eval fc1 | 575.51 | 693.78 |
MNIST eval square2 | 4.17 | 5.54 |
MNIST eval fc2 | 68.58 | 80.9 |
MNIST eval full | 877.32 | 1060.43 |
CKKS convolution. Image shape 8x8 | 57.96 | 74.6 |
CKKS convolution. Image shape 16x16 | 58.07 | 74.58 |
CKKS convolution. Image shape 28x28 | 58.56 | 74.65 |
Tensor length | Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
---|---|---|---|
8192 | CKKS add | 0.16 | 0.15 |
8192 | CKKS multiply | 8.64 | 11.28 |
8192 | CKKS negate | 0.13 | 0.14 |
8192 | CKKS square | 8.36 | 10.91 |
8192 | CKKS sub | 0.15 | 0.15 |
8192 | CKKS dot | 54.93 | 70.45 |
8192 | CKKS polyval | 20.49 | 24.16 |
16384 | CKKS add | 0.31 | 0.28 |
16384 | CKKS multiply | 17.28 | 22.56 |
16384 | CKKS negate | 0.26 | 0.27 |
16384 | CKKS square | 16.72 | 21.8 |
16384 | CKKS sub | 0.29 | 0.29 |
16384 | CKKS dot | 110.11 | 141.34 |
16384 | CKKS polyval | 41.07 | 48.34 |
However, when we switch to hardware that supports AVX512, we can see a major improvement.
c5.2xlarge
Specs: Intel Xeon Platinum 8275CL, 8 CPUs, with AVX512 support
Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
---|---|---|
Generate keys | 819.0 | 633.85 |
mnist_prepare_input | 8.73 | 5.21 |
MNIST eval conv | 195.28 | 130.53 |
MNIST eval square1 | 6.86 | 4.38 |
MNIST eval fc1 | 923.04 | 587.95 |
MNIST eval square2 | 3.46 | 2.25 |
MNIST eval fc2 | 99.84 | 63.25 |
MNIST eval full | 1229.2 | 799.0 |
CKKS convolution. Image shape 8x8 | 48.77 | 34.04 |
CKKS convolution. Image shape 16x16 | 48.8 | 33.69 |
CKKS convolution. Image shape 28x28 | 49.2 | 33.21 |
Tensor length | Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
---|---|---|---|
8192 | CKKS add | 0.13 | 0.11 |
8192 | CKKS multiply | 7.2 | 4.6 |
8192 | CKKS negate | 0.1 | 0.11 |
8192 | CKKS square | 6.94 | 4.6 |
8192 | CKKS sub | 0.12 | 0.13 |
8192 | CKKS dot | 45.85 | 31.15 |
8192 | CKKS polyval | 17.8 | 11.53 |
16384 | CKKS add | 0.25 | 0.23 |
16384 | CKKS multiply | 14.5 | 9.26 |
16384 | CKKS negate | 0.2 | 0.24 |
16384 | CKKS square | 13.98 | 9.09 |
16384 | CKKS sub | 0.25 | 0.26 |
16384 | CKKS dot | 92.0 | 62.68 |
16384 | CKKS polyval | 36.0 | 23.49 |
c5.4xlarge
Specs: Intel Xeon Platinum 8275CL, 16 CPUs, with AVX512 support
Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
---|---|---|
Generate keys | 781.07 | 625.9 |
mnist_prepare_input | 8.44 | 5.77 |
MNIST eval conv | 186.44 | 143.01 |
MNIST eval square1 | 6.47 | 4.79 |
MNIST eval fc1 | 451.27 | 337.13 |
MNIST eval square2 | 3.29 | 2.47 |
MNIST eval fc2 | 55.79 | 40.46 |
MNIST eval full | 712.69 | 526.74 |
CKKS convolution. Image shape 8x8 | 46.24 | 36.43 |
CKKS convolution. Image shape 16x16 | 46.25 | 37.11 |
CKKS convolution. Image shape 28x28 | 46.91 | 37.05 |
Tensor length | Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
---|---|---|---|
8192 | CKKS add | 0.12 | 0.11 |
8192 | CKKS multiply | 6.84 | 5.18 |
8192 | CKKS negate | 0.1 | 0.11 |
8192 | CKKS square | 6.62 | 5.04 |
8192 | CKKS sub | 0.12 | 0.13 |
8192 | CKKS dot | 43.8 | 34.91 |
8192 | CKKS polyval | 16.81 | 12.8 |
16384 | CKKS add | 0.26 | 0.24 |
16384 | CKKS multiply | 13.78 | 10.41 |
16384 | CKKS negate | 0.2 | 0.23 |
16384 | CKKS square | 13.4 | 10.18 |
16384 | CKKS sub | 0.25 | 0.26 |
16384 | CKKS dot | 87.5 | 69.48 |
16384 | CKKS polyval | 33.95 | 25.74 |
@bcebere, thanks for the detailed report! Our current HEXL implementation/integration has focused on improving performance on AVX512-enabled machines. In particular, the recent Intel processors with the AVX512-IFMA52 instruction set (Ice Lake server, Ice Lake client) should yield up to an additional ~2x speedup over the Cascade Lake servers you tried (see the performance numbers in Tables 1-4 of https://arxiv.org/pdf/2103.16400.pdf). We'll investigate the performance regression on non-AVX512 processors; thanks for pointing this out.
Regarding the library compilation: we currently compile for AVX512 only on machines supporting the AVX512 instruction set. We'd be happy to investigate enabling AVX512 compilation on non-AVX512 machines, if that would be helpful. I imagine this would help enable AVX512-enabled tenseal package distribution?
Thank you so much for the explanations!
Regarding the compilation: we build, package, and deploy the library to PyPI using GitHub runners, and we don't have much control over the hardware we're using. Furthermore, we cannot distinguish between supported architectures at pip install time. Having a single binary for both scenarios (AVX512 and non-AVX512), compiled on any hardware, would be fantastic!
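For what it's worth, a single universal binary would probably pair with an import-time CPU check along these lines. This is a hypothetical sketch (Linux-only, standard library only), not TenSEAL's actual packaging code, and the module names in the final comment are made up:

```python
# Hypothetical sketch: detect AVX512 at import time so that one wheel
# could pick between an AVX512 and a generic native extension.
def cpu_supports_avx512() -> bool:
    """Best-effort check for the AVX512 foundation flag on Linux."""
    try:
        with open("/proc/cpuinfo") as f:
            return "avx512f" in f.read()
    except OSError:
        return False


if __name__ == "__main__":
    # A wheel shipping both native extensions could dispatch here, e.g.
    # importing a hypothetical _tenseal_avx512 or _tenseal_generic module.
    print("AVX512 available:", cpu_supports_avx512())
```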