Update speed tests to measure GPU performance for cuPQC code
Discussed in https://github.com/orgs/open-quantum-safe/discussions/2076
Originally posted by lakshya-chopra February 11, 2025
In the current version of liboqs, the speed_kem.c test for ML-KEM reports CPU cycle counts as its benchmark even for the GPU-based cuPQC implementation (on platforms with a GPU and where OQS_USE_CUPQC=ON). To verify this, I added debug statements in the following file to check which function gets called. To my surprise, running the speed test always invoked cuPQC's function, yet the reported benchmark results were still based on CPU cycle counts.
Build CMD:
cmake -DBUILD_SHARED_LIBS=ON -DOQS_USE_OPENSSL=OFF -DCMAKE_BUILD_TYPE=Release -DOQS_DIST_BUILD=ON \
-DOQS_USE_CUPQC=ON -DCMAKE_PREFIX_PATH=/home/master/cupqc/cupqc-pkg-0.2.0/cmake \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.6/bin/nvcc -DCMAKE_CUDA_ARCHITECTURES=86 \
-DOQS_ENABLE_KEM_ml_kem_768_cuda=ON ..
Speed comparisons
To further confirm this, I compared the speed results of Kyber768 & ML-KEM-768 (which should be similar) and got these results:
$ ./speed_kem Kyber768
Configuration info
==================
Target platform: x86_64-Linux-5.15.0-131-generic
Compiler: gcc (11.4.0)
Compile options: [-Wa,--noexecstack;-O3;-fomit-frame-pointer;-fdata-sections;-ffunction-sections;-Wl,--gc-sections;-Wbad-function-cast]
OQS version: 0.12.1-dev (major: 0, minor: 12, patch: 1, pre-release: -dev)
Git commit: 5afca642057faa54878cf6937b46fe6f00b45646
OpenSSL enabled: No
AES: NI
SHA-2: C
SHA-3: C
OQS build flags: BUILD_SHARED_LIBS OQS_DIST_BUILD OQS_LIBJADE_BUILD OQS_OPT_TARGET=generic CMAKE_BUILD_TYPE=Release
CPU exts active: ADX AES AVX AVX2 BMI1 BMI2 PCLMULQDQ POPCNT SSE SSE2 SSE3
Speed test
==========
Started at 2025-02-12 18:37:02
Operation | Iterations | Total time (s) | Time (us): mean | pop. stdev | CPU cycles: mean | pop. stdev
------------------------------------ | ----------:| --------------:| ---------------:| ----------:| -------------------------:| ----------:
Kyber768 | | | | | |
keygen | 376913 | 3.000 | 7.959 | 0.736 | 19219 | 1532
encaps | 295155 | 3.000 | 10.164 | 0.486 | 24552 | 923
decaps | 377094 | 3.000 | 7.956 | 0.527 | 19211 | 891
For ML-KEM-768:
OQS build flags: BUILD_SHARED_LIBS OQS_DIST_BUILD OQS_LIBJADE_BUILD OQS_OPT_TARGET=generic CMAKE_BUILD_TYPE=Release
CPU exts active: ADX AES AVX AVX2 BMI1 BMI2 PCLMULQDQ POPCNT SSE SSE2 SSE3
Speed test
==========
Started at 2025-02-12 18:36:45
Operation | Iterations | Total time (s) | Time (us): mean | pop. stdev | CPU cycles: mean | pop. stdev
------------------------------------ | ----------:| --------------:| ---------------:| ----------:| -------------------------:| ----------:
ML-KEM-768 | | | | | |
keygen | 18847 | 3.000 | 159.178 | 539.811 | 385029 | 1305897
encaps | 19025 | 3.000 | 157.695 | 5.361 | 381451 | 12921
decaps | 18271 | 3.000 | 164.196 | 5.137 | 397182 | 12384
Clearly, these results are far apart and do not give an accurate picture of cuPQC's GPU performance.
Feature Request
It would be beneficial if the speed test could accurately measure GPU performance when cuPQC is used.
As an example,
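one way the speed test could report GPU numbers would be to bracket each cuPQC-backed call with CUDA events instead of reading the CPU cycle counter. The helper below is only a rough, hypothetical sketch under that assumption (the names are mine, not existing liboqs or cuPQC code):

```c
/* Rough sketch: time a GPU-backed operation with CUDA events rather than
 * CPU cycles. "op" stands in for whatever cuPQC-backed keygen/encaps/decaps
 * call the speed test dispatches to. */
#include <cuda_runtime.h>
#include <stdio.h>

static float time_gpu_op_ms(void (*op)(void)) {
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);   /* mark a point just before the GPU work */
    op();
    cudaEventRecord(stop, 0);    /* mark a point just after it */
    cudaEventSynchronize(stop);  /* wait until the GPU reaches the stop event */
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

/* Placeholder for a real cuPQC call; here it just flushes pending GPU work. */
static void dummy_op(void) {
    cudaDeviceSynchronize();
}

int main(void) {
    printf("dummy op: %.3f ms\n", time_gpu_op_ms(dummy_op));
    return 0;
}
```

Reporting something like this alongside (or instead of) the CPU cycle columns would make the cuPQC results meaningful.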
If this is an actual issue, I’d be happy to help :)
@lakshya-chopra this is now (literally, in GitHub terms) an actual issue—please feel free to help if your offer still stands.
So I tried to give this a go on uWaterloo's eceubuntu, but found that CUDA 12.9 and whatever compiler was installed on it do not cooperate. (The issue was essentially the same as this: https://discuss.pytorch.org/t/pytorch-build-error/217957)
I'll probably rent a GPU server at some point and see if I can get cuPQC working.
We should be able to get resources from PQCA for this if needed.
Oh that would be awesome. I shouldn't need anything for too long - essentially just enough to get a working installation of nvcc that is compatible with the version of gcc used to build the code.
@ryjones Is there a possibility of getting access to a (small) GPU server to do some testing of the cuPQC integration?
@dstebila if it is available from EC2, yes. @mkannwischer has the most experience, and I can work with him to set it up. @SWilson4 also has access to make this happen
@aidenfoxivey what type of EC2 instance would you need?
I think anything like a G4? Nothing super fancy - just a GPU of some kind and the AMI to have a working Cuda installation.
Just the cheapest NVIDIA GPU enabled instance will be perfect!
I think you can build on https://github.com/open-quantum-safe/liboqs/blob/refs/heads/sw-benchmarks/.github/workflows/ec2_reusable.yml but @SWilson4 is the SME
FYI Spencer has moved on to a new position, so we'll need a new SME on that.
oh, dang.
That workflow there is for a CI container though, right? As in it can't run persistently?
ah OK. Please shoot me an email - [email protected] - with your public SSH key
Sent!
OK, your key is there. let me know when you're done with it
~~Perfect - just started working on this issue.~~
Seems the VM is down.
No luck today getting the OQS VM to work. I'm assuming it crashed past some point of utilization? I was unable to get the compilation past ~10%.
I then tried to use gcc-12 and nvcc-12.9 on ecetesla1.uwaterloo.ca (a 3070Ti machine from the ECE department). With the following setup, I unfortunately had no luck:
export CXX=/usr/bin/g++-12
export CC=/usr/bin/gcc-12
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
cmake -DBUILD_SHARED_LIBS=ON \
  -DOQS_USE_OPENSSL=OFF \
  -DCMAKE_BUILD_TYPE=Release \
  -DOQS_DIST_BUILD=ON \
  -DOQS_USE_CUPQC=ON \
  -DCMAKE_PREFIX_PATH=/home/abfoxive/cupqc-sdk-0.3.0-x86_64/cmake \
  -DCMAKE_CUDA_COMPILER=/usr/bin/nvcc \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DOQS_ENABLE_KEM_ml_kem_768_cuda=ON \
  -DCUDA_USE_STATIC_CUDA_RUNTIME=OFF ..
The error message was persistently:
nvlink warning : Skipping incompatible '/lib/x86_64-linux-gnu/librt.a' when searching for -lrt
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/librt.a' when searching for -lrt
nvlink warning : Skipping incompatible '/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread
nvlink warning : Skipping incompatible '/lib/x86_64-linux-gnu/libdl.a' when searching for -ldl
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libdl.a' when searching for -ldl
nvlink fatal : elfLink linker library load error
make[2]: *** [src/CMakeFiles/oqs.dir/build.make:4213: src/CMakeFiles/oqs.dir/cmake_device_link.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:1039: src/CMakeFiles/oqs.dir/all] Error 2
I’m guessing that this is a versioning mismatch? I’ll try again tomorrow.
OK, I started a new VM that is much larger, and the build completed:
$ ./speed_kem Kyber768
Configuration info
==================
Target platform: x86_64-Linux-6.1.141-165.249.amzn2023.x86_64
Compiler: gcc (11.5.0)
Compile options: [-Wa,--noexecstack;-O3;-fomit-frame-pointer;-fdata-sections;-ffunction-sections;-Wl,--gc-sections;-Wbad-function-cast]
OQS version: 0.14.1-dev (major: 0, minor: 14, patch: 1, pre-release: -dev)
Git commit: 78e23891802a8bc058ad435491f1b5aefcef092a
OpenSSL enabled: No
AES: NI
SHA-2: C
SHA-3: AVX512VL
OQS build flags: BUILD_SHARED_LIBS OQS_DIST_BUILD OQS_LIBJADE_BUILD OQS_OPT_TARGET=generic CMAKE_BUILD_TYPE=Release
CPU exts active: ADX AES AVX AVX2 AVX512 BMI1 BMI2 PCLMULQDQ POPCNT SSE SSE2 SSE3
Speed test
==========
Started at 2025-07-17 14:09:08
Operation | Iterations | Total time (s) | Time (us): mean | pop. stdev | CPU cycles: mean | pop. stdev
------------------------------------ | ----------:| --------------:| ---------------:| ----------:| -------------------------:| ----------:
Kyber768 | | | | | |
keygen | 218770 | 3.000 | 13.713 | 9.844 | 34167 | 24576
encaps | 173160 | 3.000 | 17.325 | 7.767 | 43174 | 19396
decaps | 228020 | 3.000 | 13.157 | 5.358 | 32809 | 13372
Ended at 2025-07-17 14:09:17
perfect! it's building properly for me too
Just so everyone is aware, using a t2.small instance was the problem. I switched to t3.2xlarge and it worked.
I think we'll have a few issues with that. I don't believe the T2 series of AWS VMs has an attached GPU.
I'm also not 100% sure that Amazon Linux (6.1.141-165.249.amzn2023.x86_64) is compatible with CUDA. Attempting to compile with nvcc suggests that the installed gcc is too old.
Deep Learning Base OSS Nvidia Driver GPU AMI (Amazon Linux 2023) 20250715 is the AMI I used
The issue is that nvidia-smi shows there's no attached GPU. There might be software support for NVCC, but there's no GPU to attach to.
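For completeness, whether the runtime can see a GPU at all is also easy to check from code; the following is just a minimal, standalone CUDA device query (nothing liboqs- or cuPQC-specific), reporting roughly the same information as nvidia-smi:

```c
/* Minimal sketch: ask the CUDA runtime how many devices it can see and
 * print their names and compute capabilities. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        printf("no usable CUDA device (%s)\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; i++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```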
[ec2-user@ip-172-31-17-233 ~]$ /bin/gcc14-cpp --version
gcc14-cpp (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7)
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[ec2-user@ip-172-31-17-233 ~]$ gcc --version
gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[ec2-user@ip-172-31-17-233 ~]$ which gcc
/usr/bin/gcc
Well, the AMI is fine, yes, but the actual instance is a T2 instance, which does not have a GPU on it.
These are the GPU instances: https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html
cool so this is verified - I'm just sorting out the dynamic linking detection bit