Update speed tests to measure GPU performance for cuPQC code
Discussed in https://github.com/orgs/open-quantum-safe/discussions/2076
Originally posted by lakshya-chopra February 11, 2025
In the current version of liboqs, the speed_kem.c test for ML-KEM reports CPU cycle counts as its benchmark even for the GPU-based cuPQC implementation (on platforms with a GPU and where OQS_USE_CUPQC=ON). To verify this, I added debug statements in the following file to check which function gets called. To my surprise, running the speed test always invoked cuPQC's function, yet the reported benchmark results were still based on CPU cycle counts.
Build CMD:
cmake -DBUILD_SHARED_LIBS=ON -DOQS_USE_OPENSSL=OFF -DCMAKE_BUILD_TYPE=Release -DOQS_DIST_BUILD=ON \
-DOQS_USE_CUPQC=ON -DCMAKE_PREFIX_PATH=/home/master/cupqc/cupqc-pkg-0.2.0/cmake \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.6/bin/nvcc -DCMAKE_CUDA_ARCHITECTURES=86 \
-DOQS_ENABLE_KEM_ml_kem_768_cuda=ON ..
Speed comparisons
To further confirm this, I compared the speed results of Kyber768 & ML-KEM-768 (which should be similar) and got these results:
$ ./speed_kem Kyber768
Configuration info
==================
Target platform: x86_64-Linux-5.15.0-131-generic
Compiler: gcc (11.4.0)
Compile options: [-Wa,--noexecstack;-O3;-fomit-frame-pointer;-fdata-sections;-ffunction-sections;-Wl,--gc-sections;-Wbad-function-cast]
OQS version: 0.12.1-dev (major: 0, minor: 12, patch: 1, pre-release: -dev)
Git commit: 5afca642057faa54878cf6937b46fe6f00b45646
OpenSSL enabled: No
AES: NI
SHA-2: C
SHA-3: C
OQS build flags: BUILD_SHARED_LIBS OQS_DIST_BUILD OQS_LIBJADE_BUILD OQS_OPT_TARGET=generic CMAKE_BUILD_TYPE=Release
CPU exts active: ADX AES AVX AVX2 BMI1 BMI2 PCLMULQDQ POPCNT SSE SSE2 SSE3
Speed test
==========
Started at 2025-02-12 18:37:02
Operation | Iterations | Total time (s) | Time (us): mean | pop. stdev | CPU cycles: mean | pop. stdev
------------------------------------ | ----------:| --------------:| ---------------:| ----------:| -------------------------:| ----------:
Kyber768 | | | | | |
keygen | 376913 | 3.000 | 7.959 | 0.736 | 19219 | 1532
encaps | 295155 | 3.000 | 10.164 | 0.486 | 24552 | 923
decaps | 377094 | 3.000 | 7.956 | 0.527 | 19211 | 891
For ML-KEM-768:
OQS build flags: BUILD_SHARED_LIBS OQS_DIST_BUILD OQS_LIBJADE_BUILD OQS_OPT_TARGET=generic CMAKE_BUILD_TYPE=Release
CPU exts active: ADX AES AVX AVX2 BMI1 BMI2 PCLMULQDQ POPCNT SSE SSE2 SSE3
Speed test
==========
Started at 2025-02-12 18:36:45
Operation | Iterations | Total time (s) | Time (us): mean | pop. stdev | CPU cycles: mean | pop. stdev
------------------------------------ | ----------:| --------------:| ---------------:| ----------:| -------------------------:| ----------:
ML-KEM-768 | | | | | |
keygen | 18847 | 3.000 | 159.178 | 539.811 | 385029 | 1305897
encaps | 19025 | 3.000 | 157.695 | 5.361 | 381451 | 12921
decaps | 18271 | 3.000 | 164.196 | 5.137 | 397182 | 12384
Clearly, these results are far apart and do not give an accurate picture of cuPQC's GPU performance.
Feature Request
It would be beneficial if the speed test could accurately measure GPU performance when cuPQC is used.
As an example,
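one way the speed test could report GPU numbers would be to bracket each cuPQC-backed call with CUDA events instead of reading the CPU cycle counter. The helper below is only a rough, hypothetical sketch under that assumption (the names are mine, not existing liboqs or cuPQC code):

```c
/* Rough sketch: time a GPU-backed operation with CUDA events rather than
 * CPU cycles. "op" stands in for whatever cuPQC-backed keygen/encaps/decaps
 * call the speed test dispatches to. */
#include <cuda_runtime.h>
#include <stdio.h>

static float time_gpu_op_ms(void (*op)(void)) {
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);   /* mark a point just before the GPU work */
    op();
    cudaEventRecord(stop, 0);    /* mark a point just after it */
    cudaEventSynchronize(stop);  /* wait until the GPU reaches the stop event */
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

/* Placeholder for a real cuPQC call; here it just flushes pending GPU work. */
static void dummy_op(void) {
    cudaDeviceSynchronize();
}

int main(void) {
    printf("dummy op: %.3f ms\n", time_gpu_op_ms(dummy_op));
    return 0;
}
```

Reporting something like this alongside (or instead of) the CPU cycle columns would make the cuPQC results meaningful.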
If this is an actual issue, I’d be happy to help :)
@lakshya-chopra this is now (literally, in GitHub terms) an actual issue—please feel free to help if your offer still stands.
So I tried to give this a go on uWaterloo's eceubuntu, but found that CUDA 12.9 and whatever compiler was installed on it do not cooperate. (The issue was essentially the same as this: https://discuss.pytorch.org/t/pytorch-build-error/217957)
I'll probably rent a GPU server at some point and see if I can get cuPQC working.
We should be able to get resources from PQCA for this if needed.
Oh that would be awesome. I shouldn't need anything for too long - essentially just enough to get a working installation of nvcc that is compatible with the version of gcc used to build the code.
@ryjones Is there a possibility of getting access to a (small) GPU server to do some testing of the cuPQC integration?
@dstebila if it is available from EC2, yes. @mkannwischer has the most experience, and I can work with him to set it up. @SWilson4 also has access to make this happen
@aidenfoxivey what type of EC2 instance would you need?
I think anything like a G4? Nothing super fancy - just a GPU of some kind and the AMI to have a working Cuda installation.
Just the cheapest NVIDIA GPU enabled instance will be perfect!
I think you can build on https://github.com/open-quantum-safe/liboqs/blob/refs/heads/sw-benchmarks/.github/workflows/ec2_reusable.yml but @SWilson4 is the SME
FYI Spencer has moved on to a new position, so we'll need a new SME on that.
oh, dang.
That workflow there is for a CI container though, right? As in it can't run persistently?
ah OK. Please shoot me an email - [email protected] - with your public SSH key
Sent!
OK, your key is there. let me know when you're done with it
~~Perfect - just started working on this issue.~~
Seems the VM is down.
No luck today getting the OQS VM to work. I'm assuming it crashed past some point of utilization? I was unable to get the compilation past ~10%.
I then tried to use gcc-12 and nvcc-12.9 on ecetesla1.uwaterloo.ca (a 3070Ti machine from the ECE department). With the following setup, I unfortunately had no luck:
export CXX=/usr/bin/g++-12
export CC=/usr/bin/gcc-12
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
cmake -DBUILD_SHARED_LIBS=ON \
  -DOQS_USE_OPENSSL=OFF \
  -DCMAKE_BUILD_TYPE=Release \
  -DOQS_DIST_BUILD=ON \
  -DOQS_USE_CUPQC=ON \
  -DCMAKE_PREFIX_PATH=/home/abfoxive/cupqc-sdk-0.3.0-x86_64/cmake \
  -DCMAKE_CUDA_COMPILER=/usr/bin/nvcc \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DOQS_ENABLE_KEM_ml_kem_768_cuda=ON \
  -DCUDA_USE_STATIC_CUDA_RUNTIME=OFF ..
The error message was persistently:
nvlink warning : Skipping incompatible '/lib/x86_64-linux-gnu/librt.a' when searching for -lrt
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/librt.a' when searching for -lrt
nvlink warning : Skipping incompatible '/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread
nvlink warning : Skipping incompatible '/lib/x86_64-linux-gnu/libdl.a' when searching for -ldl
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libdl.a' when searching for -ldl
nvlink fatal : elfLink linker library load error
make[2]: *** [src/CMakeFiles/oqs.dir/build.make:4213: src/CMakeFiles/oqs.dir/cmake_device_link.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:1039: src/CMakeFiles/oqs.dir/all] Error 2
I’m guessing that this is a versioning mismatch? I’ll try again tomorrow.
OK, I started a new VM that is much larger, and the build completed:
$ ./speed_kem Kyber768
Configuration info
==================
Target platform: x86_64-Linux-6.1.141-165.249.amzn2023.x86_64
Compiler: gcc (11.5.0)
Compile options: [-Wa,--noexecstack;-O3;-fomit-frame-pointer;-fdata-sections;-ffunction-sections;-Wl,--gc-sections;-Wbad-function-cast]
OQS version: 0.14.1-dev (major: 0, minor: 14, patch: 1, pre-release: -dev)
Git commit: 78e23891802a8bc058ad435491f1b5aefcef092a
OpenSSL enabled: No
AES: NI
SHA-2: C
SHA-3: AVX512VL
OQS build flags: BUILD_SHARED_LIBS OQS_DIST_BUILD OQS_LIBJADE_BUILD OQS_OPT_TARGET=generic CMAKE_BUILD_TYPE=Release
CPU exts active: ADX AES AVX AVX2 AVX512 BMI1 BMI2 PCLMULQDQ POPCNT SSE SSE2 SSE3
Speed test
==========
Started at 2025-07-17 14:09:08
Operation | Iterations | Total time (s) | Time (us): mean | pop. stdev | CPU cycles: mean | pop. stdev
------------------------------------ | ----------:| --------------:| ---------------:| ----------:| -------------------------:| ----------:
Kyber768 | | | | | |
keygen | 218770 | 3.000 | 13.713 | 9.844 | 34167 | 24576
encaps | 173160 | 3.000 | 17.325 | 7.767 | 43174 | 19396
decaps | 228020 | 3.000 | 13.157 | 5.358 | 32809 | 13372
Ended at 2025-07-17 14:09:17
perfect! it's building properly for me too
Just so everyone is aware, using a t2.small instance was the problem. I switched to t3.2xlarge and it worked.
I think we'll have a few issues with that. I don't believe the T2 series of AWS VMs has an attached GPU.
I'm also not 100% sure that Amazon Linux (6.1.141-165.249.amzn2023.x86_64) is compatible with CUDA. Attempting to compile with nvcc suggests that the installed gcc is too old.
Deep Learning Base OSS Nvidia Driver GPU AMI (Amazon Linux 2023) 20250715 is the AMI I used
The issue is that nvidia-smi shows there's no attached GPU. There might be software support for NVCC, but there's no GPU to attach to.
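For completeness, whether the runtime can see a GPU at all is also easy to check from code; the following is just a minimal, standalone CUDA device query (nothing liboqs- or cuPQC-specific), reporting roughly the same information as nvidia-smi:

```c
/* Minimal sketch: ask the CUDA runtime how many devices it can see and
 * print their names and compute capabilities. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        printf("no usable CUDA device (%s)\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; i++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```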
[ec2-user@ip-172-31-17-233 ~]$ /bin/gcc14-cpp --version
gcc14-cpp (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7)
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[ec2-user@ip-172-31-17-233 ~]$ gcc --version
gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[ec2-user@ip-172-31-17-233 ~]$ which gcc
/usr/bin/gcc
Well, the AMI is fine, yes, but the actual instance is a T2 instance, which does not have a GPU on it.
These are the GPU instances: https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html
cool so this is verified - I'm just sorting out the dynamic linking detection bit