
Update speed tests to measure GPU performance for cuPQC code

Open · dstebila opened this issue 7 months ago · 34 comments

Discussed in https://github.com/orgs/open-quantum-safe/discussions/2076

Originally posted by lakshya-chopra on February 11, 2025:

In the current version of liboqs, running the speed_kem.c test for ML-KEM reports CPU cycle counts as the benchmark even for the GPU-based cuPQC implementation (on platforms with a GPU and with OQS_USE_CUPQC=ON). To verify this, I added debug statements in the following file to check which function gets called. To my surprise, running the speed test always invoked cuPQC's function, yet the reported benchmark results were still based on CPU cycle counts.

[screenshot: debug output showing that the cuPQC function is invoked]

Build CMD:

cmake -DBUILD_SHARED_LIBS=ON \
  -DOQS_USE_OPENSSL=OFF \
  -DCMAKE_BUILD_TYPE=Release \
  -DOQS_DIST_BUILD=ON \
  -DOQS_USE_CUPQC=ON \
  -DCMAKE_PREFIX_PATH=/home/master/cupqc/cupqc-pkg-0.2.0/cmake \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.6/bin/nvcc \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DOQS_ENABLE_KEM_ml_kem_768_cuda=ON ..

Speed comparisons

To further confirm this, I compared the speed results of Kyber768 & ML-KEM-768 (which should be similar) and got these results:


$ ./speed_kem Kyber768
Configuration info
==================
Target platform:  x86_64-Linux-5.15.0-131-generic
Compiler:         gcc (11.4.0)
Compile options:  [-Wa,--noexecstack;-O3;-fomit-frame-pointer;-fdata-sections;-ffunction-sections;-Wl,--gc-sections;-Wbad-function-cast]
OQS version:      0.12.1-dev (major: 0, minor: 12, patch: 1, pre-release: -dev)
Git commit:       5afca642057faa54878cf6937b46fe6f00b45646
OpenSSL enabled:  No
AES:              NI
SHA-2:            C
SHA-3:            C
OQS build flags:  BUILD_SHARED_LIBS OQS_DIST_BUILD OQS_LIBJADE_BUILD OQS_OPT_TARGET=generic CMAKE_BUILD_TYPE=Release
CPU exts active:  ADX AES AVX AVX2 BMI1 BMI2 PCLMULQDQ POPCNT SSE SSE2 SSE3
Speed test
==========
Started at 2025-02-12 18:37:02
Operation                            | Iterations | Total time (s) | Time (us): mean | pop. stdev | CPU cycles: mean          | pop. stdev
------------------------------------ | ----------:| --------------:| ---------------:| ----------:| -------------------------:| ----------:
Kyber768                             |            |                |                 |            |                           |
keygen                               |     376913 |          3.000 |           7.959 |      0.736 |                     19219 |       1532
encaps                               |     295155 |          3.000 |          10.164 |      0.486 |                     24552 |        923
decaps                               |     377094 |          3.000 |           7.956 |      0.527 |                     19211 |        891

For ML-KEM-768:

OQS build flags:  BUILD_SHARED_LIBS OQS_DIST_BUILD OQS_LIBJADE_BUILD OQS_OPT_TARGET=generic CMAKE_BUILD_TYPE=Release
CPU exts active:  ADX AES AVX AVX2 BMI1 BMI2 PCLMULQDQ POPCNT SSE SSE2 SSE3
Speed test
==========
Started at 2025-02-12 18:36:45
Operation                            | Iterations | Total time (s) | Time (us): mean | pop. stdev | CPU cycles: mean          | pop. stdev
------------------------------------ | ----------:| --------------:| ---------------:| ----------:| -------------------------:| ----------:
ML-KEM-768                           |            |                |                 |            |                           |
keygen                               |      18847 |          3.000 |         159.178 |    539.811 |                    385029 |    1305897
encaps                               |      19025 |          3.000 |         157.695 |      5.361 |                    381451 |      12921
decaps                               |      18271 |          3.000 |         164.196 |      5.137 |                    397182 |      12384

Clearly, these results are far off and do not give an accurate picture of the GPU's performance.

Feature Request

It would be beneficial if the speed test could accurately measure GPU performance when cuPQC is used. As an example: [screenshot omitted]
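
For illustration, here is a minimal sketch of what event-based GPU timing could look like, assuming a liboqs build with OQS_USE_CUPQC=ON and linking against the CUDA runtime. It uses the public liboqs KEM API together with CUDA events; the iteration count, file layout, and synchronization shown are illustrative only and are not how the existing speed_kem harness is structured:

/*
 * Hedged sketch: time the (assumed) cuPQC-backed ML-KEM-768 keygen path with
 * CUDA events instead of CPU cycle counts. Build against liboqs and the CUDA
 * runtime (e.g. nvcc, or gcc plus -lcudart); this is illustrative, not the
 * project's actual benchmark code.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#include <oqs/oqs.h>
#include <cuda_runtime.h>

int main(void) {
    OQS_KEM *kem = OQS_KEM_new(OQS_KEM_alg_ml_kem_768);
    if (kem == NULL) {
        fprintf(stderr, "ML-KEM-768 not enabled in this build\n");
        return EXIT_FAILURE;
    }
    uint8_t *pk = malloc(kem->length_public_key);
    uint8_t *sk = malloc(kem->length_secret_key);
    if (pk == NULL || sk == NULL) {
        return EXIT_FAILURE;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iterations = 1000; /* illustrative; the real harness runs for a fixed wall-clock budget */
    cudaEventRecord(start, 0);
    for (int i = 0; i < iterations; i++) {
        OQS_KEM_keypair(kem, pk, sk); /* expected to dispatch to cuPQC when enabled */
    }
    /* Ensure any GPU work launched by the loop has completed before stopping the clock. */
    cudaDeviceSynchronize();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    printf("keygen: %.3f us/op over %d iterations\n",
           (total_ms * 1000.0f) / (float)iterations, iterations);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    free(pk);
    free(sk);
    OQS_KEM_free(kem);
    return EXIT_SUCCESS;
}

Since cuPQC is designed around batched operation, a meaningful GPU benchmark would likely also batch many operations per launch rather than timing one call at a time; the loop above only mirrors the shape of the existing per-operation test.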

If this is an actual issue, I’d be happy to help :)

dstebila avatar Jun 05 '25 14:06 dstebila

@lakshya-chopra this is now (literally, in GitHub terms) an actual issue—please feel free to help if your offer still stands.

SWilson4 avatar Jun 18 '25 18:06 SWilson4

So I tried to give this a go on uWaterloo's eceubuntu, but found that CUDA 12.9 and the compiler installed on it do not cooperate. (The issue was essentially the same as this one: https://discuss.pytorch.org/t/pytorch-build-error/217957)

aidenfoxivey avatar Jul 14 '25 22:07 aidenfoxivey

I'll probably rent a GPU server at some point and see if I can get cuPQC working.

aidenfoxivey avatar Jul 15 '25 00:07 aidenfoxivey

> I'll probably rent a GPU server at some point and see if I can get cuPQC working.

We should be able to get resources from PQCA for this if needed.

dstebila avatar Jul 15 '25 00:07 dstebila

> > I'll probably rent a GPU server at some point and see if I can get cuPQC working.
>
> We should be able to get resources from PQCA for this if needed.

Oh that would be awesome. I shouldn't need anything for too long - essentially just enough to get a working installation of nvcc that is compatible with the version of gcc used to build the code.

aidenfoxivey avatar Jul 15 '25 01:07 aidenfoxivey

@ryjones Is there a possibility of getting access to a (small) GPU server to do some testing of the cuPQC integration?

dstebila avatar Jul 15 '25 15:07 dstebila

@dstebila if it is available from EC2, yes. @mkannwischer has the most experience, and I can work with him to set it up. @SWilson4 also has access to make this happen

ryjones avatar Jul 15 '25 17:07 ryjones

@aidenfoxivey what type of EC2 instance would you need?

ryjones avatar Jul 16 '25 16:07 ryjones

I think anything like a G4? Nothing super fancy - just a GPU of some kind and an AMI with a working CUDA installation.

aidenfoxivey avatar Jul 16 '25 17:07 aidenfoxivey

Just the cheapest NVIDIA GPU-enabled instance will be perfect!

aidenfoxivey avatar Jul 16 '25 17:07 aidenfoxivey

I think you can build on https://github.com/open-quantum-safe/liboqs/blob/refs/heads/sw-benchmarks/.github/workflows/ec2_reusable.yml but @SWilson4 is the SME

ryjones avatar Jul 16 '25 17:07 ryjones

> I think you can build on https://github.com/open-quantum-safe/liboqs/blob/refs/heads/sw-benchmarks/.github/workflows/ec2_reusable.yml but @SWilson4 is the SME

FYI Spencer has moved on to a new position, so we'll need a new SME on that.

dstebila avatar Jul 16 '25 17:07 dstebila

oh, dang.

ryjones avatar Jul 16 '25 18:07 ryjones

That workflow there is for a CI container though, right? As in it can't run persistently?

aidenfoxivey avatar Jul 16 '25 18:07 aidenfoxivey

ah OK. Please shoot me an email - [email protected] - with your public SSH key

ryjones avatar Jul 16 '25 18:07 ryjones

Sent!

aidenfoxivey avatar Jul 16 '25 18:07 aidenfoxivey

> Sent!

OK, your key is there. Let me know when you're done with it.

ryjones avatar Jul 16 '25 18:07 ryjones

~Perfect - just started working on this issue.~

Seems the VM is down.

aidenfoxivey avatar Jul 16 '25 18:07 aidenfoxivey

No luck today getting the OQS VM to work. I'm assuming it crashed past some point of utilization? I was unable to get the compilation past ~10%.

I then tried to use gcc-12 and nvcc-12.9 on ecetesla1.uwaterloo.ca (a 3070Ti machine from the ECE department). With the following setup, I unfortunately had no luck:

export CXX=/usr/bin/g++-12
export CC=/usr/bin/gcc-12
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
cmake -DBUILD_SHARED_LIBS=ON \
  -DOQS_USE_OPENSSL=OFF \
  -DCMAKE_BUILD_TYPE=Release \
  -DOQS_DIST_BUILD=ON \
  -DOQS_USE_CUPQC=ON \
  -DCMAKE_PREFIX_PATH=/home/abfoxive/cupqc-sdk-0.3.0-x86_64/cmake \
  -DCMAKE_CUDA_COMPILER=/usr/bin/nvcc \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DOQS_ENABLE_KEM_ml_kem_768_cuda=ON \
  -DCUDA_USE_STATIC_CUDA_RUNTIME=OFF ..

The error message was persistently:

nvlink warning : Skipping incompatible '/lib/x86_64-linux-gnu/librt.a' when searching for -lrt
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/librt.a' when searching for -lrt
nvlink warning : Skipping incompatible '/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread
nvlink warning : Skipping incompatible '/lib/x86_64-linux-gnu/libdl.a' when searching for -ldl
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libdl.a' when searching for -ldl
nvlink fatal : elfLink linker library load error
make[2]: *** [src/CMakeFiles/oqs.dir/build.make:4213: src/CMakeFiles/oqs.dir/cmake_device_link.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:1039: src/CMakeFiles/oqs.dir/all] Error 2

I'm guessing that this is a versioning mismatch? I'll try again tomorrow.

aidenfoxivey avatar Jul 16 '25 22:07 aidenfoxivey

OK, I started a new VM that is much larger and the build completed

ryjones avatar Jul 17 '25 14:07 ryjones


$ ./speed_kem Kyber768

Configuration info
==================
Target platform:  x86_64-Linux-6.1.141-165.249.amzn2023.x86_64
Compiler:         gcc (11.5.0)
Compile options:  [-Wa,--noexecstack;-O3;-fomit-frame-pointer;-fdata-sections;-ffunction-sections;-Wl,--gc-sections;-Wbad-function-cast]
OQS version:      0.14.1-dev (major: 0, minor: 14, patch: 1, pre-release: -dev)
Git commit:       78e23891802a8bc058ad435491f1b5aefcef092a
OpenSSL enabled:  No
AES:              NI
SHA-2:            C
SHA-3:            AVX512VL
OQS build flags:  BUILD_SHARED_LIBS OQS_DIST_BUILD OQS_LIBJADE_BUILD OQS_OPT_TARGET=generic CMAKE_BUILD_TYPE=Release 
CPU exts active:  ADX AES AVX AVX2 AVX512 BMI1 BMI2 PCLMULQDQ POPCNT SSE SSE2 SSE3
Speed test
==========
Started at 2025-07-17 14:09:08
Operation                            | Iterations | Total time (s) | Time (us): mean | pop. stdev | CPU cycles: mean          | pop. stdev
------------------------------------ | ----------:| --------------:| ---------------:| ----------:| -------------------------:| ----------:
Kyber768                             |            |                |                 |            |                           |           
keygen                               |     218770 |          3.000 |          13.713 |      9.844 |                     34167 |      24576
encaps                               |     173160 |          3.000 |          17.325 |      7.767 |                     43174 |      19396
decaps                               |     228020 |          3.000 |          13.157 |      5.358 |                     32809 |      13372
Ended at 2025-07-17 14:09:17

ryjones avatar Jul 17 '25 14:07 ryjones

Perfect! It's building properly for me too.

aidenfoxivey avatar Jul 17 '25 15:07 aidenfoxivey

Just so everyone is aware, using a t2.small instance was the problem. I switched to t3.2xlarge and it worked.

ryjones avatar Jul 17 '25 15:07 ryjones

I think we'll have a few issues with that. I don't believe the T2 series of AWS VMs has an attached GPU.

I'm also not 100% sure that Amazon Linux (6.1.141-165.249.amzn2023.x86_64) is compatible with CUDA. Attempting to compile with nvcc reports that the installed gcc is too old.

aidenfoxivey avatar Jul 17 '25 16:07 aidenfoxivey

Deep Learning Base OSS Nvidia Driver GPU AMI (Amazon Linux 2023) 20250715 is the AMI I used

ryjones avatar Jul 17 '25 16:07 ryjones

The issue is that nvidia-smi shows there's no attached GPU. There might be software support for nvcc, but there's no GPU to attach to.

aidenfoxivey avatar Jul 17 '25 16:07 aidenfoxivey

[ec2-user@ip-172-31-17-233 ~]$ /bin/gcc14-cpp --version
gcc14-cpp (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7)
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[ec2-user@ip-172-31-17-233 ~]$ gcc --version
gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[ec2-user@ip-172-31-17-233 ~]$ which gcc
/usr/bin/gcc

ryjones avatar Jul 17 '25 16:07 ryjones

I'm looking here

ryjones avatar Jul 17 '25 16:07 ryjones

Well, the AMI is fine, yes, but the actual instance is a T2 instance, which does not have a GPU on it.

These are the GPU instances: https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html

aidenfoxivey avatar Jul 17 '25 17:07 aidenfoxivey

[screenshot omitted]

Cool, so this is verified - I'm just sorting out the dynamic linking detection bit.

aidenfoxivey avatar Jul 17 '25 19:07 aidenfoxivey