Python kernel_builder qalloc call very slow for numpy arrays with 25+ qubits
Required prerequisites
- [x] Consult the security policy. If reporting a security vulnerability, do not report the bug using this form. Use the process described in the policy to report the issue.
- [x] Make sure you've read the documentation. Your issue may be addressed there.
- [x] Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
- [x] If possible, make a PR with a failing test to give us a starting point to work on!
Describe the bug
I have been simulating a circuit with a cudaq kernel that is allocated with an initial state vector, and I am trying to debug a major performance drop around 25 qubits.
I noticed that the kernel.qalloc call validates the norm of the input state using a Python list comprehension. This becomes dramatically expensive around 25 qubits; when tested locally it takes about 30 s for 26 qubits and accounts for almost all of the qalloc run time.
The check sits inside an isinstance branch that already knows the input is a numpy ndarray, so a more efficient solution for this case seems possible. For example, replacing the list comprehension with np.sum(np.conj(initializer) * initializer) reduces the time to under 1 s.
This matters because numpy arrays have been the most efficient way I've found to initialize large states when running cudaq with multiple GPUs.
Steps to reproduce the bug
import numpy as np
import time

for n in range(20, 27):
    initializer = np.zeros(2 ** n, dtype=np.complex128)
    initializer[0] = 1.
    t0 = time.time()
    norm = sum([np.conj(a) * a for a in initializer])
    t1 = time.time()
    print(f"Time for {n} qubits: {t1 - t0:.2f}s")
Time for 20 qubits: 0.93s
Time for 21 qubits: 1.85s
Time for 22 qubits: 3.72s
Time for 23 qubits: 7.43s
Time for 24 qubits: 14.89s
Time for 25 qubits: 29.74s
Time for 26 qubits: 59.53s
The times are very similar when calling kernel.qalloc:
import cudaq
import time
import numpy as np

for n in range(20, 28):
    initializer = np.zeros(2 ** n, dtype=np.complex128)
    initializer[0] = 1.
    t0 = time.time()
    kernel = cudaq.make_kernel()
    qvec = kernel.qalloc(initializer)
    state = cudaq.get_state(kernel)
    t1 = time.time()
    print(f"Time for {n} qubits: {t1 - t0:.2f}s")
Time for 20 qubits: 0.50s
Time for 21 qubits: 0.93s
Time for 22 qubits: 1.83s
Time for 23 qubits: 3.68s
Time for 24 qubits: 7.42s
Time for 25 qubits: 14.54s
Time for 26 qubits: 28.98s
Time for 27 qubits: 58.90s
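For comparison, here is the same loop with the list comprehension replaced by the vectorized expression mentioned above; on the same machines this stays well under 1 s even at 26 qubits. This is just a minimal sketch of the comparison, not a change to cudaq itself.

import numpy as np
import time

for n in range(20, 27):
    initializer = np.zeros(2 ** n, dtype=np.complex128)
    initializer[0] = 1.
    t0 = time.time()
    # Vectorized equivalent of sum([np.conj(a) * a for a in initializer])
    norm = np.sum(np.conj(initializer) * initializer)
    t1 = time.time()
    print(f"Vectorized norm time for {n} qubits: {t1 - t0:.4f}s")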
Expected behavior
It should be possible to efficiently call qalloc on a cudaq PyKernel with an existing, properly normalized numpy state array.
Is this a regression? If it is, put the last known working version (or commit) here.
Not a regression
Environment
I ran this example locally on a MacPro M4, as well as on a g4dn.12xlarge instance with the nvcr.io/nvidia/nightly/cuda-quantum:cu12-latest docker container, using the nvidia-mgpu target via mpiexec (though since it is a numpy call, this should all happen on the host CPU).
Docker image: nvcr.io/nvidia/nightly/cuda-quantum:cu12-latest
Image Tag: a4ce292b7aa2
In the docker container
- CUDA-Q version: amd64-cu12-latest (https://github.com/NVIDIA/cuda-quantum 0e4e5fbc747ccb9100d7e16c5d8e91315084c908)
- Python version: 3.12.3
- C++ compiler: gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
- Operating system: Ubuntu 24.04.3 LTS
Suggestions
Replace the list comprehension with the vectorized numpy expression np.sum(np.conj(initializer) * initializer).
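As a minimal sketch of what this could look like (validate_normalized is a hypothetical helper name, not existing cudaq API; the 1e-4 tolerances match the existing check in kernel_builder, and np.vdot(initializer, initializer) would be an equivalent alternative to the explicit conjugate-multiply-sum):

import numpy as np

def validate_normalized(initializer: np.ndarray, tol: float = 1e-4):
    # Vectorized norm: conjugate, multiply elementwise, and sum in numpy
    # rather than in a Python-level loop.
    norm = np.sum(np.conj(initializer) * initializer)
    if np.abs(norm.imag) > tol or np.abs(1. - norm.real) > tol:
        raise RuntimeError(
            "invalid input state for qalloc (not normalized)")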
Thanks @william-quanscient - these changes look good to me. I put them in PR https://github.com/NVIDIA/cuda-quantum/pull/3634 to run it through CI.
An alternative would be to use cudaq kernel mode for initializing the state. This would move the initialization and validation into compiled C++/MLIR runtime paths instead of the Python list comprehension we are seeing here.
Something like:
import cudaq
import numpy as np
import time

@cudaq.kernel
def zero_state_kernel(n: int):
    q = cudaq.qvector(n)

for n in range(20, 28):
    t0 = time.time()
    state = cudaq.get_state(zero_state_kernel, n)
    norm = np.vdot(state, state)
    t1 = time.time()
    print(f"Time for {n} qubits: {t1 - t0:.4f}s, ||ψ||² = {norm.real:.6f}")
Time for 20 qubits: 0.0640s, ||ψ||² = 1.000000
Time for 21 qubits: 0.0261s, ||ψ||² = 1.000000
Time for 22 qubits: 0.0568s, ||ψ||² = 1.000000
Time for 23 qubits: 0.1199s, ||ψ||² = 1.000000
Time for 24 qubits: 0.2043s, ||ψ||² = 1.000000
Time for 25 qubits: 0.3727s, ||ψ||² = 1.000000
Time for 26 qubits: 0.6856s, ||ψ||² = 1.000000
Time for 27 qubits: 1.2974s, ||ψ||² = 1.000000
I was trying to recreate the issue via the devcontainer image to run the debug-mode tests, but had some issues with the build locally.
In any case, I agree it makes sense to move this check into the compiled C++/MLIR paths. I am not sure whether cudaq.qvector allows initialization with an arbitrary array. Along similar lines, I was wondering whether it makes sense to defer the check until after the data has been moved onto the device, using either the data or init objects below.
norm = sum([np.conj(a) * a for a in initializer])
if np.abs(norm.imag) > 1e-4 or np.abs(1. - norm.real) > 1e-4:
    raise RuntimeError(
        "invalid input state for qalloc (not normalized)")
veqTy = quake.VeqType.get(int(numQubits))
qubits = quake.AllocaOp(veqTy).result
data = self.capturedDataStorage.storeArray(initializer)
init = quake.InitializeStateOp(qubits.type, qubits, data).result
I can look into it and come up with a proposal that at least runs normally (not in debug mode).
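For reference, a rough sketch of the kernel-mode idea, assuming cudaq.qvector can take an initial state passed as a kernel argument (I have not verified that signature against the current API, so treat it as an assumption to check):

import cudaq
import numpy as np

n = 26
initializer = np.zeros(2 ** n, dtype=np.complex128)
initializer[0] = 1.

@cudaq.kernel
def init_state_kernel(vec: list[complex]):
    # Assumption: qvector accepts an initial state argument here, so the
    # normalization check would happen in the compiled path, not in Python.
    q = cudaq.qvector(vec)

state = cudaq.get_state(init_state_kernel, initializer)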