cudaErrorIllegalAddress error when using exp_pauli(...) on multiple GPUs
Required prerequisites
- [x] Consult the security policy. If reporting a security vulnerability, do not report the bug using this form. Use the process described in the policy to report the issue.
- [x] Make sure you've read the documentation. Your issue may be addressed there.
- [x] Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
- [ ] If possible, make a PR with a failing test to give us a starting point to work on!
Describe the bug
When exp_pauli(...) is used with a Pauli string inside a kernel, cudaq.observe(...) and cudaq.sample(...) calls fail with the error:
RuntimeError: cudaErrorIllegalAddress
RuntimeError: cudaErrorIllegalAddress
RuntimeError: cudaErrorIllegalAddress
terminate called after throwing an instance of 'ubackend::RuntimeError'
what(): cudaErrorIllegalAddress
if the Python script is started with multiple ranks/GPUs (mpirun -n 4 python ...). Single-qubit gates and CX gates work without problems, even when the number of qubits exceeds the memory of a single GPU and several GPUs are required. The error also does not occur with only one rank/GPU (mpirun -n 1 python ...).
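For contrast, a kernel built only from single-qubit and CX gates runs fine under the same multi-GPU launch. A minimal sketch of such a control case (kernel name and gate pattern are my own, not taken from the failing script):
import cudaq

cudaq.set_target("nvidia", option="mgpu")
rank = cudaq.mpi.rank()

# Control case: only single-qubit and CX gates, no exp_pauli.
@cudaq.kernel
def ghz_kernel(qubit_count: int):
    qubits = cudaq.qvector(qubit_count)
    h(qubits[0])
    for i in range(1, qubit_count):
        x.ctrl(qubits[i - 1], qubits[i])

# Completes without error under `mpirun -n 4 python ...`; per the report this also
# holds when the qubit count is pushed past a single GPU's memory.
result = cudaq.sample(ghz_kernel, 30)
if rank == 0:
    print(result)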
Steps to reproduce the bug
Create an exp_pauli.py script:
import cudaq
from cudaq import spin
from time import perf_counter_ns
cudaq.set_target("nvidia", option="mgpu")
rank = cudaq.mpi.rank()
@cudaq.kernel
def kernel(qubit_count: int):
    qubits = cudaq.qvector(qubit_count)
    exp_pauli(0.1, qubits, "IIIIYXYYIIIIIIIIIIIIIIIIIIIIII")

qubit_count = 30
op = cudaq.spin.z(0)

t0 = perf_counter_ns()
result = cudaq.observe(kernel, op, qubit_count)
t = (perf_counter_ns() - t0) * 1e-9

if rank == 0:
    print(f"time: {t:.3f} sec")
Start the script with several ranks/GPUs:
mpirun -n 4 python exp_pauli.py
Expected behavior
The time taken by the cudaq.observe(...) call is printed.
Is this a regression? If it is, put the last known working version (or commit) here.
Not a regression
Environment
- CUDA-Q version: 0.8 and 0.9 both tested
- Python version: 3.11
- Operating system: Linux
Suggestions
No response
Hi @FabianLangkabel,
This issue should have been fixed in the latest CUDA-Q release (0.10.0).
Running exp_pauli() gates with mgpu now works without problems in 0.10 and 0.11, but cudaq.observe() + mgpu now leads to the following error on, for example, 2 GPUs:
RuntimeError: cudaErrorMemoryAllocation
RuntimeError: cudaErrorMemoryAllocation
*** The MPI_Barrier() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[gn24:4179415] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Barrier() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[gn24:4179417] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Barrier() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
cudaq.sample works fine in the same example script.
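For reference, the sampling path differs from the failing observe path only in the final call; a minimal sketch of the swap against the script above (the shots count is an arbitrary choice of mine):
# Same kernel, operator, and setup as in the script above; only the call changes.
t0 = perf_counter_ns()
result = cudaq.sample(kernel, qubit_count, shots_count=1000)  # completes on 2 GPUs
t = (perf_counter_ns() - t0) * 1e-9
if rank == 0:
    print(f"time: {t:.3f} sec")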
Hi Fabian, is this failure coming from your previous code?
import cudaq
from cudaq import spin
from time import perf_counter_ns
cudaq.set_target("nvidia", option="mgpu")
rank = cudaq.mpi.rank()
@cudaq.kernel
def kernel(qubit_count: int):
    qubits = cudaq.qvector(qubit_count)
    exp_pauli(0.1, qubits, "IIIIYXYYIIIIIIIIIIIIIIIIIIIIII")

qubit_count = 30
op = cudaq.spin.z(0)

t0 = perf_counter_ns()
result = cudaq.observe(kernel, op, qubit_count)
t = (perf_counter_ns() - t0) * 1e-9

if rank == 0:
    print(f"time: {t:.3f} sec")
I just tried this on a 4xGH200 system with the 0.11.0 container.
$ mpirun -n 4 python exp_pauli.py
time: 1.837 sec
I have just tested this again with a CUDA-Q 0.11.0 Singularity container, and the error does come from the code shown, but only as soon as the required GPU memory exceeds that of a single GPU; in my case (4x A100 40GB) that means starting from 32 qubits. cudaq.sample works for up to 34 qubits on one node.
I appreciate the clarification, Fabian.
Modifying the code to use 34 qubits on my 4xGH200 system, I can also reproduce it.
$ nvq++ --version
nvq++ Version cu12-0.11.0 (https://github.com/NVIDIA/cuda-quantum 076fcaca322f2b2252c262005916e5f980b7b849)
# also tested on cu12-latest (https://github.com/NVIDIA/cuda-quantum c521d8d84c18a894117c2b1c2b1f73b7a03822ef)
$ cat exp_pauli.py
import cudaq
from cudaq import spin
from time import perf_counter_ns
cudaq.set_target("nvidia", option="mgpu")
rank = cudaq.mpi.rank()
@cudaq.kernel
def kernel(qubit_count: int):
    qubits = cudaq.qvector(qubit_count)
    exp_pauli(0.1, qubits, "IIIIYXYYIIIIIIIIIIIIIIIIIIIIIIIIII")

qubit_count = 34
op = cudaq.spin.z(0)

t0 = perf_counter_ns()
result = cudaq.observe(kernel, op, qubit_count)
t = (perf_counter_ns() - t0) * 1e-9

if rank == 0:
    print(f"time: {t:.3f} sec")
$ export CUDAQ_MGPU_P2P_DEVICE_BITS=2
$ mpirun -np 4 python exp_pauli.py
RuntimeError: cudaErrorMemoryAllocation
RuntimeError: cudaErrorMemoryAllocation
RuntimeError: cudaErrorMemoryAllocation
RuntimeError: cudaErrorMemoryAllocation
*** The MPI_Barrier() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[utskinnyjoe-dvt-43:02065] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
...
Try setting the environment variable:
export CUDAQ_MAX_GPU_MEMORY_GB=32
This is due to a memory allocation method in our software stack that tries to greedily allocate more memory than needed.
In your system setup (4x A100 40GB), the above setting could be export CUDAQ_MAX_GPU_MEMORY_GB=16 (20 should also be fine).
We require a large workspace to perform distributed expectation calculations via the inner-product method; hence, the maximum qubit count is lower than that of the sampling workflow.
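To put rough numbers on this, a back-of-the-envelope sketch (my own estimate; it assumes 8 bytes per amplitude, i.e. a single-precision complex state vector, and simply doubles the footprint to account for the extra workspace, which may not match the backend's actual allocation strategy):
# Rough per-GPU memory for an n-qubit state vector split across n_gpus ranks,
# plus one extra state-vector-sized workspace for the inner-product path.
def per_gpu_gib(n_qubits: int, n_gpus: int, bytes_per_amp: int = 8) -> float:
    state_bytes = (2 ** n_qubits) * bytes_per_amp
    return 2 * state_bytes / n_gpus / 2 ** 30

print(per_gpu_gib(34, 4))  # ~64 GiB per GPU with the workspace, vs ~32 GiB for sampling alone
This is consistent with the observation above that sampling still fits at 34 qubits on 4x A100 40GB while the expectation-value path runs out of memory.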
As @mitchdz mentioned, the root cause is that, in some cases, our software stack tries to allocate more memory than needed (e.g., to prevent future reallocations), causing an OOM if an expectation-value calculation (observe) is later requested. We'll look into fixing this memory forecast bug.
Thanks for the suggested solution. Everything seems to work for me now, and the issue can be closed unless you want to keep it open for visibility of the memory forecast bug and the workaround.