Please help me simulate on multiple nodes
Hi, I want to simulate the state vector on multiple nodes. I checked this and got it working on multiple GPUs, but not on multiple nodes.
I am not sure how to set up the environment on each machine or how to run the application.
Could you point me to the most relevant examples or documentation I should refer to? Thanks in advance.
I manually started the server on each machine but encountered the following error.
terminate called after throwing an instance of 'std::runtime_error' what(): Failed to launch kernel. Error: Failed to execute the kernel on the remote server: "Failed to process incoming request" Error message: "Failed to open simulator backend library: /root/miniconda3/envs/cudaq-env/lib/python3.11/site-packages/bin/../lib/libnvqir-nvidia-mqpu.so: cannot open shared object file: No such file or directory."
Where can I get this .so file? Could you tell me how to resolve this? Thanks.
Error message: "Failed to open simulator backend library: /root/miniconda3/envs/cudaq-env/lib/python3.11/site-packages/bin/../lib/libnvqir-nvidia-mqpu.so: cannot open shared object file: No such file or directory."
It might just be a typo. It should be nvidia-mgpu if we want to use the distributed state vector simulator backend as the remote virtual QPU.
Regarding the question about multi-node simulation:
I want to simulate the state vector on multiple nodes. I checked this https://nvidia.github.io/cuda-quantum/latest/using/backends/sims/svsims.html#multi-gpu-multi-node and got it working on multiple GPUs, but not on multiple nodes.
I am not sure how to set up the environment on each machine or how to run the application.
It depends on how the cluster is configured. Usually, the multi-node multi-GPU capability of the nvidia target just works across the whole cluster, or across a set of nodes allocated by a job scheduler, e.g., SLURM.
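For illustration, here is a minimal sketch of such a run; the GHZ kernel, the script name, and the launch commands are assumptions to adapt to your own cluster and allocation:
# ghz_mgpu.py - minimal sketch of a distributed state vector run (assumed example)
# Inside a SLURM allocation this could be launched with something like:
#   srun -N 2 --ntasks-per-node=2 --gpus-per-node=2 python3 ghz_mgpu.py
# or, without SLURM: mpiexec -np 4 python3 ghz_mgpu.py
import cudaq

cudaq.mpi.initialize()
cudaq.set_target("nvidia", option="mgpu,fp32")  # distributed state vector backend

@cudaq.kernel
def ghz(n: int):
    qubits = cudaq.qvector(n)
    h(qubits[0])
    for i in range(n - 1):
        x.ctrl(qubits[i], qubits[i + 1])
    mz(qubits)

counts = cudaq.sample(ghz, 34, shots_count=1000)
if cudaq.mpi.rank() == 0:
    counts.dump()
cudaq.mpi.finalize()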
Hi @1tnguyen, when I use multiple nodes for state vector simulation, should I manually start the server?
Is this the right documentation for it?
# Locate the cudaq-qpud.py server script shipped with the cudaq Python package
cudaq_location=`python3 -m pip show cudaq | grep -e 'Location: .*$'`
qpud_py="${cudaq_location#Location: }/bin/cudaq-qpud.py"
# Start one virtual QPU server backed by 2 MPI ranks / 2 GPUs on the given port
CUDA_VISIBLE_DEVICES=0,1 mpiexec -np 2 python3 "$qpud_py" --port <QPU 1 TCP/IP port number>
Will you please show me an example of SLURM usage?
Hi @CoolMLAI
When using the remote-mqpu platform, each virtual QPU (on each machine with 2 GPUs) acts as an independent simulator. This is designed for scale-out workloads, e.g., multiple quantum kernels executed in parallel. It does not help to increase the number of qubits, since it does not combine all 4 GPUs across the two machines into a single distributed state vector simulation. This might explain the "requested size is too big" error message.
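As a rough sketch of that scale-out pattern (the kernel and the number of virtual QPUs below are assumptions, and the remote-mqpu target is assumed to be configured and launched already):
# Scale-out sketch: independent kernels dispatched to independent virtual QPUs.
# Each virtual QPU simulates its own state vector; nothing is combined.
import cudaq

@cudaq.kernel
def bell():
    q = cudaq.qvector(2)
    h(q[0])
    x.ctrl(q[0], q[1])
    mz(q)

num_qpus = cudaq.get_target().num_qpus()  # e.g., one virtual QPU per workstation
futures = [cudaq.sample_async(bell, qpu_id=i) for i in range(num_qpus)]
for f in futures:
    print(f.get())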
If you want to combine multiple workstations into an HPC cluster so that we can run MPI across them, this tutorial may be helpful. We might also want to confirm that CUDA-aware MPI is working across all the GPUs in the cluster, e.g., with a simple test program, after the configuration.
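For example, a small check along these lines, assuming mpi4py and CuPy are installed in the same environment (it hands GPU buffers directly to MPI, so it only succeeds when the MPI library is CUDA-aware):
# cuda_aware_check.py - run with, e.g., mpiexec -np 2 python3 cuda_aware_check.py
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
# Pin each rank to a GPU (round-robin over the visible devices).
cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

send = cp.full(1024, rank, dtype=cp.float64)   # device buffer
recv = cp.empty_like(send)                     # device buffer
comm.Allreduce(send, recv, op=MPI.SUM)         # device-to-device reduction

expected = sum(range(comm.Get_size()))
assert cp.allclose(recv, expected), "Allreduce result mismatch"
if rank == 0:
    print("CUDA-aware Allreduce succeeded on", comm.Get_size(), "ranks")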
Hi @1tnguyen, thanks for your kind reply. I really appreciate your help.
I see the CUDA-Q docs suggest Open MPI instead of MPICH2, and I wonder whether CUDA-Q is compatible with MPICH2 as well. For example, when I use MPICH2 instead of Open MPI, I encounter a segmentation fault.
For Open MPI, the docs suggest an environment variable configuration, but for MPICH2 they don't.
Can you help me fix this error? ☺
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 1829338 RUNNING AT 115cf9e2f1c2
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
I see the CUDA-Q docs suggest Open MPI instead of MPICH2, and I wonder whether CUDA-Q is compatible with MPICH2 as well.
We've tested CUDA-Q's nvidia-mgpu simulator with MPICH, in particular, Cray MPICH.
I don't have any experience with compiling and installing MPICH. The Cray MPICH that I used was installed on the cluster by the system administrator.
Regarding your errors, here are a few questions/suggestions that I can think of:
- How did you install MPICH2 (e.g., from a package manager or from source)? Have you tried running some simple CUDA-aware test programs to validate the installation?
- Did the 'Segmentation fault' error above occur only for multi-node runs or in all cases (e.g., within a single node)? Node-to-node communication may use a different communication stack.
Thanks for your reply. I installed MPICH with pip instead of Open MPI. I have code that works well with mgpu on a single node, but if I replace Open MPI with MPICH2, the segmentation fault occurs. So the error occurred on a single node; I didn't even try multiple nodes.
So the solution is that, when I create the cluster, I should choose a template that includes Cray MPICH, right? Could you please tell me more about the way you tested it so I can reproduce it? 🙂
The mpich package from pip wouldn't have CUDA support.
In our docs, we recommended using openmpi from conda-forge because it has CUDA support.
Unfortunately, the mpich package on conda-forge is not CUDA-aware (see here).
If you want to use MPICH, one option is to build MPICH, following the "GPU support" instructions in its documentation.
Thanks very much, @1tnguyen 🙇 I really appreciate your help.
Actually, I ran the multi-GPU option on a single node to get the state vector, but I found something weird. I set the target option to mgpu,fp32, but the state vector's peaked bitstring is slightly different from the correct one for a 37-qubit peaked circuit. For smaller circuits (fewer than 37 qubits) it finds the correct bitstring. Really strange. I wonder whether it is due to the index data type in cupy or a rounding error, but I'm not sure.
I am really curious what causes this error. Could you please share your opinion on this? 🙂
No problem, @CoolMLAI.
For the issue above, is there a reproducer test case that you can share? It's easier to debug it with the code.
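Even a skeleton along the following lines would help, with the kernel body replaced by your peaked circuit; comparing the fp32 and fp64 results would also be a quick way to test the rounding-error hypothesis (the script layout and launch line here are assumptions):
# reproducer_sketch.py - skeleton only; replace the kernel body with the real
# peaked circuit. Run once per precision, e.g.:
#   mpiexec -np <number of GPUs> python3 reproducer_sketch.py fp32
#   mpiexec -np <number of GPUs> python3 reproducer_sketch.py fp64
import sys
import cudaq

precision = sys.argv[1] if len(sys.argv) > 1 else "fp32"
cudaq.mpi.initialize()
cudaq.set_target("nvidia", option=f"mgpu,{precision}")

@cudaq.kernel
def peaked_circuit():
    q = cudaq.qvector(37)  # placeholder: build the actual 37-qubit peaked circuit here
    h(q[0])
    mz(q)

counts = cudaq.sample(peaked_circuit, shots_count=10000)
if cudaq.mpi.rank() == 0:
    print(precision, "most probable bitstring:", counts.most_probable())
cudaq.mpi.finalize()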