
MPI error when scaling out beyond 1 server

Open pioch-02 opened this issue 7 months ago • 12 comments

Required prerequisites

  • [x] Consult the security policy. If reporting a security vulnerability, do not report the bug using this form. Use the process described in the policy to report the issue.
  • [x] Make sure you've read the documentation. Your issue may be addressed there.
  • [x] Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
  • [x] If possible, make a PR with a failing test to give us a starting point to work on!

Describe the bug

Workloads crash with MPI errors when using more than one node. Platform background: DGX H100 systems with Slurm and Pyxis, running the container nvcr.io/nvidia/quantum/cuda-quantum:cu12-0.10.0. The job is started with:

srun --mpi=pmix --container-image=$HOME/nvidia+quantum+cuda-quantum+cu12-0.10.0.sqsh bash -c "(export CUDAQ_MPI_COMM_LIB=${HOME}/libcudaq_distributed_interface_mpi.so; python ${HOME}/mgpu.py)"
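For reference, the submission is roughly equivalent to a batch script like the following sketch (the node/rank counts are illustrative assumptions, not the exact script used here):

#!/bin/bash
#SBATCH --nodes=2                 # assumption: two DGX H100 nodes
#SBATCH --ntasks-per-node=8       # assumption: one MPI rank per GPU
#SBATCH --gpus-per-node=8

# Launch the containerized CUDA-Q script via Pyxis with PMIx, one rank per GPU.
srun --mpi=pmix \
     --container-image=$HOME/nvidia+quantum+cuda-quantum+cu12-0.10.0.sqsh \
     bash -c "(export CUDAQ_MPI_COMM_LIB=${HOME}/libcudaq_distributed_interface_mpi.so; python ${HOME}/mgpu.py)"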

When running with the container, the error message looks like:

*** An error occurred in MPI_Allgatherv
*** reported by process [351672370,508]
*** on communicator MPI_COMM_WORLD
*** MPI_ERR_TRUNCATE: message truncated
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

We have also compiled CUDA-Q as an environment module; the crash when using that build has more information:

==== backtrace (tid:3833225) ====
 0 0x0000000000003703 uct_cma_ep_tx_error()  /tmp/UCX/1.14.1/GCCcore-12.3.0/ucx-1.14.1/src/uct/sm/scopy/cma/cma_ep.c:81
 1 0x0000000000003a38 uct_cma_ep_tx()  /tmp/UCX/1.14.1/GCCcore-12.3.0/ucx-1.14.1/src/uct/sm/scopy/cma/cma_ep.c:114
 2 0x000000000001dad5 uct_scopy_ep_progress_tx()  /tmp/UCX/1.14.1/GCCcore-12.3.0/ucx-1.14.1/src/uct/sm/scopy/base/scopy_ep.c:151
 3 0x0000000000021181 ucs_arbiter_dispatch_nonempty()  /tmp/UCX/1.14.1/GCCcore-12.3.0/ucx-1.14.1/src/ucs/datastruct/arbiter.c:321
 4 0x000000000001d5f9 ucs_arbiter_dispatch()  /tmp/UCX/1.14.1/GCCcore-12.3.0/ucx-1.14.1/src/ucs/datastruct/arbiter.h:386
 5 0x00000000000218fb ucs_callbackq_slow_proxy()  /tmp/UCX/1.14.1/GCCcore-12.3.0/ucx-1.14.1/src/ucs/datastruct/callbackq.c:404
 6 0x000000000004465a ucs_callbackq_dispatch()  /tmp/UCX/1.14.1/GCCcore-12.3.0/ucx-1.14.1/src/ucs/datastruct/callbackq.h:211
 7 0x000000000004465a uct_worker_progress()  /tmp/UCX/1.14.1/GCCcore-12.3.0/ucx-1.14.1/src/uct/api/uct.h:2768
 8 0x000000000004465a ucp_worker_progress()  /tmp/UCX/1.14.1/GCCcore-12.3.0/ucx-1.14.1/src/ucp/core/ucp_worker.c:2814
 9 0x000000000002e38b opal_progress()  ???:0
10 0x000000000005046d ompi_request_default_wait_all()  ???:0
11 0x000000000008cd6f MPI_Waitall()  ???:0
12 0x00000000000d1d38 custatevec::MPICommPlugin<void*, void*, void*, void*, (anonymous namespace)::ompi_status_public_t>::staticSynchronize()  ompiCommPlugin.cpp:0
13 0x000000000043fdbb custatevecTestMatrixTypeGetWorkspaceSize()  ???:0
14 0x000000000043eeff custatevecTestMatrixTypeGetWorkspaceSize()  ???:0
15 0x0000000000104371 custatevecSVSwapWorkerCreateWithSemaphore()  ???:0
16 0x00000000000ffc98 custatevecSVSwapWorkerExecute()  ???:0
17 0x000000000012c441 ubackend::WireSwapWorkerComm::applyGlobalIndexBitSwaps()  :0
18 0x000000000012c8ce ubackend::WireSwapWorkerComm::swapIndexBits()  :0
19 0x00000000000ae1f6 ubackend::GateApplicatorDistributed::swapIndexBits()  :0
20 0x00000000000ae2b4 ubackend::GateApplicatorDistributed::applyGatesForLocalStateVector()  :0
21 0x00000000000ae62e ubackend::GateApplicatorDistributed::applyGates()  :0
22 0x00000000000ad02d ubackend::GateApplicator::applyQueuedGates()  :0
23 0x0000000000080d8d cudaq::CusvsimCircuitSimulator<float>::flushGateApplicator()  ???:0
24 0x000000000005b4d9 nvqir::CircuitSimulatorBase<float>::resetExecutionContext()  ???:0
25 0x000000000002c793 cudaq::log<>()  ???:0
26 0x000000000005dcd4 cudaq::log<>()  ???:0
27 0x0000000000054582 cudaq::quantum_platform::reset_exec_ctx()  ???:0
28 0x000000000011f613 std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_expression_term<true, true>()  ???:0
29 0x00000000000f06a3 ???()  /software/modules/cudaq/lib/python3.11/site-packages/cudaq/mlir/_mlir_libs/_quakeDialects.cpython-311-x86_64-linux-gnu.so:0
30 0x00000000001e67ed cfunction_call()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Objects/methodobject.c:542
31 0x00000000001c8fef _PyObject_MakeTpCall()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Objects/call.c:214
32 0x00000000001c8fef _PyObject_MakeTpCall()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Objects/call.c:216
33 0x00000000001d1d5c _PyEval_EvalFrameDefault()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/ceval.c:4773
34 0x00000000001cd8ea _PyEval_EvalFrame()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/./Include/internal/pycore_ceval.h:73
35 0x00000000001cd8ea _PyEval_Vector()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/ceval.c:6443
36 0x0000000000254d61 PyEval_EvalCode()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/ceval.c:1154
37 0x0000000000271b33 run_eval_code_obj()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/pythonrun.c:1714
38 0x000000000026e40a run_mod()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/pythonrun.c:1735
39 0x000000000029290f pyrun_file()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/pythonrun.c:1630
40 0x0000000000292564 _PyRun_SimpleFileObject()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/pythonrun.c:440
41 0x0000000000292124 _PyRun_AnyFileObject()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/pythonrun.c:79
42 0x000000000027c55b pymain_run_file_obj()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Modules/main.c:360
43 0x000000000027c55b pymain_run_file()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Modules/main.c:379
44 0x000000000027c55b pymain_run_python()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Modules/main.c:601
45 0x000000000027c55b Py_RunMain()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Modules/main.c:680
46 0x0000000000245547 Py_BytesMain()  /tmp/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Modules/main.c:734
47 0x000000000002a1ca __libc_init_first()  ???:0
48 0x000000000002a28b __libc_start_main()  ???:0
49 0x0000000000401065 _start()  ???:0
=================================

Steps to reproduce the bug

Code used to reproduce the crash:

import cudaq

cudaq.set_target("nvidia", option="mgpu")
qubit_count = 35
term_count = 10

@cudaq.kernel
def kernel(qubit_count: int):
    qubits = cudaq.qvector(qubit_count)
    h(qubits[0])
    for i in range(1, qubit_count):
        cx(qubits[0], qubits[i])
counts = cudaq.sample(kernel, qubit_count)
if (cudaq.mpi.rank()==0):
    counts.dump()

Expected behavior

No crash, with correct output: the kernel prepares a GHZ state, so the sampled counts should contain only the all-zeros and all-ones bitstrings in roughly equal proportion.

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression

Environment

  • Container: nvcr.io/nvidia/quantum/cuda-quantum:cu12-0.10.0
  • CUDA-Q version: 0.10.0
  • Python version: 3.10.12
  • C++ compiler: Not sure, using a container
  • Operating system: DGXOS 6.3.1

Suggestions

No response

pioch-02 avatar May 09 '25 11:05 pioch-02

Try setting the random seed explicitly in your code with "cudaq.set_random_seed(123)", either right after you import cudaq or after you set the target.

marwafar avatar May 09 '25 12:05 marwafar

When I set it before setting target, i.e.:

import cudaq
cudaq.set_random_seed(123)
cudaq.set_target("nvidia", option="mgpu")

then nothing changes (the same errors for both the container and the modules). If I move it after setting the target, i.e.:

import cudaq
cudaq.set_target("nvidia", option="mgpu")
cudaq.set_random_seed(123)

then the module version still produces the same errors, but the container crashes with a different message:

python: /root/.llvm-project/llvm/include/llvm/Support/CommandLine.h:864: void llvm::cl::parser<DataType>::addLiteralOption(llvm::StringRef, const DT&, llvm::StringRef) [with DT = llvm::FunctionPass* (*)(); DataType = llvm::FunctionPass* (*)()]: Assertion `findOption(Name) == Values.size() && "Option already exists!"' failed.
*** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)
[ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x1555551f0520]
[ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x1555552449fc]
[ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x1555551f0476]
[ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x1555551d67f3]
[ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x1555551d671b]
[ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x1555551e7e96]
*** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)

pioch-02 avatar May 09 '25 14:05 pioch-02

@1tnguyen can you take a look at this please?

zohimchandani avatar May 09 '25 17:05 zohimchandani

Where is libcudaq_distributed_interface_mpi.so from?

mitchdz avatar May 09 '25 17:05 mitchdz

I ran the container interactively and built the library inside it. I also tried a file built against the modules but used with the container, plus all other permutations; the results were always the same.
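For reference, building the plugin inside the container roughly amounts to sourcing the bundled activation script; a sketch, where the MPI_PATH value is an assumption and must point at the cluster's MPI installation:

# Inside the container: point MPI_PATH at the MPI installation the plugin should wrap.
export MPI_PATH=/usr/local/openmpi          # assumption: adjust to your MPI install
source /opt/nvidia/cudaq/distributed_interfaces/activate_custom_mpi.sh
# The script compiles mpi_comm_impl.cpp into libcudaq_distributed_interface_mpi.so
# and is expected to set CUDAQ_MPI_COMM_LIB to point at it (verify in its output).
echo "$CUDAQ_MPI_COMM_LIB"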

pioch-02 avatar May 09 '25 17:05 pioch-02

Can you try this too and paste the error:

import cudaq

cudaq.set_target("nvidia", option="mgpu")
cudaq.set_random_seed(123)

cudaq.mpi.initialize()

qubit_count = 35
term_count = 10

@cudaq.kernel
def kernel(qubit_count: int):
    qubits = cudaq.qvector(qubit_count)
    h(qubits[0])
    for i in range(1, qubit_count):
        cx(qubits[0], qubits[i])
counts = cudaq.sample(kernel, qubit_count)
if (cudaq.mpi.rank()==0):
    counts.dump()

cudaq.mpi.finalize()


zohimchandani avatar May 09 '25 18:05 zohimchandani

Sure thing. I removed the repeating error; here's the beginning and the end:

python: /root/.llvm-project/llvm/include/llvm/Support/CommandLine.h:864: void llvm::cl::parser<DataType>::addLiteralOption(llvm::StringRef, const DT&, llvm::StringRef) [with DT = llvm::FunctionPass* (*)(); DataType = llvm::FunctionPass* (*)()]: Assertion `findOption(Name) == Values.size() && "Option already exists!"' failed.
*** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)
[ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x1555551f0520]
[ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x1555552449fc]
[ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x1555551f0476]
[ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x1555551d67f3]
[ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x1555551d671b]
[ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x1555551e7e96]
[ 6] /opt/nvidia/cudaq/cudaq/mlir/_mlir_libs/libCUDAQuantumPythonCAPI.so(+0x45ce133)[0x1554c6c53133]
[ 7] /opt/nvidia/cudaq/lib/libcudaq-mlir-runtime.so(+0xa2592e)[0x15515325792e]
[ 8] /lib64/ld-linux-x86-64.so.2(+0x647e)[0x15555552047e]
[ 9] /lib64/ld-linux-x86-64.so.2(+0x6568)[0x155555520568]
[10] /usr/lib/x86_64-linux-gnu/libc.so.6(_dl_catch_exception+0xe5)[0x155555322af5]
[11] /lib64/ld-linux-x86-64.so.2(+0xdff6)[0x155555527ff6]
[12] /usr/lib/x86_64-linux-gnu/libc.so.6(_dl_catch_exception+0x88)[0x155555322a98]
[13] /lib64/ld-linux-x86-64.so.2(+0xe34e)[0x15555552834e]
[14] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9063c)[0x15555523e63c]
[15] /usr/lib/x86_64-linux-gnu/libc.so.6(_dl_catch_exception+0x88)[0x155555322a98]
[16] /usr/lib/x86_64-linux-gnu/libc.so.6(_dl_catch_error+0x33)[0x155555322b63]
[17] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9012e)[0x15555523e12e]
[18] /usr/lib/x86_64-linux-gnu/libc.so.6(dlopen+0x48)[0x15555523e6c8]
[19] /opt/nvidia/cudaq/lib/libcudaq.so(_ZN5cudaq9MPIPluginC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x50)[0x1554c1af1680]
[20] /opt/nvidia/cudaq/lib/libcudaq.so(_ZN5cudaq3mpi12getMpiPluginEb+0x12d)[0x1554c1a7e6bd]
[21] /opt/nvidia/cudaq/lib/libcudaq.so(_ZN5cudaq3mpi10initializeEv+0xf)[0x1554c1a7ef8f]
[22] /opt/nvidia/cudaq/cudaq/mlir/_mlir_libs/_quakeDialects.cpython-310-x86_64-linux-gnu.so(+0x10b7a5)[0x1554c1cf87a5]
[23] /opt/nvidia/cudaq/cudaq/mlir/_mlir_libs/_quakeDialects.cpython-310-x86_64-linux-gnu.so(+0xfafb3)[0x1554c1ce7fb3]
[24] python(+0x18ae12)[0x5555556dee12]
[25] python(_PyObject_MakeTpCall+0x25b)[0x5555556d575b]
[26] python(_PyEval_EvalFrameDefault+0x5f66)[0x5555556cf1d6]
[27] python(+0x259f56)[0x5555557adf56]
[28] python(PyEval_EvalCode+0x86)[0x5555557ade26]
[29] python(+0x280808)[0x5555557d4808]
*** End of error message ***
[...]
/usr/bin/bash: line 1: 1346009 Aborted                 (core dumped) python mgpu2.py
slurmstepd: error:  mpi/pmix_v4: _errhandler: node001 [0]: pmixp_client_v2.c:211: Error handler invoked: status = -61, source = [slurm.pmix.48929.0:2]
slurmstepd: error: *** STEP 48929.0 ON node001 CANCELLED AT 2025-05-10T01:04:44 ***

pioch-02 avatar May 09 '25 23:05 pioch-02

Can you please try the latest image: docker pull nvcr.io/nvidia/nightly/cuda-quantum:cu12-latest

A recent PR in the latest image may fix this issue. Thanks
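For a Pyxis/enroot setup like the one above, the nightly image would typically be converted into a squash file before use; a sketch, assuming enroot produced the existing .sqsh (registry credentials for nvcr.io may be needed):

# Import the nightly image into an enroot squash file usable with --container-image.
enroot import docker://nvcr.io#nvidia/nightly/cuda-quantum:cu12-latest
# Should produce something like nvidia+nightly+cuda-quantum+cu12-latest.sqsh in the
# current directory, which can then replace the 0.10.0 .sqsh in the srun command.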

zohimchandani avatar May 10 '25 09:05 zohimchandani

It still crashes. I created a new .so file and used both the original code and the suggested version; it crashed in both cases:

python: /root/.llvm-project/llvm/include/llvm/Support/CommandLine.h:864: void llvm::cl::parser<DataType>::addLiteralOption(llvm::StringRef, const DT&, llvm::StringRef) [with DT = llvm::FunctionPass* (*)(); DataType = llvm::FunctionPass* (*)()]: Assertion `findOption(Name) == Values.size() && "Option already exists!"' failed.
*** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)
[ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x1555551f0520]
[ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x1555552449fc]
[ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x1555551f0476]
[ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x1555551d67f3]
[ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x1555551d671b]
[ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x1555551e7e96]
[ 6] /opt/nvidia/cudaq/cudaq/mlir/_mlir_libs/libCUDAQuantumPythonCAPI.so(+0x6c223c3)[0x1554c925c3c3]
[ 7] /opt/nvidia/cudaq/lib/libcudaq-mlir-runtime.so(+0x3d66f8e)[0x15515655cf8e]
[ 8] /lib64/ld-linux-x86-64.so.2(+0x647e)[0x15555552047e]
[ 9] /lib64/ld-linux-x86-64.so.2(+0x6568)[0x155555520568]
[10] /usr/lib/x86_64-linux-gnu/libc.so.6(_dl_catch_exception+0xe5)[0x155555322af5]
[11] /lib64/ld-linux-x86-64.so.2(+0xdff6)[0x155555527ff6]
[12] /usr/lib/x86_64-linux-gnu/libc.so.6(_dl_catch_exception+0x88)[0x155555322a98]
[13] /lib64/ld-linux-x86-64.so.2(+0xe34e)[0x15555552834e]
[14] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9063c)[0x15555523e63c]
[15] /usr/lib/x86_64-linux-gnu/libc.so.6(_dl_catch_exception+0x88)[0x155555322a98]
[16] /usr/lib/x86_64-linux-gnu/libc.so.6(_dl_catch_error+0x33)[0x155555322b63]
[17] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9012e)[0x15555523e12e]
[18] /usr/lib/x86_64-linux-gnu/libc.so.6(dlopen+0x48)[0x15555523e6c8]
[19] /opt/nvidia/cudaq/lib/libcudaq.so(_ZN5cudaq9MPIPluginC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x58)[0x1554c17f0478]
[20] /opt/nvidia/cudaq/lib/libcudaq.so(_ZN5cudaq3mpi12getMpiPluginEb+0x12d)[0x1554c176b5dd]
[21] /opt/nvidia/cudaq/lib/libcudaq.so(_ZN5cudaq3mpi10initializeEv+0xf)[0x1554c176bedf]
[22] /opt/nvidia/cudaq/cudaq/mlir/_mlir_libs/_quakeDialects.cpython-310-x86_64-linux-gnu.so(+0x299335)[0x1554c1b89335]
[23] /opt/nvidia/cudaq/cudaq/mlir/_mlir_libs/_quakeDialects.cpython-310-x86_64-linux-gnu.so(+0x2cc84a)[0x1554c1bbc84a]
[24] python(+0x18ae12)[0x5555556dee12]
[25] python(_PyObject_MakeTpCall+0x25b)[0x5555556d575b]
[26] python(_PyEval_EvalFrameDefault+0x5f66)[0x5555556cf1d6]
[27] python(+0x259f56)[0x5555557adf56]
[28] python(PyEval_EvalCode+0x86)[0x5555557ade26]
[29] python(+0x280808)[0x5555557d4808]
*** End of error message ***
[...]
slurmstepd: error:  mpi/pmix_v4: _errhandler: node001 [1]: pmixp_client_v2.c:211: Error handler invoked: status = -61, source = [slurm.pmix.48958.0:13]

pioch-02 avatar May 10 '25 10:05 pioch-02

This might be a shot in the dark, but I was digging for which functions might be duplicated, since the addLiteralOption assertion presumably fires when the same LLVM option gets registered twice, i.e. when two copies of the LLVM registration code are loaded. My search started with finding the llvm::RegisterPass instantiations:

$ nm -C --defined-only /opt/nvidia/cudaq/**/*.so | grep 'RegisterPass<'
00000000030407c0 t llvm::RegisterPass<(anonymous namespace)::DebugifyModulePass>::~RegisterPass()
00000000030407c0 t llvm::RegisterPass<(anonymous namespace)::DebugifyModulePass>::~RegisterPass()
0000000003040780 t llvm::RegisterPass<(anonymous namespace)::DebugifyFunctionPass>::~RegisterPass()
0000000003040780 t llvm::RegisterPass<(anonymous namespace)::DebugifyFunctionPass>::~RegisterPass()
00000000030407a0 t llvm::RegisterPass<(anonymous namespace)::CheckDebugifyModulePass>::~RegisterPass()
00000000030407a0 t llvm::RegisterPass<(anonymous namespace)::CheckDebugifyModulePass>::~RegisterPass()
0000000003040760 t llvm::RegisterPass<(anonymous namespace)::CheckDebugifyFunctionPass>::~RegisterPass()
0000000003040760 t llvm::RegisterPass<(anonymous namespace)::CheckDebugifyFunctionPass>::~RegisterPass()

From this I can see multiple DebugifyModulePass definitions:

$ nm -C --defined-only /opt/nvidia/cudaq/**/*.so | grep DebugifyModulePass | cut -d: -f1 | sort | uniq -c
      2 00000000030406c0 t (anonymous namespace)
      2 00000000030407a0 t llvm
      2 00000000030407c0 t llvm
      1 0000000003040840 t llvm
      1 00000000030408a0 t llvm
      2 00000000030409c0 t (anonymous namespace)
      1 00000000030409e0 t (anonymous namespace)
      2 0000000003040a20 t (anonymous namespace)
      1 0000000003040a40 t (anonymous namespace)
      1 0000000003041c70 T createDebugifyModulePass(DebugifyMode, llvm
      1 0000000003041e40 T createCheckDebugifyModulePass(bool, llvm
      1 00000000030479f0 t (anonymous namespace)
      1 000000000304b040 t (anonymous namespace)
      1 00000000052756a0 d vtable for (anonymous namespace)
      1 00000000052757e0 d vtable for (anonymous namespace)
      1 0000000005310700 b (anonymous namespace)
      1 00000000053107d8 b (anonymous namespace)

Finding which .so files contain this symbol:

$ find /opt/nvidia/cudaq -name "*.so" | while read sofile; do
  nm -C --defined-only "$sofile" 2>/dev/null | grep -q DebugifyModulePass && echo "$sofile"
done
/opt/nvidia/cudaq/cudaq/mlir/_mlir_libs/libCUDAQuantumPythonCAPI.so
/opt/nvidia/cudaq/lib/libcudaq-mlir-runtime.so

I noticed that the MPI plugin ELF is linked against libcudaq-mlir-runtime.so:

cudaq@2570547-lcedt:/tmp$ nvq++ -shared -fPIC /opt/nvidia/cudaq/distributed_interfaces/mpi_comm_impl.cpp -o libmpi.so
cudaq@2570547-lcedt:/tmp$ readelf -d libmpi.so | grep libcudaq-mlir-runtime.so
 0x0000000000000001 (NEEDED)             Shared library: [libcudaq-mlir-runtime.so]

This doesn't seem right. You can disable linking against the MLIR runtime with --disable-mlir-links, which results in:

cudaq@2570547-lcedt:/tmp$ nvq++ --disable-mlir-links -shared -fPIC /opt/nvidia/cudaq/distributed_interfaces/mpi_comm_impl.cpp -o libmpi.so
cudaq@2570547-lcedt:/tmp$ readelf -d libmpi.so | grep libcudaq-mlir-runtime.so
cudaq@2570547-lcedt:/tmp$ 

When generating libcudaq_distributed_interface_mpi.so, could you try modifying activate_custom_mpi.sh to add --disable-mlir-links, like this:

$CXX -shared -std=c++17 -fPIC --disable-mlir-links \
    -I"${MPI_PATH}/include" \
    -I"$this_file_dir" \
    "$this_file_dir/mpi_comm_impl.cpp" \
    -L"${MPI_PATH}/lib64" -L"${MPI_PATH}/lib" -lmpi \
    -Wl,-rpath="${MPI_PATH}/lib64" -Wl,-rpath="${MPI_PATH}/lib" \
    -o "$lib_mpi_plugin"

mitchdz avatar May 12 '25 22:05 mitchdz

@mitchdz This seems to have solved the issue: { 0000000000000000000000000000000000000000:486 1111111111111111111111111111111111111111:514 }

Thank you very much!

pioch-02 avatar May 13 '25 07:05 pioch-02

Re-opening so we can associate this issue with the PR that fixes it.

bmhowe23 avatar May 13 '25 14:05 bmhowe23