SST reader stuck when using RDMA

Open abhishek1297 opened this issue 1 year ago • 13 comments

I am trying to run the basic SST-related examples from examples/hello/:
Reader: sstReader/sstReader.py
Writer: sstWriter/sstWriter.py

But, the SST reader always gets stuck when using RDMA data transport.

Installations

I am currently using a conda environment for adios2 python bindings. Here's what I do on the cluster,

>> module load \
    conda/23.5.0 \
    cmake/3.23.3_gcc-10.4.0 \
    openmpi/4.1.5_gcc-10.4.0 \
    gcc/10.4.0_gcc-10.4.0
>> module list
Currently Loaded Modules:
  1) conda/23.5.0                   7) singularity/3.8.7_gcc-10.4.0
  2) cmake/3.23.3_gcc-10.4.0        8) cuda/11.7.1_gcc-10.4.0
  3) libfabric/1.15.1_gcc-10.4.0    9) rdma-core/41.0_gcc-10.4.0
  4) opa-psm2/11.2.230_gcc-10.4.0  10) ucx/1.13.1_gcc-10.4.0
  5) pmix/4.1.2_gcc-10.4.0         11) openmpi/4.1.5_gcc-10.4.0
  6) go/1.18_gcc-10.4.0            12) gcc/10.4.0_gcc-10.4.0
>> conda create -n adios python=3.10 zeromq=4.3.4 -y

Letting mpi4py use existing OpenMPI

>> conda activate adios
>> echo $MPICC
/grid5000/spack/v1/opt/spack/linux-debian11-x86_64_v2/gcc-10.4.0/openmpi-4.1.5-34kj6dkmk4pg3e3nqniaidqj7l2rkkww/bin/mpicc
>> pip3 install --no-binary :all: mpi4py

The OpenMPI module has support for libfabric as well as UCX.

>> ompi_info
MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA btl: ofi (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v4.1.5)

With the loaded modules, I build adios2 from source. I have attached the output.log from CMake build.

>> export CMAKE_PREFIX_PATH=$CMAKE_PREFIX_PATH:$CONDA_PREFIX
>> cmake -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX -DADIOS2_BUILD_EXAMPLES=ON ..
>> make -j12
>> make install

Running with UCX

Updating both files: SST filepath set to ../helloSst, plus io.set_parameter("DataTransport", "ucx")

Writer

>> mpirun -mca pml ucx -n 1 python3 sstWriter.py
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           gros-12
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
DP Writer 0 (0x2729870): UCX init Success
Rank= 0 loop index = 0 data = [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
Rank= 0 loop index = 1 data = [10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]
Rank= 0 loop index = 2 data = [20. 21. 22. 23. 24. 25. 26. 27. 28. 29.]
Rank= 0 loop index = 3 data = [30. 31. 32. 33. 34. 35. 36. 37. 38. 39.]

Reader (Gets all timesteps correctly)

>> mpirun -mca pml ucx -n 1 python3 sstReader.py
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           gros-12
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
DP Reader 0 (0x3163e30): UCX init Success
Rank= 0 loop index = 0 stream step = 0 data = [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
Rank= 0 loop index = 1 stream step = 1 data = [10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]
Rank= 0 loop index = 2 stream step = 2 data = [20. 21. 22. 23. 24. 25. 26. 27. 28. 29.]
Rank= 0 loop index = 3 stream step = 3 data = [30. 31. 32. 33. 34. 35. 36. 37. 38. 39.]

Stuck when using libfabric

Updating both files with io.set_parameter("DataTransport", "fabric") or io.set_parameter("DataTransport", "RDMA"). Here, the writer waits for the reader by default. Once the reader is started, the writer begins writing, but the reader gets stuck in the engine.get call (in this example, the stream.read call). The writer emits a warning when the reader is interrupted.

Writer

>> mpirun -mca btl ofi -n 1 python3 sstWriter.py
Rank= 0 loop index = 0 data = [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
Rank= 0 loop index = 1 data = [10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]
Rank= 0 loop index = 2 data = [20. 21. 22. 23. 24. 25. 26. 27. 28. 29.]
Rank= 0 loop index = 3 data = [30. 31. 32. 33. 34. 35. 36. 37. 38. 39.]
Writer 0 (0x2cb73e0): Got an unexpected connection close event

Reader

>> mpirun -mca btl ofi -n 1 python3 sstReader.py
# Stuck. No output.
# Keyboard interrupt

abhishek1297 avatar Mar 19 '24 12:03 abhishek1297

Hi Abhishek,

You need to tell adios what transport to use. It is not asking MPI what that is using.

Unfortunately, the end of the log does not report whether libfabric support was added. Our mistake.

The log says:

Libfabric support for the HPE CXI provider: FALSE

Can you check bpls -Vv and look for LIBFABRIC in the supported features?
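
That check can be scripted. The following is only a minimal sketch: it assumes `bpls` is on the PATH, the helper names `mentions_feature` and `bpls_mentions` are hypothetical, and it makes no assumption about the layout of the `bpls -Vv` output beyond searching it for a name.

```python
import shutil
import subprocess


def mentions_feature(text, feature):
    """Return True if `feature` occurs anywhere in `text`, ignoring case."""
    return feature.lower() in text.lower()


def bpls_mentions(feature):
    """Run `bpls -Vv` and search its combined stdout/stderr for `feature`."""
    bpls = shutil.which("bpls")
    if bpls is None:
        raise FileNotFoundError("bpls not found on PATH")
    result = subprocess.run([bpls, "-Vv"], capture_output=True, text=True)
    return mentions_feature(result.stdout + result.stderr, feature)
```

For example, `bpls_mentions("LIBFABRIC")` returns True only if that name shows up somewhere in the output.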

Thanks

pnorbert avatar Mar 19 '24 12:03 pnorbert

Hi,

LIBFABRIC does not appear in bpls -Vv output.

I am not sure I understand but how do we specify the transport in ADIOS2 exactly?

I see that libfabric support is a bit complicated, but I did get the "RDMA Transport for Staging: Available" message in the output log, and according to the documentation that means SST defaults to libfabric:

Configuration: ADIOS2 uses the CMake find_package() functionality to locate libfabric. CMake will automatically search system libraries, but if you need to specify a libfabric location other than in a default system location you can add a “-DLIBFABRIC_ROOT=” argument to direct CMake to libfabric’s location. If CMake finds libfabric, you should see the line “RDMA Transport for Staging: Available” near the end of the CMake output. This makes the RDMA DataTransport the default for SST data movement. (More information about SST engine parameters like DataTransport appears in the SST engine description.) If instead you see “RDMA Transport for Staging: Unconfigured”, RDMA will not be available to SST.

abhishek1297 avatar Mar 19 '24 13:03 abhishek1297

That doc was written before the UCX support was added. Since it found UCX, the RDMA transport is using that. For some reason the cmake config did not like the libfabric library it found.

pnorbert avatar Mar 19 '24 13:03 pnorbert

Unfortunately, it is the nature of libfabric that even if it is available at compile-time, SST may discover that the features available at run-time are not appropriate for our needs. Generally that determination is automatic: the transport is disabled at run-time and we fall back to something else.

But let's take a step back here. A couple of points: UCX is an RDMA transport. It's a relatively new addition to SST, and while our naming scheme isn't completely consistent, it's perfectly usable. The "RDMA Transport for Staging: Available" message appears whenever we find libfabric (previously our only direct-RDMA transport) or UCX. So, I'm not sure there's really a problem. Don't force the libfabric transport (which is still called "rdma", despite there being a UCX RDMA alternative) and you should be OK.

eisenhauer avatar Mar 19 '24 13:03 eisenhauer

On Perlmutter I got:

-- Found LIBFABRIC: /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so (Required is at least version "1.6")
-- Checking for module 'cray-drc'
--   No package 'cray-drc' found
-- Could NOT find CrayDRC (missing: CrayDRC_LIBRARIES)
-- Libfabric support for the HPE CXI provider: TRUE

But you got FALSE on your system. The test compile command for cmake/check_libfabric_cxi.c in cmake/DetectOptions.cmake:460 fails for you.

  if(LIBFABRIC_FOUND)
    set(ADIOS2_SST_HAVE_LIBFABRIC TRUE)
    find_package(CrayDRC)
    if(CrayDRC_FOUND)
      set(ADIOS2_SST_HAVE_CRAY_DRC TRUE)
    endif()

    try_compile(ADIOS2_SST_HAVE_CRAY_CXI
      ${ADIOS2_BINARY_DIR}/check_libfabric_cxi
      ${ADIOS2_SOURCE_DIR}/cmake/check_libfabric_cxi.c
      CMAKE_FLAGS
        "-DINCLUDE_DIRECTORIES=${LIBFABRIC_INCLUDE_DIRS}"
        "-DLINK_DIRECTORIES=${LIBFABRIC_LIBRARIES}")
    message(STATUS "Libfabric support for the HPE CXI provider: ${ADIOS2_SST_HAVE_CRAY_CXI}")
  endif()

pnorbert avatar Mar 19 '24 14:03 pnorbert

Okay. Thanks for the details.

CrayDRC might be the cause of this issue, but it is not present on the cluster. I suppose I will continue working with UCX RDMA.

abhishek1297 avatar Mar 19 '24 14:03 abhishek1297

Hi again,

I want to mention that even when I set ADIOS2_USE_UCX=OFF while also explicitly setting the LIBFABRIC_ROOT path (CMake finds it regardless), I get the RDMA Transport for Staging: Available message from CMake, and yet LIBFABRIC is not listed in the supported features. This might be misleading.

abhishek1297 avatar Mar 20 '24 10:03 abhishek1297

Indeed, I can't see it either. The libfabric option was taken out of the user options and now it does not appear in the list of features even when it is on.

You can see the RDMA Transport for Staging: Available message only if either UCX or LIBFABRIC is on.

As @eisenhauer explained, unfortunately, a successful build with LIBFABRIC does not guarantee that it will work properly. So you have it, but it hangs instead of functioning properly.

pnorbert avatar Mar 20 '24 20:03 pnorbert

Several action items here. One is that the "RDMA Transport for Staging" output needs to be more complex now that we've added more options. Probably it needs to be a list of possibly available RDMA transports, rather than just "Available". That would have at least made it clear that UCX was viable. Maybe we can also put that list in the bpls output.

eisenhauer avatar Mar 20 '24 20:03 eisenhauer

Is there any test or example for the usage of RDMA?

abhishek1297 avatar Mar 21 '24 08:03 abhishek1297

In an ideal world, using RDMA would be completely transparent to the user. You'd specify the SST engine for streaming between reader and writer jobs, start them up (presumably on the same cluster where they can use a shared RDMA network for connectivity), SST would connect them and RDMA would be used for the data transfers. You could verify that RDMA was selected by specifying the environment variable SstVerbose=1 or maybe 2, but otherwise you'd just see faster data transfer than you would if you were using TCP.
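
As a rough illustration of turning on that verbosity from a launcher script, here is a minimal sketch. The only thing taken from the discussion is the `SstVerbose` environment variable; the helper names `sst_verbose_env` and `run_verbose` are hypothetical.

```python
import os
import subprocess


def sst_verbose_env(level=1, base=None):
    """Copy an environment mapping and set SstVerbose to `level` as a string."""
    env = dict(os.environ if base is None else base)
    env["SstVerbose"] = str(level)
    return env


def run_verbose(cmd, level=1):
    """Run `cmd` with SstVerbose set in its environment; return the exit code."""
    return subprocess.run(cmd, env=sst_verbose_env(level)).returncode
```

A call such as `run_verbose(["mpirun", "-n", "1", "python3", "sstReader.py"], level=2)` would then run the reader with the extra SST logging enabled.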

In practice, things can be a bit more complex. Maybe you're in a batch-only environment, so you need an example batch script for Slurm or LSF (usually you just have to background one or more jobs in the script and wait for them at the end). But on some platforms the installed version of libfabric doesn't default to reasonable things and you have to specify environment variables to fix it up (Summit), or the libfabric module is incompatible with other normally-loaded modules (Titan), or the network doesn't let two different jobs talk to each other over the network (what the Cray DRC library above was meant to address), etc. Unfortunately that means that getting stuff to work on any specific machine can require a bit of sleuthing. We can provide some example batch scripts for machines that we've had access to (mostly US HPC platforms) but you might still have to do some digging (which we are happy to help with) on any other machine.
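
The "background the jobs and wait for them" pattern can also be driven from Python rather than a shell script. This is a minimal sketch under assumptions: `mpi_python` and `run_pair` are hypothetical helpers, the `mpirun -n` form is copied from the commands earlier in this thread, and the timeout is added here only so that a hang like the one reported shows up as an error instead of blocking forever.

```python
import subprocess
import sys


def mpi_python(script, nprocs=1):
    """Build an mpirun command that runs `script` with the current interpreter."""
    return ["mpirun", "-n", str(nprocs), sys.executable, script]


def run_pair(writer_cmd, reader_cmd, timeout=None):
    """Start writer and reader concurrently and return their exit codes.

    Each wait is given up to `timeout` seconds. If one expires, both
    processes are killed and subprocess.TimeoutExpired is re-raised.
    """
    procs = [subprocess.Popen(writer_cmd), subprocess.Popen(reader_cmd)]
    try:
        return tuple(p.wait(timeout=timeout) for p in procs)
    except subprocess.TimeoutExpired:
        for p in procs:
            p.kill()
            p.wait()
        raise
```

For instance, `run_pair(mpi_python("sstWriter.py"), mpi_python("sstReader.py"), timeout=120)` starts both sides and raises if either is still running after its wait expires.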

@pnorbert We need to expand our info in read-the-docs about running on specific machines. We've got a tiny bit, for example: https://adios2.readthedocs.io/en/v2.9.2/advanced/ecp_hardware.html But having a variety of example scripts that have worked on specific machines would be a significant help not only to users of those machines, but might give clues to folks trying to work through running on machines we don't have access to.

eisenhauer avatar Mar 21 '24 13:03 eisenhauer

Yes, the logs show that RDMA was picked up by SST. But that RDMA transport does not use UCX by default. If I do NOT set DataTransport=UCX on the writer's side, the reader gets blocked.

On the reader's side, whether DataTransport is set or not, it still receives the data as long as the writer is using UCX.

abhishek1297 avatar Mar 22 '24 13:03 abhishek1297

The SST reader will use the transport that the writer selected (it would be nice if they negotiated, but for various technical reasons, that's difficult). I would guess that the libfabric transport looks viable to SST, but then turns out not to function. Sorting out why might require both SST and libfabric verbosity to see what exactly is going on. It may be that libfabric claims to have a feature that turns out not to work or something like that. Libfabric is kind of a frankenstein of features...
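
Since the writer's choice decides the transport, pinning it there is the relevant step. A minimal sketch, assuming an `io` object with the `set_parameter` call used earlier in this thread plus a `set_engine` method (the helper name `configure_sst_writer` is hypothetical):

```python
def configure_sst_writer(io, transport="ucx"):
    """Select the SST engine and pin DataTransport on the writer's IO object.

    `io` is expected to expose set_engine() and set_parameter(); this
    helper only forwards those two calls and returns the same object.
    """
    io.set_engine("SST")
    io.set_parameter("DataTransport", transport)
    return io
```

The helper would be called on the writer's IO object before the stream is opened; per the comment above, the reader side should then follow whatever the writer selected.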

eisenhauer avatar Mar 22 '24 13:03 eisenhauer

This issue is stale because it has been 1 year with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions[bot] avatar Apr 16 '25 03:04 github-actions[bot]