Installation problem with HWLOC

Open OTTHao opened this issue 1 year ago • 12 comments

The newest release, openmpi-5.0.1, automatically installs the libraries. I wonder whether I can pass the --disable-rsmi parameter to the automatic configure step, because I don't have the rsmi library and don't need it. I tried to regenerate the configure file from configure.ac, but it failed.

OTTHao avatar Jan 08 '24 09:01 OTTHao

I'm sorry, I don't understand your question -- Open MPI has no --disable-rsmi configure parameter. Can you please provide all the information that was requested in the github issue template?

jsquyres avatar Jan 08 '24 18:01 jsquyres

@OTTHao Ping. Can you provide more information, per my above comment? Thanks.

jsquyres avatar Jan 30 '24 16:01 jsquyres

I'm sorry about that. I compiled hwloc manually and was busy with work after that. I plan to compile Open MPI today, and I'll share the information afterwards.

OTTHao avatar Jan 31 '24 01:01 OTTHao

error.zip The server does not support CUDA. I'm wondering whether I can disable rsmi when configuring Open MPI to avoid this problem.

OTTHao avatar Jan 31 '24 02:01 OTTHao

From your hwloc `config.log`:

configure:27839: checking for rocm_smi/rocm_smi.h
configure:27839: gcc -c -g -O2  -I/opt/rocm/rocm_smi/include/ conftest.c >&5
configure:27839: $? = 0
configure:27839: result: yes
configure:27847: checking for rsmi_init in -lrocm_smi64
configure:27870: gcc -o conftest -g -O2  -I/opt/rocm/rocm_smi/include/  -L/opt/rocm/rocm_smi/lib/ conftest.c -lrocm_smi64   >&5
configure:27870: $? = 0
configure:27880: result: yes
configure:27884: checking whether a program linked with -lrocm_smi64 can run
configure:27911: gcc -o conftest -g -O2  -I/opt/rocm/rocm_smi/include/  -L/opt/rocm/rocm_smi/lib/ conftest.c  -lrocm_smi64 >&5
configure:27911: $? = 0
configure:27911: ./conftest
./conftest: error while loading shared libraries: librocm_smi64.so.2: cannot open shared object file: No such file or directory

That strongly suggests the rocm_smi stuff is available on your build system, but /opt/rocm/rocm_smi/lib is not in your $LD_LIBRARY_PATH
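
As a minimal sketch of how one might act on this, the library directory from the `-L` flag in the log above can be prepended to `LD_LIBRARY_PATH` before re-running configure. The variable name `ROCM_SMI_LIB` is purely illustrative, and whether the library really lives in that directory on this machine is an assumption the snippet checks rather than takes for granted.

```shell
# Directory taken from the -L flag in the config.log excerpt above.
# ROCM_SMI_LIB is an illustrative variable name, not something hwloc reads.
ROCM_SMI_LIB=/opt/rocm/rocm_smi/lib

# Prepend it, keeping any existing value and avoiding a trailing colon.
export LD_LIBRARY_PATH="${ROCM_SMI_LIB}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# Report whether the library named in the error message is actually there.
if [ -e "${ROCM_SMI_LIB}/librocm_smi64.so.2" ]; then
    echo "found librocm_smi64.so.2 in ${ROCM_SMI_LIB}"
else
    echo "librocm_smi64.so.2 is not in ${ROCM_SMI_LIB}"
fi
```

If the second message is printed, the path in the log is not where the shared library lives, and the export alone will not make the run-time check pass.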

ggouaillardet avatar Jan 31 '24 02:01 ggouaillardet

Okay, I think I understand the problem now. That said, I would still recommend adding a --disable-rsmi option to Open MPI's configure.

OTTHao avatar Jan 31 '24 02:01 OTTHao

On second thought, it seems the real issue is that ROCm is indeed detected and flagged as usable, but it cannot be used because its API is incompatible with hwloc 2.7.1.

As a workaround, you can try to export enable_rsmi=no before invoking configure. That being said, the cleanest option for now is to build your own hwloc and have Open MPI use it (e.g. configure --with-hwloc=...).
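
The two options above could be sketched roughly as follows. This assumes the variable is spelled `enable_rsmi` (matching hwloc's `--disable-rsmi` option) and that the embedded hwloc configure picks it up from the environment; neither is confirmed in this thread. `HWLOC_PREFIX` and its default value are hypothetical placeholders for wherever a self-built hwloc was installed.

```shell
# Option 1 (workaround): pre-set the variable so the bundled hwloc
# skips ROCm SMI. Assumes the spelling enable_rsmi.
export enable_rsmi=no

# Option 2 (cleaner): point Open MPI at a separately built hwloc.
# HWLOC_PREFIX is a hypothetical install location; adjust to your system.
HWLOC_PREFIX="${HWLOC_PREFIX:-$HOME/opt/hwloc}"

# Only run configure when inside a source tree and the prefix exists;
# otherwise just print the command that would be run.
if [ -x ./configure ] && [ -d "${HWLOC_PREFIX}" ]; then
    ./configure --with-hwloc="${HWLOC_PREFIX}"
else
    echo "would run: ./configure --with-hwloc=${HWLOC_PREFIX}"
fi
```

In practice only one of the two options should be needed; they are shown together here only to keep the sketch short.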

ggouaillardet avatar Jan 31 '24 05:01 ggouaillardet

@bgoglin In this environment, ROCm is detected but cannot be used (that sounds like an incompatible API).

Is this something you are aware of? Has it been fixed in the hwloc v2 series (e.g. does hwloc now support this API, or does configure disable ROCm support)?

topology-rsmi.c: In function ‘get_device_xgmi_hive_id’:
topology-rsmi.c:193:27: warning: implicit declaration of function ‘rsmi_dev_xgmi_hive_id_get’; did you mean ‘rsmi_dev_unique_id_get’? [-Wimplicit-function-declaration]
  193 |   rsmi_status_t rsmi_rc = rsmi_dev_xgmi_hive_id_get(dv_ind, &hive_id);
      |                           ^~~~~~~~~~~~~~~~~~~~~~~~~
      |                           rsmi_dev_unique_id_get
topology-rsmi.c: At top level:
topology-rsmi.c:215:36: error: unknown type name ‘RSMI_IO_LINK_TYPE’
  215 |                                    RSMI_IO_LINK_TYPE *type, uint64_t *hops)
      |                                    ^~~~~~~~~~~~~~~~~
topology-rsmi.c: In function ‘hwloc_rsmi_discover’:
topology-rsmi.c:344:9: error: unknown type name ‘RSMI_IO_LINK_TYPE’
  344 |         RSMI_IO_LINK_TYPE type;
      |         ^~~~~~~~~~~~~~~~~
topology-rsmi.c:348:14: warning: implicit declaration of function ‘get_device_io_link_type’; did you mean ‘get_device_pci_linkspeed’? [-Wimplicit-function-declaration]
  348 |         if ((get_device_io_link_type(i, j, &type, &hops) == 0) &&
      |              ^~~~~~~~~~~~~~~~~~~~~~~
      |              get_device_pci_linkspeed
topology-rsmi.c:349:22: error: ‘RSMI_IOLINK_TYPE_XGMI’ undeclared (first use in this function); did you mean ‘RSMI_CLK_TYPE_MEM’?
  349 |             (type == RSMI_IOLINK_TYPE_XGMI)) {
      |                      ^~~~~~~~~~~~~~~~~~~~~
      |                      RSMI_CLK_TYPE_MEM
topology-rsmi.c:349:22: note: each undeclared identifier is reported only once for each function it appears in
make[3]: *** [hwloc_rsmi_la-topology-rsmi.lo] Error 1

ggouaillardet avatar Jan 31 '24 05:01 ggouaillardet

Hello. The XGMI API seems to be still available in latest ROCm. So maybe the ROCm API is very old instead? What ROCm version is this? IIRC this was added in 3.6 released 4 years ago.

bgoglin avatar Feb 01 '24 08:02 bgoglin

I am wondering whether this is a ROCm 6.0 issue. The location of the rocm_smi.h file is not /opt/rocm/rocm_smi/include/, but /opt/rocm/include/rocm_smi/, and the library is in /opt/rocm/lib (not /opt/rocm/rocm_smi/lib). That change was made in the ROCm 5.x series, but some soft links were kept for backward compatibility. These backward-compatibility links have been removed from ROCm 6.0 on. I will check.
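
A quick way to see which of the two layouts a given machine has might look like the sketch below. `ROCM_ROOT` is an illustrative variable name, and the full header paths are assembled from the directories mentioned above; they are assumptions, not verified against this reporter's system.

```shell
# Candidate header locations: the older layout first, then the newer one.
# ROCM_ROOT is an illustrative variable name defaulting to /opt/rocm.
ROCM_ROOT="${ROCM_ROOT:-/opt/rocm}"
found=none
for hdr in "${ROCM_ROOT}/rocm_smi/include/rocm_smi/rocm_smi.h" \
           "${ROCM_ROOT}/include/rocm_smi/rocm_smi.h"; do
    # -f follows symlinks, so a backward-compatibility link also counts.
    if [ -f "${hdr}" ]; then
        found="${hdr}"
        break
    fi
done
echo "rocm_smi.h: ${found}"
```

A result in the first location does not by itself show whether it is a real file or one of the compatibility links; `ls -l` on the reported path would tell them apart.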

edgargabriel avatar Feb 01 '24 14:02 edgargabriel

@edgargabriel I installed ROCm 6.0.2 (rocm-core and rocm-smi-libs ubuntu22 packages) on my laptop. hwloc detects and builds fine on top of it. Support for the new ROCm locations was actually added to hwloc in 2.9.1. Older releases just don't detect/enable ROCm at all. @OTTHao which ROCm version are you using?

bgoglin avatar Feb 07 '24 10:02 bgoglin

@bgoglin you are correct, it works with ROCm 6.0.2. ROCm 6.0.0 and 6.0.1 unfortunately had a bug in the rocm_smi.h header file that prevented compilation of hwloc (or any C code, for that matter; it worked with C++). It would be good to know which ROCm version is being used in this report.

edgargabriel avatar Feb 07 '24 14:02 edgargabriel

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions[bot] avatar Feb 21 '24 17:02 github-actions[bot]

Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!

github-actions[bot] avatar Mar 06 '24 17:03 github-actions[bot]