ompi
Installation problem with HWLOC
The newest openmpi-501 automatically installs the libraries. I wonder whether I can pass the --disable-rsmi parameter to the automatic configure command, because I don't have the rsmi library and I don't need it. I tried to regenerate the configure file from configure.ac, but it failed.
I'm sorry, I don't understand your question. Open MPI has no --disable-rsmi configure parameter. Can you please provide all the information that was requested in the GitHub issue template?
@OTTHao Ping. Can you provide more information, per my above comment? Thanks.
I'm sorry about that. I compiled hwloc manually and was busy with work after that. I plan to compile Open MPI today, and I'll send you the information later.
error.zip
The server does not support CUDA. Can I disable rsmi when configuring Open MPI to avoid this problem?
From your hwloc config.log:
configure:27839: checking for rocm_smi/rocm_smi.h
configure:27839: gcc -c -g -O2 -I/opt/rocm/rocm_smi/include/ conftest.c >&5
configure:27839: $? = 0
configure:27839: result: yes
configure:27847: checking for rsmi_init in -lrocm_smi64
configure:27870: gcc -o conftest -g -O2 -I/opt/rocm/rocm_smi/include/ -L/opt/rocm/rocm_smi/lib/ conftest.c -lrocm_smi64 >&5
configure:27870: $? = 0
configure:27880: result: yes
configure:27884: checking whether a program linked with -lrocm_smi64 can run
configure:27911: gcc -o conftest -g -O2 -I/opt/rocm/rocm_smi/include/ -L/opt/rocm/rocm_smi/lib/ conftest.c -lrocm_smi64 >&5
configure:27911: $? = 0
configure:27911: ./conftest
./conftest: error while loading shared libraries: librocm_smi64.so.2: cannot open shared object file: No such file or directory
That strongly suggests the rocm_smi stuff is available on your build system, but /opt/rocm/rocm_smi/lib is not in your $LD_LIBRARY_PATH.
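If that diagnosis holds, one way to test it is to put the directory on the loader path before rerunning configure. The helper below is only a sketch: the function name prepend_ld_path is made up for illustration, and the directory is the one that appears in the config.log above.

```shell
# Prepend a directory to LD_LIBRARY_PATH unless it is already listed.
# prepend_ld_path is a hypothetical helper name, not part of any tool.
prepend_ld_path() {
    dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;  # already present, nothing to do
        *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Directory taken from the -L flag in the config.log excerpt.
prepend_ld_path /opt/rocm/rocm_smi/lib
echo "$LD_LIBRARY_PATH"
```

Run it in the same shell session that later invokes configure, so the "can run" check sees the updated path.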
Okay, I think I understand the problem. Also, I still recommend adding a --disable-rsmi option to Open MPI's configure.
On second thought, it seems the real issue is that ROCm is indeed detected and flagged as usable, but it cannot be used because its API is incompatible with hwloc 2.7.1.
As a workaround, you can try to export enable_rsmi=no before invoking configure.
That being said, the cleanest option for now is to build your own hwloc and have Open MPI use it (e.g. configure --with-hwloc=...).
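A rough sketch of that last option follows. Everything here is an assumption for illustration: the three paths are hypothetical placeholders, the helper names run, build_hwloc and build_ompi are invented, and the sketch assumes your hwloc release accepts --disable-rsmi in its own configure script.

```shell
#!/bin/sh
# Sketch only: every path below is a hypothetical placeholder.
HWLOC_SRC="${HWLOC_SRC:-$HOME/src/hwloc}"        # unpacked hwloc source tree
HWLOC_PREFIX="${HWLOC_PREFIX:-$HOME/opt/hwloc}"  # where hwloc gets installed
OMPI_SRC="${OMPI_SRC:-$HOME/src/openmpi}"        # unpacked Open MPI source tree

# Print the command when DRY_RUN is non-empty, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Build and install hwloc with the ROCm SMI backend switched off.
build_hwloc() (
    run cd "$HWLOC_SRC" &&
    run ./configure --prefix="$HWLOC_PREFIX" --disable-rsmi &&
    run make &&
    run make install
)

# Build Open MPI against that external hwloc instead of the bundled copy.
build_ompi() (
    run cd "$OMPI_SRC" &&
    run ./configure --with-hwloc="$HWLOC_PREFIX" &&
    run make &&
    run make install
)
```

Setting DRY_RUN=1 before calling build_hwloc or build_ompi prints the commands without running them, which is a cheap way to check the flags first.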
@bgoglin In this environment, ROCm is detected but cannot be used (that sounds like an incompatible API).
Is this something you are aware of?
Has it been fixed in the hwloc v2 series? (e.g. does hwloc support this API, or does configure disable ROCm support?)
topology-rsmi.c: In function ‘get_device_xgmi_hive_id’:
topology-rsmi.c:193:27: warning: implicit declaration of function ‘rsmi_dev_xgmi_hive_id_get’; did you mean ‘rsmi_dev_unique_id_get’? [-Wimplicit-function-declaration]
193 | rsmi_status_t rsmi_rc = rsmi_dev_xgmi_hive_id_get(dv_ind, &hive_id);
| ^~~~~~~~~~~~~~~~~~~~~~~~~
| rsmi_dev_unique_id_get
topology-rsmi.c: At top level:
topology-rsmi.c:215:36: error: unknown type name ‘RSMI_IO_LINK_TYPE’
215 | RSMI_IO_LINK_TYPE *type, uint64_t *hops)
| ^~~~~~~~~~~~~~~~~
topology-rsmi.c: In function ‘hwloc_rsmi_discover’:
topology-rsmi.c:344:9: error: unknown type name ‘RSMI_IO_LINK_TYPE’
344 | RSMI_IO_LINK_TYPE type;
| ^~~~~~~~~~~~~~~~~
topology-rsmi.c:348:14: warning: implicit declaration of function ‘get_device_io_link_type’; did you mean ‘get_device_pci_linkspeed’? [-Wimplicit-function-declaration]
348 | if ((get_device_io_link_type(i, j, &type, &hops) == 0) &&
| ^~~~~~~~~~~~~~~~~~~~~~~
| get_device_pci_linkspeed
topology-rsmi.c:349:22: error: ‘RSMI_IOLINK_TYPE_XGMI’ undeclared (first use in this function); did you mean ‘RSMI_CLK_TYPE_MEM’?
349 | (type == RSMI_IOLINK_TYPE_XGMI)) {
| ^~~~~~~~~~~~~~~~~~~~~
| RSMI_CLK_TYPE_MEM
topology-rsmi.c:349:22: note: each undeclared identifier is reported only once for each function it appears in
make[3]: *** [hwloc_rsmi_la-topology-rsmi.lo] Error 1
Hello. The XGMI API still seems to be available in the latest ROCm, so maybe the installed ROCm is very old instead? What ROCm version is this? IIRC this API was added in 3.6, released 4 years ago.
I am wondering whether this is a ROCm 6.0 issue. The location of the rocm_smi.h file is not /opt/rocm/rocm_smi/include/, but /opt/rocm/include/rocm_smi/, and the library is in /opt/rocm/lib (not /opt/rocm/rocm_smi/lib). That change was made in the ROCm 5.x series, but some soft links were kept for backward compatibility. These backward compatibility links have been removed from ROCm 6.0 onward. I will check.
@edgargabriel I installed ROCm 6.0.2 (rocm-core and rocm-smi-libs ubuntu22 packages) on my laptop. hwloc detects it and builds fine on top of it. Support for the new ROCm locations was actually added to hwloc in 2.9.1. Older releases just don't detect or enable ROCm at all. @OTTHao which ROCm version are you using?
@bgoglin you are correct, it works with ROCm 6.0.2. ROCm 6.0.0 and 6.0.1 unfortunately had a bug in the rocm_smi.h header file that prevented compilation of hwloc (or any C code for that matter; it worked with C++). It would be good to know which ROCm version is being used in this report.
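To see which of the directory layouts discussed above is actually present on a machine, something like the following could help. It is only a sketch: find_rsmi is a hypothetical helper name, and /opt/rocm is simply the root that appears in the logs in this thread.

```shell
# List the ROCm SMI header and library files found under a ROCm root.
# find_rsmi is a hypothetical helper name; -L makes find follow symlinks,
# so backward compatibility links show up in the output as well.
find_rsmi() {
    root="${1:-/opt/rocm}"
    [ -d "$root" ] || return 0   # nothing to report if the root is missing
    find -L "$root" \( -name 'rocm_smi.h' -o -name 'librocm_smi64.so*' \) 2>/dev/null || true
}

find_rsmi /opt/rocm
```

Paths under rocm_smi/include and rocm_smi/lib point to the older layout (or its compatibility links), while paths under include/rocm_smi and lib point to the newer one.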
It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.
Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.
I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!