
MPI jobs fail with intel toolchains after upgrade of EL8 Linux from 8.5 to 8.6

Open OleHolmNielsen opened this issue 2 years ago • 32 comments

I'm testing the upgrade of our compute nodes from AlmaLinux 8.5 to 8.6 (a RHEL 8 clone, similar to Rocky Linux).

We have found that all MPI codes built with any of the Intel toolchains intel/2020b or intel/2021b fail after the 8.5 to 8.6 upgrade. The codes also fail on login nodes, so the Slurm queue system is not involved. The FOSS toolchains foss/2020b and foss/2021b work perfectly on EL 8.6, however.

My simple test uses the attached trivial MPI Hello World code running on a single node:

$ module load intel/2021b
$ mpicc mpi_hello_world.c
$ mpirun ./a.out
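
(The attached mpi_hello_world.c itself is not reproduced in this issue; any trivial MPI program along these lines will do, for example this sketch of an equivalent test:)

/* mpi_hello_world.c - minimal MPI test program (a sketch equivalent to the attached file) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}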

Now the mpirun command enters an infinite loop (running for many minutes), and we see these processes with "ps":

/bin/sh /home/modules/software/impi/2021.4.0-intel-compilers-2021.4.0/mpi/2021.4.0/bin/mpirun ./a.out
mpiexec.hydra ./a.out

The mpiexec.hydra process doesn't respond to 15/SIGTERM and I have to kill it with 9/SIGKILL. I've tried to enable debugging output with

export I_MPI_HYDRA_DEBUG=1
export I_MPI_DEBUG=5

but nothing gets printed from this.

Question: Has anyone tried an EL 8.6 Linux with the Intel toolchain and mpiexec.hydra? Can you suggest how I may debug this issue?

OS information:

$ cat /etc/redhat-release
AlmaLinux release 8.6 (Sky Tiger)
$ uname -r
4.18.0-372.9.1.el8.x86_64

OleHolmNielsen avatar Jun 09 '22 09:06 OleHolmNielsen

Quoting some discussion we've had on this in Slack:

No, it's not a glibc issue AFAICS. If you use a RHEL 8.5 kernel (on an otherwise up-to-date RHEL 8.6 system), Intel MPI works.

ocaisa avatar Jun 09 '22 09:06 ocaisa

Yikes...

@OleHolmNielsen Have you been in touch with Intel support on this?

@rscohn2 Any thoughts on this?

boegel avatar Jun 09 '22 09:06 boegel

I didn't know that this issue is related to the updated RHEL 8.6 kernel, so I haven't contacted Intel support yet. I've never been in touch with Intel compiler/library support before, so if someone else knows how to do that, could you kindly open an issue with them? Thanks, Ole

OleHolmNielsen avatar Jun 09 '22 09:06 OleHolmNielsen

We ran into a silent hang issue several years ago too, details in https://github.com/hpcugent/vsc-mympirun/issues/74

Any luck w.r.t. getting output when using mpirun -d?

boegel avatar Jun 09 '22 09:06 boegel

It seems (although nothing is mentioned in the kernel release notes) that the NUMA information exposed by the kernel has changed. Intel MPI before version 2021.6.0 gets stuck. Using pstack, one can see that the processes seem to hang in an infinite loop somewhere around ipl_detect_machine_topology. That happens even before mpiexec.hydra tries to do anything with the binary to be launched (be it a.out or hostname).

daRecall avatar Jun 09 '22 09:06 daRecall

@boegel mpiexec.hydra does not know the -d parameter:

$> mpirun -d -np 2 hostname
[mpiexec@nrm095.hpc.itc.rwth-aachen.de] match_arg (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:91): unrecognized argument d
[mpiexec@nrm095.hpc.itc.rwth-aachen.de] Similar arguments:
[mpiexec@nrm095.hpc.itc.rwth-aachen.de]          membind
[mpiexec@nrm095.hpc.itc.rwth-aachen.de]          debug
[mpiexec@nrm095.hpc.itc.rwth-aachen.de]          dac
[mpiexec@nrm095.hpc.itc.rwth-aachen.de]          disable-x
[mpiexec@nrm095.hpc.itc.rwth-aachen.de]          demux
[mpiexec@nrm095.hpc.itc.rwth-aachen.de] HYD_arg_parse_array (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:128): argument matching returned error
[mpiexec@nrm095.hpc.itc.rwth-aachen.de] mpiexec_get_parameters (../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1356): error parsing input array
[mpiexec@nrm095.hpc.itc.rwth-aachen.de] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1749): error parsing parameters

It does know --debug, but the only thing you see is the launched command:

$> mpiexec.hydra --debug -np 2 hostname
[mpiexec@nrm095.hpc.itc.rwth-aachen.de] Launch arguments: /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/impi/2021.2.0-intel-compilers-2021.2.0/mpi/2021.2.0/bin//hydra_bstrap_proxy --upstream-host nrm095.hpc.itc.rwth-aachen.de --upstream-port 44829 --pgid 0 --launcher ssh --launcher-number 0 --base-path /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/impi/2021.2.0-intel-compilers-2021.2.0/mpi/2021.2.0/bin/ --tree-width 16 --tree-level 1 --time-left -1 --launch-type 2 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/impi/2021.2.0-intel-compilers-2021.2.0/mpi/2021.2.0/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9

daRecall avatar Jun 09 '22 09:06 daRecall

Looking at https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/bug-mpiexec-segmentation-fault/m-p/1183364, you can influence this with

I_MPI_HYDRA_TOPOLIB=ipl

(Ha, look who is posting the last comment in that link)

ocaisa avatar Jun 09 '22 09:06 ocaisa

Using impi 2021.6.0, everything is working:

$> mpiexec.hydra --version
Intel(R) MPI Library for Linux* OS, Version 2021.6 Build 20220227 (id: 28877f3f32)
Copyright 2003-2022, Intel Corporation.

$> mpiexec.hydra -np 2 hostname
nrm095.hpc.itc.rwth-aachen.de
nrm095.hpc.itc.rwth-aachen.de

daRecall avatar Jun 09 '22 09:06 daRecall

@ocaisa That doesn't change anything; it keeps looping in the same function.

daRecall avatar Jun 09 '22 09:06 daRecall

Ah yes, I forgot: we have had an issue open with Intel (case# 05472393) regarding this problem. Their first comment was, as usual, "are you trying the newest version?". I did not visit ISC this year, but some of my colleagues did, and they talked to some Intel people directly. The outcome was that with RHEL 8.6 and newer, old Intel MPI versions no longer work.

daRecall avatar Jun 09 '22 10:06 daRecall

Is there any chance that Red Hat will accept a bug report about the older Intel MPI versions not working? This would require a deeper understanding of which change in the new kernel breaks Intel MPI, so documenting the bug might be a challenge...

OleHolmNielsen avatar Jun 09 '22 10:06 OleHolmNielsen

@OleHolmNielsen Kernel updates that break userspace are frowned upon, so you can try to open a bug report with Red Hat. They will at some point ask you for what they need, or point you to the release notes describing what changed and broke this. They will probably blame Intel (and it sounds like Intel already fixed it, but doesn't want to backport the fix).

stdweird avatar Jun 09 '22 11:06 stdweird

@stdweird Yes, but how do we get any error messages from mpiexec.hydra which can be reported to Red Hat?

OleHolmNielsen avatar Jun 09 '22 11:06 OleHolmNielsen

@OleHolmNielsen The error you need to report is that an application hangs after the upgrade to RHEL 8.6. You can already add what was said here (i.e. it works on 8.5, and pstack points to the ipl code, so they have some idea which direction to look in).

stdweird avatar Jun 09 '22 11:06 stdweird

@stdweird Thanks for the info. I have made this test:

$ module load iimpi/2021b
$ module list

Currently Loaded Modules:
  1) GCCcore/11.2.0               5) numactl/2.0.14-GCCcore-11.2.0
  2) zlib/1.2.11-GCCcore-11.2.0   6) UCX/1.11.2-GCCcore-11.2.0
  3) binutils/2.37-GCCcore-11.2.0 7) impi/2021.4.0-intel-compilers-2021.4.0
  4) intel-compilers/2021.4.0     8) iimpi/2021b

$ which mpiexec
/home/modules/software/impi/2021.4.0-intel-compilers-2021.4.0/mpi/2021.4.0/bin/mpiexec

$ mpiexec.hydra --version

Now I can execute pstack on the process PID:

$ pstack 717906
#0  0x000000000045009a in ipl_get_exclude_mask (str=, mask=, maxcpu=) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:529
#1  IPL_init_numa_nodes (ncpu=26320320, n_avail_cpu=1) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:715
#2  0x000000000044b347 in ipl_detect_machine_topology () at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:1614
#3  0x0000000000449bc8 in ipl_processor_info (info=0x1919dc0, pid=0x1, detect_platform_only=26320320) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:1908
#4  0x000000000044c282 in ipl_entrance (detect_platform_only=26320320) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_main.c:38
#5  0x000000000041b958 in i_set_core_and_thread_count () at ../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec_params.h:284
#6  mpiexec_get_parameters (t_argv=0x1919dc0) at ../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1266
#7  0x00000000004049fb in main (argc=26320320, argv=0x1) at ../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1763

Do we agree that this is the issue which I should report to Red Hat?

Thanks, Ole

OleHolmNielsen avatar Jun 09 '22 12:06 OleHolmNielsen

@OleHolmNielsen The issue to report to Red Hat is that your application is hanging after an upgrade. RH has no knowledge about Intel MPI itself (and they will most likely not provide a solution, only an explanation).

stdweird avatar Jun 09 '22 12:06 stdweird

I have created an issue in the Red Hat Bugzilla: "[Bug 2095281] New: Intel MPI mpiexec.hydra hangs after upgrade to RHEL 8.6". This bug is unfortunately not accessible to others because it relates to the kernel.

OleHolmNielsen avatar Jun 09 '22 12:06 OleHolmNielsen

AFAIK, you can add anyone (with their email) to the report, so that they can also read it...

truatpasteurdotfr avatar Jun 09 '22 14:06 truatpasteurdotfr

If anyone would like their e-mail to be added to Red Hat bug 2095281, you can ask me to do it.

OleHolmNielsen avatar Jun 10 '22 05:06 OleHolmNielsen

If there's a regression in the RHEL kernel topology information, you may want to compare the output of lstopo before and after the upgrade.

bgoglin avatar Jun 10 '22 07:06 bgoglin

@bgoglin I took an EL85 node and copied the output of lstopo to a file. Then I upgraded the node to EL86 and rebooted. The EL86 lstopo output is 100% identical to that of EL85.

OleHolmNielsen avatar Jun 13 '22 06:06 OleHolmNielsen

The Intel MPI Release Notes at https://www.intel.com/content/www/us/en/developer/articles/release-notes/mpi-library-release-notes-linux.html don't mention any bugs related to mpiexec.hydra; there's only a terse "Bug fixes" line.
I have not been able to locate the mentioned Intel case# 05472393. It would seem that going forward with EL 8.6, we can no longer use the older Intel MPI libraries prior to 2021.6. So much for all the EasyBuild modules based on intel toolchains which we have already installed :-(

OleHolmNielsen avatar Jun 13 '22 10:06 OleHolmNielsen

I received a response in Red Hat bug 2095281:

I agree that it looks like the kernel should be blamed too, but this is not necessarily true.

Finally: in any case, the application is buggy. It should not spin in an infinite loop anyway. According to pstack it doesn't hang in a syscall, and this is what we need to investigate first, IMO. Until then it is absolutely unclear how we can find the root of the problem, or _if_ the kernel is wrong at all.

In short: IMO, this is a user-space bug no matter what.

So the conclusion is that Intel MPI prior to 2021.6 is buggy. We cannot use older Intel MPI versions on EL 8.6 kernels then :-(

If no workaround is found, it seems that all EB modules iimpi/* prior to 2021.6 have to be discarded after we upgrade from EL 8.5 to 8.6.

OleHolmNielsen avatar Jun 17 '22 06:06 OleHolmNielsen

Or the impi in the installed iimpi and intel toolchains could be updated in place to 2021.6 (I'm not happy with that workaround, but I see no better alternative).

boegel avatar Jun 17 '22 15:06 boegel

That should only be done as a per-site initiative, I think.

akesandgren avatar Jun 17 '22 16:06 akesandgren

For the record: when I load the module iimpi/2021b on an EL 8.6 node running kernel 4.18.0-372.9.1.el8.x86_64, mpiexec.hydra enters an infinite loop while reading /sys/devices/system/node/node0/cpulist, as seen with strace:

$ strace -f -e file mpiexec.hydra --version
(many lines deleted)
openat(AT_FDCWD, "/sys/devices/system/cpu", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
openat(-1, "/sys/devices/system/cpu/possible", O_RDONLY) = 3
openat(AT_FDCWD, "/sys/devices/system/node/node0/cpulist", O_RDONLY) = 3
(Now I type Ctrl-C)
^C--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
strace: Process 4960 detached

After rebooting the node with the EL 8.5 kernel 4.18.0-348.23.1.el8_5.x86_64 the mpiexec.hydra works correctly.
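
To see for yourself what "size" the running kernel reports for that sysfs file, a small test along these lines can help. This is only an illustrative sketch (the file name comes from the strace output above; the fseek()/ftell() check mirrors the ftell-based size detection referred to in Intel's analysis quoted further down in this thread):

/* sysfs_size.c - print the size that the running kernel reports for the
 * sysfs file mpiexec.hydra loops on, both via stat() and via fseek()/ftell().
 * Illustrative sketch only. Build: gcc -o sysfs_size sysfs_size.c
 */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    const char *path = "/sys/devices/system/node/node0/cpulist";
    struct stat sb;
    FILE *f = fopen(path, "r");

    if (!f || stat(path, &sb) != 0) {
        perror(path);
        return 1;
    }

    /* Size as seen via fseek()/ftell(). */
    fseek(f, 0, SEEK_END);
    long ftell_size = ftell(f);
    rewind(f);

    printf("stat st_size = %lld, st_blksize = %ld, fseek/ftell size = %ld\n",
           (long long) sb.st_size, (long) sb.st_blksize, ftell_size);

    /* The file still has readable content regardless of the size reported. */
    char buf[256];
    if (fgets(buf, sizeof buf, f) != NULL)
        printf("contents: %s", buf);
    fclose(f);
    return 0;
}

Comparing the output on an 8.5 kernel and an 8.6 kernel should show directly whether the reported size changed, independent of Intel MPI.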

I've now built the EB module iimpi/2022.05 which contains the latest Intel MPI module:

$ ml

Currently Loaded Modules:
  1) GCCcore/11.3.0               5) numactl/2.0.14-GCCcore-11.3.0
  2) zlib/1.2.12-GCCcore-11.3.0   6) UCX/1.12.1-GCCcore-11.3.0
  3) binutils/2.38-GCCcore-11.3.0 7) impi/2021.6.0-intel-compilers-2022.1.0
  4) intel-compilers/2022.1.0     8) iimpi/2022.05

Running this module on the EL 8.6 node with kernel 4.18.0-372.9.1.el8.x86_64, mpiexec.hydra works correctly (as observed by others):

$ mpiexec.hydra --version
Intel(R) MPI Library for Linux* OS, Version 2021.6 Build 20220227 (id: 28877f3f32)
Copyright 2003-2022, Intel Corporation.

OleHolmNielsen avatar Jun 20 '22 09:06 OleHolmNielsen

One additional piece of information concerns the Intel MKL library: I've built the latest EB module imkl/2022.1.0, which includes an HPL benchmark executable .../modules/software/imkl/2022.1.0/mkl/2022.1.0/benchmarks/linpack/xlinpack_xeon64

Running the MKL 2022.1.0 xlinpack_xeon64 executable also results in multiple copies of mpiexec.hydra stuck in infinite loops, just like with Intel MPI prior to 2021.6.

I think there exists a newer MKL 2022.2.0, but I don't know how to make an EB module with it for testing - can anyone help?

OleHolmNielsen avatar Jun 21 '22 13:06 OleHolmNielsen

I think there exists a newer MKL 2022.2.0, but I don't know how to make an EB module with it for testing - can anyone help?

I see 2022.1.0 on https://www.intel.com/content/www/us/en/developer/articles/tool/oneapi-standalone-components.html#inpage-nav-9-7

This is the easyconfig that you've tested: https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/i/imkl/imkl-2022.1.0.eb - to update it you would change:

source_urls = ['https://registrationcenter-download.intel.com/akdlm/irc_nas/18721/']
sources = ['l_onemkl_p_%(version)s.223_offline.sh']

with the relevant source url and source for the offline Linux installer.

branfosj avatar Jun 21 '22 13:06 branfosj

I have built the intel/2022a toolchain with EB 4.6.0, and I can confirm that with the new module impi/2021.6.0-intel-compilers-2022.1.0 the above issue with all previous Intel MPI versions has been resolved:

$ module load impi/2021.6.0-intel-compilers-2022.1.0
$ mpiexec.hydra --version
Intel(R) MPI Library for Linux* OS, Version 2021.6 Build 20220227 (id: 28877f3f32)
Copyright 2003-2022, Intel Corporation.

Of course, we still face the issue that all software modules using an Intel MPI module prior to 2021.6.0 are broken on EL8 systems running the latest kernel.

OleHolmNielsen avatar Jul 11 '22 08:07 OleHolmNielsen

We got some feedback from Intel:

The issue was analyzed and the root cause was found. In RHEL 8.6 and other OSes with recent kernel versions, system files are reported to have 0 bytes size. In previous kernel versions ftell was reporting size == blocksize != 0.

Using size == 0 led to a memory leak with the known consequences.

I have written a small workaround library that can be used with LD_PRELOAD. This lib will use an "adapted" version of ftell for the startup of IMPI. Once the program is started there should be no issue. It is also possible to switch off LD_PRELOAD for the user mpi program.

If this form of workaround is acceptable and you are willing to test it I can attach it to this issue.

Preferred methodology is, however, to use the newest version of IMPI.

daRecall avatar Aug 08 '22 11:08 daRecall
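
For illustration, an LD_PRELOAD interposer along the lines Intel describes might look roughly like the sketch below. The actual workaround library was not attached to this thread, so the choice of wrapping ftell() and the block-size heuristic are assumptions based on the description above, not Intel's implementation:

/* ftell_shim.c - rough sketch of the kind of LD_PRELOAD workaround described
 * above (the real library was not posted here, so the wrapped function and
 * the heuristic below are assumptions). Idea: when a stream reports size 0
 * but the underlying file has a non-zero block size (as the affected sysfs
 * files do on RHEL 8.6 kernels), return the block size instead, which is what
 * older kernels reported and what pre-2021.6 hydra expects.
 *
 * Build: gcc -shared -fPIC -o ftell_shim.so ftell_shim.c -ldl
 * Use:   LD_PRELOAD=$PWD/ftell_shim.so mpiexec.hydra -np 2 hostname
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/stat.h>

long ftell(FILE *stream)
{
    static long (*real_ftell)(FILE *);
    if (!real_ftell)
        real_ftell = (long (*)(FILE *)) dlsym(RTLD_NEXT, "ftell");

    long pos = real_ftell(stream);
    if (pos != 0)
        return pos;

    /* Heuristic: a position of 0 combined with st_size == 0 matches the sysfs
     * behaviour of the affected kernels; fall back to the block size. */
    struct stat sb;
    if (fstat(fileno(stream), &sb) == 0 && sb.st_size == 0 && sb.st_blksize > 0)
        return (long) sb.st_blksize;

    return 0;
}

As Intel notes, the preferred fix remains upgrading to the newest IMPI; an interposer like this would only paper over the size detection during hydra startup.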