[MPICH] Build for device ch4

Open simonbyrne opened this issue 2 years ago • 16 comments

cc @giordano

simonbyrne avatar Mar 07 '23 18:03 simonbyrne

Should we switch to ch4: https://github.com/JuliaParallel/MPI.jl/issues/720#issuecomment-1458606823?

giordano avatar Mar 07 '23 18:03 giordano

I guess? I don't really know what it is, but if that is what is being maintained we should switch.

simonbyrne avatar Mar 07 '23 18:03 simonbyrne

Uhm, ch4 is actually the default. We switched to ch3 in #3185, specifically in https://github.com/JuliaPackaging/Yggdrasil/commit/0104fa326a72bfc28c6376f665f4fe0f23471cec, but I can't really tell why :confused:

giordano avatar Mar 07 '23 20:03 giordano

Let's try it and see!

simonbyrne avatar Mar 07 '23 20:03 simonbyrne

Well, it turns out that compiling ch4 with --enable-fast=all,O3 is horribly slow: some jobs are even timing out after 3 hours. In particular, compiling src/mpi/coll/mpir_coll.c alone takes literally hours. By contrast, the debug builds I did in https://github.com/JuliaParallel/MPI.jl/issues/720#issuecomment-1458832642 were pretty fast (~6 minutes).

giordano avatar Mar 08 '23 00:03 giordano

And compilation actually fails on many platforms, in many different ways: choking on broken assembly, triggering internal compiler errors (ICEs), not finding header files, and so on. This looks quite broken.

giordano avatar Mar 08 '23 07:03 giordano

We copied --enable-fast=all,O3 from Homebrew, but they also use --enable-g=dbg, which is the option I used in https://github.com/JuliaParallel/MPI.jl/issues/720#issuecomment-1458832642. I wonder whether that keeps compilation from going completely haywire. We had some problems with debug builds in the past (see #4071 and #4039), but that specific problem should have been fixed by https://github.com/pmodels/mpich/pull/5720. Alternatively, we could try removing O3, keeping the default optimisation level O2, and see what happens in that case.
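
For reference, the three flag combinations mentioned above can be written down side by side. This is only a sketch: `mpich_fast_flags` and its variant names are hypothetical labels for illustration, not part of the Yggdrasil recipe or of MPICH, and the function only prints flags rather than running a build.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the Yggdrasil recipe): map a variant name
# to the configure flags discussed in this thread.
mpich_fast_flags() {
    case "$1" in
        copied)   echo "--enable-fast=all,O3" ;;                # what the recipe copied from Homebrew
        homebrew) echo "--enable-fast=all,O3 --enable-g=dbg" ;; # the pair Homebrew is said to use
        default)  echo "--enable-fast=all" ;;                   # O3 removed, default optimisation kept
        *)        echo "unknown variant: $1" >&2; return 1 ;;
    esac
}
```

Something like `./configure $(mpich_fast_flags default)` would then try the variant without O3; whether any of these actually shortens the ch4 build is exactly what still needs testing.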

giordano avatar Mar 08 '23 11:03 giordano

Do you want to try building with a different compiler? Maybe this resolves the long build times.

eschnett avatar Mar 08 '23 18:03 eschnett

We're already using three different compilers because of libgfortran (and for libgfortran 4 there isn't much choice: there's only gcc 7).

giordano avatar Mar 08 '23 18:03 giordano

Ch4 is too finicky; compilation goes wrong in too many ways. I'm going to merge #6357 instead and we'll play with compiler flags another time.

giordano avatar Mar 09 '23 01:03 giordano

Ok, as suggested in https://github.com/pmodels/mpich/issues/6434, --enable-fast=O3,ndebug --with-device=ch4 has a much more reasonable build time (we're still suffering from very slow downloads on CI, so the total duration of the jobs is often much longer than the build itself).

There are a couple of problems:

  • it doesn't build on FreeBSD because asm/types.h is missing, so we might have to use ch3 on this platform
  • the size of libmpi increases 10x, from 5 MB to 50 MB

@simonbyrne @eschnett what do you suggest we do? Go with ch4 (except on FreeBSD) and take the better performance, or stay with ch3 and keep a much smaller library?
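
The "ch4 except on FreeBSD" option could look roughly like the sketch below. The function names and the triplet strings are hypothetical, and passing the same --enable-fast value to ch3 is an assumption rather than something tested in this thread. The script only prints the flags and does not run configure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick the MPICH device from a target triplet string.
# FreeBSD falls back to ch3 because the ch4 build fails there
# (asm/types.h is missing).
mpich_device_for_target() {
    case "$1" in
        *freebsd*) echo "ch3" ;;
        *)         echo "ch4" ;;
    esac
}

# Assemble the configure flags for a target.
# Assumption: the same --enable-fast value is used for ch3 and ch4.
mpich_configure_flags() {
    echo "--enable-fast=O3,ndebug --with-device=$(mpich_device_for_target "$1")"
}

# Example: print the flags for two illustrative triplets.
mpich_configure_flags "x86_64-linux-gnu"
mpich_configure_flags "x86_64-unknown-freebsd"
```

Keeping the device choice in one small function means the FreeBSD exception stays visible and is easy to drop if the missing header gets sorted out later.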

giordano avatar Mar 09 '23 23:03 giordano

I have no idea! I wonder what causes the increase in file size. Is it pulling in other libraries?

simonbyrne avatar Mar 10 '23 00:03 simonbyrne

I'm leaning towards not doing this: MPICH_jll can remain a simple package; if users want something more performant, they probably want to use system MPI anyway. @vchuravy opinions?

giordano avatar Mar 22 '23 15:03 giordano

I am skewed towards providing a more performant default, so ch4 gets my vote.

vchuravy avatar Mar 22 '23 16:03 vchuravy

I would install MPICH in the same way Debian, Ubuntu, or Homebrew install it. That would be a reasonable expectation to have. Alternatively, we can switch to using OpenMPI by default instead.

Ubuntu's MPICH recipe contains these lines:

	dh_auto_configure -- $(extra_flags) CPPFLAGS="" CFLAGS="" CXXFLAGS="" FFLAGS="$(FFLAGS)" FCFLAGS="$(FFLAGS)" BASH_SHELL=/bin/bash
	( cd src/pm/hydra && ./configure --with-hwloc-prefix=/usr $(DEVICE) FFLAGS="$(FFLAGS)"  --prefix=/usr )

so apparently there's a thing called hydra that is installed separately -- I don't know why. Apart from this Ubuntu doesn't use any special options and doesn't choose a device.

I would not provide a slow (what does this mean?) MPI implementation by default. I would also not start by expecting people to switch to a system MPI. If they have a laptop or a workstation, or even a small cluster, then Julia's default install should do something reasonable, good enough for interactive use and getting people hooked on Julia+MPI. In my mind this means that shared memory + Gigabit ethernet should work efficiently by default.

eschnett avatar Mar 22 '23 18:03 eschnett

> so apparently there's a thing called hydra that is installed separately -- I don't know why. Apart from this Ubuntu doesn't use any special options and doesn't choose a device.

hydra is the MPICH launcher.

simonbyrne avatar Mar 23 '23 16:03 simonbyrne

#10249 switched to ch4

vchuravy avatar Feb 28 '25 19:02 vchuravy