rcps-buildscripts
Install Request: VASP GPU version built with Cray PE
For EPSRC.
The Cray PE is available short-term in a Singularity container on Myriad for testing.
# helps set up Singularity env appropriately
module load singularity-env/1.0.0
singularity run /shared/ucl/apps/CrayPE/CrayPE
Try to build VASP!
If I look inside the container, I see the following, and the craype modules are available in there:
singularity run /shared/ucl/apps/CrayPE/CrayPE
WARNING: Bind mount '/home/cceahke => /home/cceahke' overlaps container CWD /home/cceahke/vasp, may not be available
Running bash as a login shell ...
bash: /shared/ucl/apps/bin/defmods: No such file or directory
ERROR: Unable to locate a modulefile for 'ops-tools'
ERROR: Unable to locate a modulefile for 'htop'
Singularity> module list
Currently Loaded Modulefiles:
1) gcc-libs/4.9.2 8) screen/4.8.0-ucl1 15) tmux/3.2a 22) ops-tools/2.0.0 29) cce/13.0.0(default)
2) cmake/3.21.1 9) gerun 16) mrxvt/0.5.4 23) htop/1.0.3/gnu-4.9.2 30) craype/2.7.13(default)
3) flex/2.5.39 10) nano/2.4.2 17) userscripts/1.4.0 24) singularity-env/1.0.0 31) cray-mpich/8.1.12(default)
4) git/2.32.0 11) nedit/5.6-aug15 18) rcps-core/1.0.0 25) craype-x86-skylake 32) cray-libsci/21.08.1.2(default)
5) apr/1.7.0 12) dos2unix/7.3 19) compilers/intel/2018/update3 26) libfabric/1.13.1(default) 33) PrgEnv-cray/8.1.0(default)
6) apr-util/1.6.1 13) giflib/5.1.1 20) mpi/intel/2018/update3/intel 27) craype-network-ofi
7) subversion/1.14.1 14) emacs/26.3 21) default-modules/2018 28) perftools-base/21.12.0(default)
Singularity> module avail craype
--------------------------------------------------------------- /opt/cray/pe/modulefiles ----------------------------------------------------------------
craype-dl-plugin-py3/21.02.1.3 craype/2.7.13(default)
---------------------------------------------------- /opt/cray/pe/craype-targets/default/modulefiles ----------------------------------------------------
craype-accel-amd-gfx908 craype-broadwell craype-network-ofi craype-x86-cascadelake craype-x86-milan craype-x86-skylake
craype-accel-host craype-network-infiniband craype-network-ucx craype-x86-icelake craype-x86-rome
Singularity> module show craype-accel-host
-------------------------------------------------------------------
/opt/cray/pe/craype-targets/default/modulefiles/craype-accel-host:
conflict craype-accel-nvidia35
conflict craype-accel-nvidia52
conflict craype-accel-nvidia60
conflict craype-accel-nvidia70
conflict craype-accel-nvidia80
conflict craype-accel-amd-gfx906
conflict craype-accel-amd-gfx908
conflict craype-accel-amd-gfx90a
append-path PE_PRODUCT_LIST CRAY_ACCEL
setenv CRAY_ACCEL_TARGET host
setenv CRAY_TCMALLOC_MEMFS_FORCE 1
setenv CRAYPE_LINK_TYPE dynamic
module-whatis {Sets options and paths required to build with cce for target=host. }
-------------------------------------------------------------------
craype-accel-nvidia80 doesn't seem to be there as a module, so I assume we want craype-accel-host and to build on an A100 node. Maybe. We could also set CRAY_ACCEL_TARGET ourselves.
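As a minimal sketch of those two options (untested, and the nvidia80 value is a guess based on the module names above):
# option 1: use the accel module that is actually present in the container
module load craype-accel-host
# option 2 (guess): set the accelerator target by hand for an A100 build,
# mirroring what a craype-accel-nvidia80 module would presumably set
export CRAY_ACCEL_TARGET=nvidia80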
To see our actual modules and remove the errors above, add:
export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/shared/ucl/apps
(I am guessing we will need CUDA; not sure if this is going to work...)
Eeg, oh dear. (We also have no make inside the container.)
Singularity> cc pi.c -o pi
pi.c:1:10: fatal error: 'stdio.h' file not found
#include <stdio.h>
^~~~~~~~~
1 error generated.
We also don't have the libm.so etc. versionless symlinks in /usr/lib64, which I assume installing the right devtools would set up.
I've added:
dnf -y group install "Development Tools"
to the container and rebuilt it. A quick test shows that it can now compile and run a simple C program, and that running make should work. It has now been copied to:
/shared/ucl/apps/CrayPE/CrayPE
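The sort of quick test meant here would be, inside the container (pi.c being any small test program):
cc pi.c -o pi && ./pi    # the Cray wrapper can now find the system headers
make --version           # make is present after the Development Tools install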
Confirmed, I now have a C version of pi that uses the Cray libraries:
Singularity> ldd pi
linux-vdso.so.1 (0x00007ffefcf4d000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f620bf6c000)
libfi.so.1 => /opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libfi.so.1 (0x00007f620b661000)
libquadmath.so.0 => /opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libquadmath.so.0 (0x00007f620b421000)
libmodules.so.1 => /opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libmodules.so.1 (0x00007f620b205000)
libcraymath.so.1 => /opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libcraymath.so.1 (0x00007f620af24000)
libf.so.1 => /opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libf.so.1 (0x00007f620ac90000)
libu.so.1 => /opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libu.so.1 (0x00007f620a97b000)
libcsup.so.1 => /opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libcsup.so.1 (0x00007f620a775000)
libc.so.6 => /lib64/libc.so.6 (0x00007f620a3b0000)
/lib64/ld-linux-x86-64.so.2 (0x00007f620c170000)
libm.so.6 => /lib64/libm.so.6 (0x00007f620a02e000)
librt.so.1 => /lib64/librt.so.1 (0x00007f6209e26000)
libgfortran.so.5 => /opt/cray/pe/gcc-libs/libgfortran.so.5 (0x00007f6209978000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f6209758000)
libstdc++.so.6 => /opt/cray/pe/gcc-libs/libstdc++.so.6 (0x00007f6209342000)
libgcc_s.so.1 => /opt/cray/pe/gcc-libs/libgcc_s.so.1 (0x00007f6209129000)
And a Fortran pi.
ftn pi.f90 -o pi
ARCHER2 docs have a makefile for building VASP 6.3.0 with the CrayPE and GCC (not CCE or with GPUs): https://github.com/hpc-uk/build-instructions/blob/main/apps/VASP/build_vasp_6.3.0_ARCHER2_GCC.md
We can try that before we get any other makefiles.
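Very roughly, the build flow in those instructions looks like this (the makefile.include path is a placeholder; the real file is in the ARCHER2 repo linked above):
module unload PrgEnv-cray
module load PrgEnv-gnu
cd vasp.6.3.0
# use the ARCHER2 makefile.include, which points the build at the cc/ftn
# compiler wrappers and at cray-libsci rather than MKL
cp /path/to/archer2/makefile.include makefile.include
make std gam ncl    # standard, gamma-only and non-collinear binaries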
I managed to compile serial and OpenMP stuff, and I managed to compile and run an MPI binary on one core, but it's not clear what the equivalent of mpirun is (there's no mpirun/mpiexec/srun...).
@balston I think we're missing the optional nvidia modules.
craype-accel-host is for running OpenACC code on the host CPU (OpenACC is essentially turned into OpenMP during compilation), so probably not what you are hoping for.
Typically you need the nvidia accel modules in order to get the right automatic build options for OpenACC on Nvidia GPUs. My understanding is that the nvidia modules are optional during installation and not installed by default, so this may be why they aren't in the container.
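If the optional nvidia packages do get installed, the expected usage would presumably be (module name taken from the conflict list in craype-accel-host above, so treat as an assumption):
module load craype-accel-nvidia80   # A100 target; not present in the current container
echo $CRAY_ACCEL_TARGET             # would then presumably report nvidia80 instead of host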
I haven't had time to look into the missing nvidia modules today but will investigate tomorrow.
I think I probably need to set up the WSL-2 environment where the container is being built with GPU support as described here:
https://docs.nvidia.com/cuda/wsl-user-guide/index.html
before re-building the container for the Cray PE.
Looking at the Cray PE GNU build of VASP 6.3.0
Singularity> module unload PrgEnv-cray
Singularity> module load PrgEnv-gnu/8.1.0
Loading craype/2.7.13
ERROR: Conflicting 'PrgEnv-gnu/8.1.0' is loading
Loading PrgEnv-gnu/8.1.0
ERROR: Load of requirement 'craype' failed
Whut?
Singularity> module show PrgEnv-gnu/8.1.0
-------------------------------------------------------------------
/opt/cray/pe/modulefiles/PrgEnv-gnu/8.1.0:
conflict PrgEnv-amd
conflict PrgEnv-aocc
conflict PrgEnv-cray
conflict PrgEnv-gnu
conflict PrgEnv-intel
conflict PrgEnv-nvidia
prepend-path LD_LIBRARY_PATH /opt/cray/pe/lib64:/opt/cray/lib64
setenv PE_ENV GNU
setenv gcc_already_loaded 0
module load gcc
module switch cray-libsci cray-libsci/21.08.1.2
module switch cray-mpich cray-mpich/8.1.12
module load craype
module load cray-mpich
module load cray-libsci
setenv CRAY_PRGENVGNU loaded
-------------------------------------------------------------------
That module is doing rather a lot of scripting itself that I'd expect the module command to take care of for it.
The container is set up wrong again because in my real CentOS Stream release 8 VM with the Cray PE installed:
export CRAY_ENABLE_PE=/etc/cray-pe.d/enable-pe.sh
. $CRAY_ENABLE_PE
module list
Currently Loaded Modulefiles:
1) craype-x86-skylake 6) craype/2.7.13
2) libfabric/1.13.1 7) cray-mpich/8.1.12
3) craype-network-ofi 8) cray-libsci/21.08.1.2
4) perftools-base/21.12.0 9) PrgEnv-cray/8.1.0
5) cce/13.0.0
and then:
module unload PrgEnv-cray
module list
Currently Loaded Modulefiles:
1) craype-x86-skylake 3) craype-network-ofi
2) libfabric/1.13.1 4) perftools-base/21.12.0
module load PrgEnv-gnu/8.1.0
module list
Currently Loaded Modulefiles:
1) craype-x86-skylake 6) craype/2.7.13
2) libfabric/1.13.1 7) cray-mpich/8.1.12
3) craype-network-ofi 8) cray-libsci/21.08.1.2
4) perftools-base/21.12.0 9) PrgEnv-gnu/8.1.0
5) gcc/11.2.0
@balston What version of environment modules do you have in your VM? (Myriad is 3.2.6 and Young is 4.4.0 - we do have different versions in /shared/ucl/apps/modules which are usable). module -h should show it.
I saw a mention that the PrgEnv modules didn't work well at one point with newer versions of environment modules - not sure if still current.
Inside the current CrayPE container on Myriad, it says 4.5.2.
environment-modules-4.5.2-1.el8 is the default in CentOS 8 Stream. Let me see if I can install an earlier version in the container.
Looks like 4.4.1 is available for CentOS 8 - I will try that version.
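Presumably something along these lines during the container build (the exact package spec is a guess):
dnf -y remove environment-modules          # drop the default 4.5.2
dnf -y install environment-modules-4.4.1   # assumed name-version spec for the older release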
Hmm, but if you can load and unload the PrgEnv modules in your VM with 4.5.2, that shouldn't be the problem.
I'm trying to rebuild the container to try something out and it is failing to build. The CentOS 8 Stream Docker image is no longer working, probably due to:
https://pythonspeed.com/articles/centos-8-is-dead/
I'm trying a different Docker image to start from.
Using CentOS 9 Stream as the base Docker image doesn't work, as the Cray PE fails to install with dependency errors:
python3 cpe-installer.py --cpu-target x86-icelake
Adding repo from: file:///mnt/CrayPE
created by dnf config-manager from file:///mnt/CrayPE 7.4 MB/s | 281 kB 00:00
Error:
Problem: package cpe-apollo2000-base-content-21.12-21.12-09.x86_64 requires gdb4hpc-4.13.8 >= 20211201134032_f1d929e3-2.el8, but none of the providers can be installed
- conflicting requests
- nothing provides libpython3.6m.so.1.0()(64bit) needed by gdb4hpc-4.13.8-20211201134032_f1d929e3-2.el8.x86_64
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
Error: subprocess returned non-zero exit status
FATAL: While performing build: while running engine: exit status 1
I will try using Red Hat's Universal Base Images as these are freely available - https://hub.docker.com/u/redhat
The Red Hat UBI doesn't include the developer tools and doesn't allow you to install them without the system being registered.
Tried an alternative image (docker pull rockylinux) as the base, and now I'm back to where we were: a container that runs but cannot do the above module unloads/loads.
You can test if the module version is the problem by doing this inside the container on Myriad for any of the versions we have, eg:
export PATH=/shared/ucl/apps/modules/4.7.0/bin:$PATH
export MODULESHOME=/shared/ucl/apps/modules/4.7.0
Then you should get our one first instead of the container's (as long as you bound /shared/ucl/apps, either with:
export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/shared/ucl/apps
or on the singularity run line).
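The run-line form of the bind would be, for example:
singularity run -B /shared/ucl/apps /shared/ucl/apps/CrayPE/CrayPE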
Ian has found that if you run it with singularity shell instead, it uses a different (much older) version of the module command and all works...
singularity shell /shared/ucl/apps/CrayPE/CrayPE
Singularity> module --version
VERSION=3.2.11.5
DATE=2021-03-14
AUTOLOADPATH=undef
BASEPREFIX="/opt/cray/pe/modules"
BEGINENV=99
CACHE_AVAIL=undef
DEF_COLLATE_BY_NUMBER=undef
DOT_EXT=".ext"
EVAL_ALIAS=1
HAS_BOURNE_FUNCS=1
HAS_BOURNE_ALIAS=1
HAS_TCLXLIBS=undef
HAS_X11LIBS=undef
LMSPLIT_SIZE=99999
MODULEPATH="/opt/cray/pe/modulefiles:/opt/cray/modulefiles:/opt/modulefiles"
MODULES_INIT_DIR="/opt/cray/pe/modules/3.2.11.5/init"
PREFIX="/opt/cray/pe/modules/3.2.11.5"
TCL_VERSION="8.6"
TCL_PATCH_LEVEL="8.6.8"
TMP_DIR="/tmp"
USE_FREE=undef
VERSION_MAGIC=1
VERSIONPATH="/opt/cray/pe/modules/3.2.11.5"
WANTS_VERSIONING=1
WITH_DEBUG_INFO=undef
Then this works fine:
Singularity> module unload PrgEnv-cray
Singularity> module load PrgEnv-gnu
Gave this a quick try, and the VASP build process also depends on rsync, which the container doesn't have installed. I'm adding rsync to the container.
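In the container definition that is presumably just one more package install alongside the Development Tools group, e.g.:
dnf -y install rsync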
I think I've sorted out how to enable modules correctly in the container on Myriad and how to get:
Singularity> module unload PrgEnv-cray
Singularity> module load PrgEnv-gnu
to work. I used:
export MODULE_VERSION=3.2.6
export MODULE_VERSION_STACK=3.2.6
export MODULESHOME=/shared/ucl/apps/modules/3.2.6/Modules/3.2.6
export MODULEPATH=/shared/ucl/apps/modulefiles/core:/shared/ucl/apps/modulefiles/applications:/shared/ucl/apps/modulefiles/libraries:/shared/ucl/apps/modulefiles/compilers:/shared/ucl/apps/modulefiles/development:/shared/ucl/apps/modulefiles/bundles
module ()
{
eval `/shared/ucl/apps/modules/3.2.6/Modules/$MODULE_VERSION/bin/modulecmd bash $*`
}
source /etc/cray-pe.d/enable-pe.sh
Most of this was taken from: https://github.com/UCL-RITS/rcps-singularity-recipes/blob/master/xpra.def
I'm now going to put the above into the container definition and see if it still works.
A new version of the Cray PE container is in:
/shared/ucl/apps/CrayPE/
on Myriad. It can be run using:
export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/shared/ucl/apps
singularity shell /shared/ucl/apps/CrayPE/CrayPE
. /usr/local/bin/module-setup
It includes rsync, so it may be able to build the CPU version of VASP. The .def file used to build the container is in the same directory.
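For orientation, the .def presumably ends up with roughly this shape, collecting the pieces discussed above (this is a guess at the layout, not the actual file - see the copy in the same directory for the real thing):
Bootstrap: docker
From: rockylinux          # base image from earlier in the thread; tag not specified here

%files
    module-setup /usr/local/bin/module-setup   # the helper sourced above (source path assumed)

%post
    dnf -y group install "Development Tools"
    dnf -y install rsync
    # Cray PE installer steps as described earlier in the thread go here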