
Install Request: VASP GPU version built with Cray PE

Open • heatherkellyucl opened this issue on 28 Mar 2022 • 29 comments

For EPSRC.

The Cray PE is available short-term in a Singularity container on Myriad for testing.

# helps set up Singularity env appropriately
module load singularity-env/1.0.0

singularity run /shared/ucl/apps/CrayPE/CrayPE

Try to build VASP!

heatherkellyucl avatar Mar 28 '22 16:03 heatherkellyucl

Looking inside the container, these modules are loaded by default and the craype modules are available:

singularity run /shared/ucl/apps/CrayPE/CrayPE
WARNING: Bind mount '/home/cceahke => /home/cceahke' overlaps container CWD /home/cceahke/vasp, may not be available
Running bash as a login shell ...
bash: /shared/ucl/apps/bin/defmods: No such file or directory
ERROR: Unable to locate a modulefile for 'ops-tools'
ERROR: Unable to locate a modulefile for 'htop'
Singularity> module list
Currently Loaded Modulefiles:
 1) gcc-libs/4.9.2      8) screen/4.8.0-ucl1  15) tmux/3.2a                     22) ops-tools/2.0.0                  29) cce/13.0.0(default)             
 2) cmake/3.21.1        9) gerun              16) mrxvt/0.5.4                   23) htop/1.0.3/gnu-4.9.2             30) craype/2.7.13(default)          
 3) flex/2.5.39        10) nano/2.4.2         17) userscripts/1.4.0             24) singularity-env/1.0.0            31) cray-mpich/8.1.12(default)      
 4) git/2.32.0         11) nedit/5.6-aug15    18) rcps-core/1.0.0               25) craype-x86-skylake               32) cray-libsci/21.08.1.2(default)  
 5) apr/1.7.0          12) dos2unix/7.3       19) compilers/intel/2018/update3  26) libfabric/1.13.1(default)        33) PrgEnv-cray/8.1.0(default)      
 6) apr-util/1.6.1     13) giflib/5.1.1       20) mpi/intel/2018/update3/intel  27) craype-network-ofi               
 7) subversion/1.14.1  14) emacs/26.3         21) default-modules/2018          28) perftools-base/21.12.0(default)  
Singularity> module avail craype
--------------------------------------------------------------- /opt/cray/pe/modulefiles ----------------------------------------------------------------
craype-dl-plugin-py3/21.02.1.3  craype/2.7.13(default)  

---------------------------------------------------- /opt/cray/pe/craype-targets/default/modulefiles ----------------------------------------------------
craype-accel-amd-gfx908  craype-broadwell           craype-network-ofi  craype-x86-cascadelake  craype-x86-milan  craype-x86-skylake  
craype-accel-host        craype-network-infiniband  craype-network-ucx  craype-x86-icelake      craype-x86-rome    
Singularity> module show craype-accel-host
-------------------------------------------------------------------
/opt/cray/pe/craype-targets/default/modulefiles/craype-accel-host:

conflict        craype-accel-nvidia35
conflict        craype-accel-nvidia52
conflict        craype-accel-nvidia60
conflict        craype-accel-nvidia70
conflict        craype-accel-nvidia80
conflict        craype-accel-amd-gfx906
conflict        craype-accel-amd-gfx908
conflict        craype-accel-amd-gfx90a
append-path     PE_PRODUCT_LIST CRAY_ACCEL
setenv          CRAY_ACCEL_TARGET host
setenv          CRAY_TCMALLOC_MEMFS_FORCE 1
setenv          CRAYPE_LINK_TYPE dynamic
module-whatis   {Sets options and paths required to build with cce for target=host. }
-------------------------------------------------------------------

craype-accel-nvidia80 doesn't seem to be there as a module, so I assume we want craype-accel-host and to build on an A100 node.

Maybe.

We could also set CRAY_ACCEL_TARGET ourselves.
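
For reference, a minimal sketch of doing that by hand, mirroring the variables the craype-accel-host modulefile sets above (nvidia80 as the target value is an assumption based on the usual craype-accel-nvidia80 module name; none of this is tested):

# Hypothetical: set the accelerator target ourselves instead of loading a craype-accel-* module
export CRAY_ACCEL_TARGET=nvidia80          # assumed value for the A100 nodes
export CRAY_TCMALLOC_MEMFS_FORCE=1
export CRAYPE_LINK_TYPE=dynamic
export PE_PRODUCT_LIST=${PE_PRODUCT_LIST:+$PE_PRODUCT_LIST:}CRAY_ACCEL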

heatherkellyucl avatar Mar 28 '22 16:03 heatherkellyucl

To see our actual modules and remove the errors above, add:

export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/shared/ucl/apps

(I am guessing we will need CUDA, not sure if this is going to work...)
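
Or equivalently, the bind can go on the run line itself:

singularity run --bind /shared/ucl/apps /shared/ucl/apps/CrayPE/CrayPE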

heatherkellyucl avatar Mar 29 '22 08:03 heatherkellyucl

Eeg, oh dear. (We also have no make inside the container).

Singularity> cc pi.c -o pi
pi.c:1:10: fatal error: 'stdio.h' file not found
#include <stdio.h>
         ^~~~~~~~~
1 error generated.

We also don't have the versionless symlinks (libm.so etc.) in /usr/lib64, which I assume installing the right development tools would set up.

heatherkellyucl avatar Mar 29 '22 09:03 heatherkellyucl

I've added:

dnf -y group install "Development Tools"

to the container and rebuilt it. A quick test shows that it can now compile and run a simple C program, and running make should also work. It has now been copied to:

/shared/ucl/apps/CrayPE/CrayPE

balston avatar Mar 30 '22 16:03 balston

Confirmed: I now have a C version of pi that uses the Cray libraries:

Singularity> ldd pi
        linux-vdso.so.1 (0x00007ffefcf4d000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f620bf6c000)
        libfi.so.1 => /opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libfi.so.1 (0x00007f620b661000)
        libquadmath.so.0 => /opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libquadmath.so.0 (0x00007f620b421000)
        libmodules.so.1 => /opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libmodules.so.1 (0x00007f620b205000)
        libcraymath.so.1 => /opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libcraymath.so.1 (0x00007f620af24000)
        libf.so.1 => /opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libf.so.1 (0x00007f620ac90000)
        libu.so.1 => /opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libu.so.1 (0x00007f620a97b000)
        libcsup.so.1 => /opt/cray/pe/cce/13.0.0/cce/x86_64/lib/libcsup.so.1 (0x00007f620a775000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f620a3b0000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f620c170000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f620a02e000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f6209e26000)
        libgfortran.so.5 => /opt/cray/pe/gcc-libs/libgfortran.so.5 (0x00007f6209978000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f6209758000)
        libstdc++.so.6 => /opt/cray/pe/gcc-libs/libstdc++.so.6 (0x00007f6209342000)
        libgcc_s.so.1 => /opt/cray/pe/gcc-libs/libgcc_s.so.1 (0x00007f6209129000)

heatherkellyucl avatar Mar 30 '22 16:03 heatherkellyucl

And a Fortran pi.

ftn pi.f90 -o pi

heatherkellyucl avatar Mar 30 '22 16:03 heatherkellyucl

The ARCHER2 docs have a makefile for building VASP 6.3.0 with the Cray PE and GCC (not CCE, and not with GPUs): https://github.com/hpc-uk/build-instructions/blob/main/apps/VASP/build_vasp_6.3.0_ARCHER2_GCC.md

We can try that before we get any other makefiles.
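
As far as I can tell, the gist of that recipe is that the makefile.include uses the Cray compiler wrappers (ftn, cc) rather than calling gcc or mpif90 directly, so cray-mpich and cray-libsci get linked in automatically. A quick sanity check we could do in the container before attempting the full build (a sketch only, not the ARCHER2 build itself):

module unload PrgEnv-cray
module load PrgEnv-gnu
ftn --version    # should report GNU Fortran via the Cray wrapper
cc --version     # should report gcc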

heatherkellyucl avatar Apr 01 '22 11:04 heatherkellyucl

I managed to compile serial and OpenMP code, and to compile and run an MPI binary on one core, but it's not clear what the equivalent of mpirun is (there's no mpirun/mpiexec/srun...).

owainkenwayucl avatar Apr 01 '22 13:04 owainkenwayucl

@balston I think we're missing the optional nvidia modules.

craype-accel-host is for running OpenACC code on the host CPU (OpenACC is essentially turned into OpenMP during compilation), so probably not what you are hoping for.

Typically you need the nvidia accel modules in order to get the right automatic build options for OpenACC on NVIDIA GPUs. My understanding is that the nvidia modules are optional during installation and not installed by default, which may be why they aren't in the container.
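
If the optional packages do get added, the expected usage would be roughly this (module names are assumptions based on standard Cray PE installations; neither is in the current container):

module load cudatoolkit              # Cray's CUDA module, name assumed
module load craype-accel-nvidia80    # would set CRAY_ACCEL_TARGET=nvidia80 for the A100s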

heatherkellyucl avatar Apr 04 '22 15:04 heatherkellyucl

I haven't had time to look into the missing nvidia modules today but will investigate tomorrow.

balston avatar Apr 05 '22 15:04 balston

I think I probably need to set up the WSL-2 environment where the container is being built with GPU support, as described here:

https://docs.nvidia.com/cuda/wsl-user-guide/index.html

before re-building the container for the Cray PE.

balston avatar Apr 08 '22 12:04 balston

Looking at the Cray PE GNU build of VASP 6.3.0

Singularity> module unload PrgEnv-cray
Singularity> module load PrgEnv-gnu/8.1.0
Loading craype/2.7.13
  ERROR: Conflicting 'PrgEnv-gnu/8.1.0' is loading

Loading PrgEnv-gnu/8.1.0
  ERROR: Load of requirement 'craype' failed

Whut?

Singularity> module show PrgEnv-gnu/8.1.0
-------------------------------------------------------------------
/opt/cray/pe/modulefiles/PrgEnv-gnu/8.1.0:

conflict        PrgEnv-amd
conflict        PrgEnv-aocc
conflict        PrgEnv-cray
conflict        PrgEnv-gnu
conflict        PrgEnv-intel
conflict        PrgEnv-nvidia
prepend-path    LD_LIBRARY_PATH /opt/cray/pe/lib64:/opt/cray/lib64
setenv          PE_ENV GNU
setenv          gcc_already_loaded 0
module          load gcc
module          switch cray-libsci cray-libsci/21.08.1.2
module          switch cray-mpich cray-mpich/8.1.12
module          load craype
module          load cray-mpich
module          load cray-libsci
setenv          CRAY_PRGENVGNU loaded
-------------------------------------------------------------------

heatherkellyucl avatar Apr 08 '22 15:04 heatherkellyucl

That module is doing rather a lot of scripting itself that I'd expect the module command to be taking care of for it.

heatherkellyucl avatar Apr 08 '22 16:04 heatherkellyucl

The container is set up wrong again, because in my real CentOS Stream 8 VM with the Cray PE installed I get:

export CRAY_ENABLE_PE=/etc/cray-pe.d/enable-pe.sh
. $CRAY_ENABLE_PE
module list
Currently Loaded Modulefiles:
  1) craype-x86-skylake       6) craype/2.7.13
  2) libfabric/1.13.1         7) cray-mpich/8.1.12
  3) craype-network-ofi       8) cray-libsci/21.08.1.2
  4) perftools-base/21.12.0   9) PrgEnv-cray/8.1.0
  5) cce/13.0.0

and then:

module unload PrgEnv-cray
module list
Currently Loaded Modulefiles:
  1) craype-x86-skylake       3) craype-network-ofi
  2) libfabric/1.13.1         4) perftools-base/21.12.0
module load PrgEnv-gnu/8.1.0
module list
Currently Loaded Modulefiles:
  1) craype-x86-skylake       6) craype/2.7.13
  2) libfabric/1.13.1         7) cray-mpich/8.1.12
  3) craype-network-ofi       8) cray-libsci/21.08.1.2
  4) perftools-base/21.12.0   9) PrgEnv-gnu/8.1.0
  5) gcc/11.2.0

balston avatar Apr 12 '22 13:04 balston

@balston What version of environment modules do you have in your VM? (Myriad has 3.2.6 and Young has 4.4.0; we also have other versions in /shared/ucl/apps/modules which are usable.) module -h should show it.

I saw a mention that the PrgEnv modules didn't work well with newer versions of environment modules at one point; not sure if that is still current.

heatherkellyucl avatar Apr 12 '22 14:04 heatherkellyucl

Inside the current CrayPE container on Myriad, it says 4.5.2.

heatherkellyucl avatar Apr 12 '22 14:04 heatherkellyucl

environment-modules-4.5.2-1.el8 is the default in CentOS 8 Stream. Let me see if I can install an earlier version in the container.

balston avatar Apr 12 '22 14:04 balston

Looks like 4.4.1 is available for CentOS 8 - I will try that version.

balston avatar Apr 12 '22 14:04 balston

Hmm, but if you can load and unload the PrgEnv modules in your VM with 4.5.2, that shouldn't be the problem.

heatherkellyucl avatar Apr 12 '22 14:04 heatherkellyucl

I'm trying to rebuild the container to test something and the build is failing. The CentOS 8 Stream Docker image is no longer working, probably due to:

https://pythonspeed.com/articles/centos-8-is-dead/

I'm trying a different Docker image to start from.

balston avatar Apr 12 '22 15:04 balston

Using CentOS 9 Stream as the base Docker image doesn't work either: the Cray PE fails to install with dependency errors:

python3 cpe-installer.py --cpu-target x86-icelake
Adding repo from: file:///mnt/CrayPE
created by dnf config-manager from file:///mnt/CrayPE                                   7.4 MB/s | 281 kB     00:00
Error:
 Problem: package cpe-apollo2000-base-content-21.12-21.12-09.x86_64 requires gdb4hpc-4.13.8 >= 20211201134032_f1d929e3-2.el8, but none of the providers can be installed
  - conflicting requests
  - nothing provides libpython3.6m.so.1.0()(64bit) needed by gdb4hpc-4.13.8-20211201134032_f1d929e3-2.el8.x86_64
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
Error: subprocess returned non-zero exit status
FATAL:   While performing build: while running engine: exit status 1

I will try using Red Hat's Universal Base Image (UBI) instead, as these are freely available: https://hub.docker.com/u/redhat

balston avatar Apr 12 '22 15:04 balston

The Red Hat UBI doesn't include the development tools and doesn't let you install them unless the system is registered.

balston avatar Apr 12 '22 15:04 balston

I tried an alternative image (docker pull rockylinux) as the base, and now I'm back to where we were: a container that runs but cannot do the module unloads/loads above.

balston avatar Apr 12 '22 16:04 balston

You can test whether the module version is the problem by doing this inside the container on Myriad for any of the versions we have, e.g.:

export PATH=/shared/ucl/apps/modules/4.7.0/bin:$PATH
export MODULESHOME=/shared/ucl/apps/modules/4.7.0

Then you should get ours first instead of the container's (as long as you bound /shared/ucl/apps, either with export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/shared/ucl/apps or on the singularity run line).
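
Putting the whole check together (using 4.7.0 just as an example version):

export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/shared/ucl/apps
singularity run /shared/ucl/apps/CrayPE/CrayPE

# inside the container:
export PATH=/shared/ucl/apps/modules/4.7.0/bin:$PATH
export MODULESHOME=/shared/ucl/apps/modules/4.7.0
module --version           # should now report 4.7.0 rather than the container's 4.5.2
module unload PrgEnv-cray
module load PrgEnv-gnu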

heatherkellyucl avatar Apr 12 '22 16:04 heatherkellyucl

Ian has found that if you run it with singularity shell instead, it uses a different (much older) version of the module command and everything works...

singularity shell /shared/ucl/apps/CrayPE/CrayPE

Singularity> module --version
VERSION=3.2.11.5
DATE=2021-03-14

AUTOLOADPATH=undef
BASEPREFIX="/opt/cray/pe/modules"
BEGINENV=99  
CACHE_AVAIL=undef
DEF_COLLATE_BY_NUMBER=undef
DOT_EXT=".ext"
EVAL_ALIAS=1
HAS_BOURNE_FUNCS=1
HAS_BOURNE_ALIAS=1
HAS_TCLXLIBS=undef
HAS_X11LIBS=undef
LMSPLIT_SIZE=99999
MODULEPATH="/opt/cray/pe/modulefiles:/opt/cray/modulefiles:/opt/modulefiles"
MODULES_INIT_DIR="/opt/cray/pe/modules/3.2.11.5/init"
PREFIX="/opt/cray/pe/modules/3.2.11.5"
TCL_VERSION="8.6"
TCL_PATCH_LEVEL="8.6.8"
TMP_DIR="/tmp"
USE_FREE=undef
VERSION_MAGIC=1
VERSIONPATH="/opt/cray/pe/modules/3.2.11.5"
WANTS_VERSIONING=1
WITH_DEBUG_INFO=undef

Then this works fine:

Singularity> module unload PrgEnv-cray
Singularity> module load PrgEnv-gnu

heatherkellyucl avatar Apr 22 '22 11:04 heatherkellyucl

I gave this a quick try, and found that the VASP build process also depends on rsync, which the container doesn't have installed.

ikirker avatar May 04 '22 19:05 ikirker

I'm adding rsync to the container.

balston avatar May 05 '22 13:05 balston

I think I've sorted out how to enable modules correctly in the container on Myriad, and how to get:

Singularity> module unload PrgEnv-cray
Singularity> module load PrgEnv-gnu

to work. I used:

export MODULE_VERSION=3.2.6
export MODULE_VERSION_STACK=3.2.6
export MODULESHOME=/shared/ucl/apps/modules/3.2.6/Modules/3.2.6
export MODULEPATH=/shared/ucl/apps/modulefiles/core:/shared/ucl/apps/modulefiles/applications:/shared/ucl/apps/modulefiles/libraries:/shared/ucl/apps/modulefiles/compilers:/shared/ucl/apps/modulefiles/development:/shared/ucl/apps/modulefiles/bundles

module ()
{
    eval `/shared/ucl/apps/modules/3.2.6/Modules/$MODULE_VERSION/bin/modulecmd bash $*`
}

source /etc/cray-pe.d/enable-pe.sh

Most of this was taken from: https://github.com/UCL-RITS/rcps-singularity-recipes/blob/master/xpra.def

I'm now going to put the above into the container definition and see if it still works.

balston avatar May 09 '22 13:05 balston

A new version of the Cray PE container is in:

/shared/ucl/apps/CrayPE/

on Myriad. It can be run using:

export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/shared/ucl/apps
singularity shell /shared/ucl/apps/CrayPE/CrayPE
. /usr/local/bin/module-setup

It includes rsync, so it may be able to build the CPU version of VASP. The .def file used to build the container is in the same directory.
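
A rough sketch of what a CPU build attempt might then look like inside the container, following the ARCHER2 instructions linked earlier (the VASP source path and the location of the makefile.include are placeholders; this has not been tried yet):

module unload PrgEnv-cray
module load PrgEnv-gnu
cd vasp.6.3.0                              # placeholder for wherever the VASP source is unpacked
cp /path/to/archer2/makefile.include .     # the makefile.include from the linked ARCHER2 doc
make std gam ncl                           # builds the standard, gamma-only and non-collinear binaries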

balston avatar Jun 06 '22 16:06 balston