rcps-buildscripts

Install Request: Quantum Espresso 7.3 GPU and CPU variants

Open heatherkellyucl opened this issue 1 year ago • 35 comments

IN:06165073

Recently on the Quantum Espresso mailing list a group posted impressive performance with the GPU version of the software.

They used the exact same GPUs that are available on the Young cluster. Would it be possible for you to compile the GPU-enabled 7.2 version of the software and make it available via module load?

Spack 0.20 has 7.1 with cuda variant available. (Might be a straightforward update to get it to build 7.2, might not).
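
For reference, a minimal Spack sketch of that route might look like the lines below (the package and variant names are taken from the Spack 0.20 quantum-espresso package; cuda_arch=80 is my assumption to match the A100-class cards, and building 7.2 would need the package recipe updated first):

# Hypothetical starting point via Spack (not a tested command line)
spack install quantum-espresso@7.1 +mpi +cuda cuda_arch=80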

heatherkellyucl avatar Sep 01 '23 14:09 heatherkellyucl

Kai and I have been helping a user on Young [IN06562363] get a working GPU build of the latest Quantum ESPRESSO, and we also have a Myriad user wanting it [IN06570525]. Since we have had to build it ourselves to work out how to make it work, it makes sense to provide this as a central install on both clusters.

balston avatar Apr 16 '24 15:04 balston

The current latest version is 7.3.1, so I've updated the title.

balston avatar Apr 16 '24 15:04 balston

My build, done on Young, is currently running part of the test suite in an interactive session on a Young GPU node with 1 GPU and 4 MPI procs. I'm running the tests using:

module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9

# To allow the test suite to run
module load python3/recommended

cd /qe-7.3.1-GitHub/test-suite
make run-tests-pw NPROCS=4 2>&1  | tee ../../run-tests-pw.log

balston avatar Apr 16 '24 15:04 balston

I'm currently using the following to build QUANTUM Espresso 7.3.1 on Young. Note the build must be done on a GPU node and not on the login nodes:

module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9

cd ./qe-7.3.1-GitHub
./configure --prefix=XXX/quantum-espresso/7.3.1  --with-cuda=/shared/ucl/apps/nvhpc/2022_221/Linux_x86_64/22.1/cuda  --with-cuda-runtime=11.7 --with-cuda-cc=80 --enable-openmp --with-cuda-mpi=yes
make all
make install
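
A quick sanity check that the resulting pw.x is actually GPU-enabled, run on the GPU node (test.in below is a placeholder input file, not part of the build script):

# Minimal check, assuming the build tree's bin/ directory and a small test input
mpirun -np 1 ./bin/pw.x -in test.in | grep -i "GPU acceleration"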

balston avatar Apr 16 '24 16:04 balston

The test subset I was running has finally finished:

All done. ERROR: only 244 out of 246 tests passed (1 skipped).
Failed tests in:
        /lustre/scratch/ccaabaa/Software/QuantumEspresso/qe-7.3.1-GitHub/test-suite/pw_workflow_exx_nscf/
Skipped test in:
        /lustre/scratch/ccaabaa/Software/QuantumEspresso/qe-7.3.1-GitHub/test-suite/pw_metaGGA/
make: *** [run-tests-pw] Error 1

One test failed and needs to be investigated.

balston avatar Apr 16 '24 16:04 balston

I now have a build script for installing into

/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/

I'm now running this on Young.

balston avatar Apr 17 '24 14:04 balston

Build finished without errors so I have a job running as ccspapp to run the test suite on a GPU node using:


#$ -pe mpi 4
#$ -l gpu=1

module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9

# To allow the test suite to run
module load python3/recommended

export PATH=${ESPRESSO_ROOT}/bin:$PATH
cd $ESPRESSO_ROOT/test-suite
make run-tests NPROCS=$NSLOTS 2>&1 | tee /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log

Job is:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1343927 3.50000 QE-7.3.1_G ccspapp      r     04/17/2024 17:03:51 [email protected]     4

balston avatar Apr 17 '24 16:04 balston

Job running the test suite finished overnight. The following tests failed:

pw_workflow_exx_nscf - uspp-k-restart-1.in (arg(s): 1): **FAILED**.
Different sets of data extracted from benchmark and test.
    Data only in benchmark: ef1, n1, band, e1.

pw_workflow_exx_nscf - uspp-k-restart-2.in (arg(s): 2): **FAILED**.
Different sets of data extracted from benchmark and test.
    Data only in benchmark: ef1, n1, band, e1.

All done. ERROR: only 244 out of 246 tests passed (1 skipped).
Failed tests in:
        /lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/test-suite/pw_workflow_exx_nscf/
Skipped test in:
        /lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/test-suite/pw_metaGGA/
make: *** [run-tests-pw] Error 1

Unfortunately, the pw test failures stopped the other test sets from starting. Investigating...

balston avatar Apr 18 '24 08:04 balston

I'm getting:

     GPU acceleration is ACTIVE.  1 visible GPUs per MPI rank
     GPU-aware MPI enabled

     Message from routine print_cuda_info:
     High GPU oversubscription detected. Are you sure this is what you want?

for the failed tests, which suggests the failures come from oversubscribing a single GPU with 4 MPI ranks rather than from the build itself.

balston avatar Apr 18 '24 11:04 balston

I successfully ran the failed tests with 2 GPUs so modified the full test job to use 2 GPUs and resubmitted it.
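
The relevant change is just the GPU resource request in the test job header; a minimal sketch, assuming the MPI slot count stays at 4:

#$ -pe mpi 4
#$ -l gpu=2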

balston avatar Apr 18 '24 16:04 balston

So the job to run the test suite runs the following tests:

cd $ESPRESSO_ROOT/test-suite

# Run all the default set of tests - pw, cp, ph, epw, hp, tddfpt, kcw

make run-tests-pw NPROCS=$NSLOTS 2>&1 | tee /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-cp NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-ph NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-epw NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-hp NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-tddfpt NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-kcw NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log

The job took just over two hours to run. All the pw tests passed, but some of the other tests failed and will need to be investigated. The test log has been copied here:

/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/run-tests.log-18042024

I have submitted a longer example job running the pw.x command:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1345043 0.00000 QE-7.3.1_G ccaabaa      qw    04/19/2024 10:12:43                                    8

which requests 4 GPUs and 8 MPI processes.
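
For reference, the resource request for this example job looks roughly like the sketch below (the launcher line and input file name are placeholders; gerun is the usual UCL MPI wrapper, and mpirun from the HPC SDK would also work):

#$ -pe mpi 8
#$ -l gpu=4

# Placeholder input file; ESPRESSO_ROOT assumed set as in the test-suite job
export PATH=${ESPRESSO_ROOT}/bin:$PATH
gerun pw.x -in example.in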

balston avatar Apr 19 '24 09:04 balston

My example job has run successfully so I'm going to make a module for this version and make it available on Young.
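
Once the module is in place, usage should be along these lines (the module name is my assumption based on the install prefix):

module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9
# Assumed module name, derived from /shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/
module load quantum-espresso/7.3.1-GPU/nvidia-2022-22.9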

balston avatar Apr 19 '24 11:04 balston

The module file is done and I've submitted a job to test that the module is correctly set up. Will check on Monday.

balston avatar Apr 19 '24 16:04 balston

Test job worked with the module file so I've emailed the Young user (IN06562363) wanting this version.

Will now build the GPU version on Myriad.

balston avatar Apr 22 '24 11:04 balston

Running:

module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9

cd /shared/ucl/apps/build_scripts

./quantum-espresso-7.3.1+git+GPU_install 2>&1 | tee ~/Scratch/Software/QuantumEspresso/quantum-espresso-7.3.1+git+GPU_install.log

on a Myriad A100 GPU node as ccspapp.

balston avatar Apr 22 '24 11:04 balston

The Myriad build failed with:

make[1]: Leaving directory `/lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/UtilXlib'
cd install ; make -f extlibs_makefile libcuda
make[1]: Entering directory `/lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/install'
initializing external/devxlib submodule ...
usage: git submodule [--quiet] add [-b <branch>] [-f|--force] [--name <name>] [--reference <repository>] [--] <repository> [<path>]
   or: git submodule [--quiet] status [--cached] [--recursive] [--] [<path>...]
   or: git submodule [--quiet] init [--] [<path>...]
   or: git submodule [--quiet] deinit [-f|--force] [--] <path>...
   or: git submodule [--quiet] update [--init] [--remote] [-N|--no-fetch] [-f|--force] [--rebase] [--reference <repository>] [--merge] [--recursive] [--] [<path>...]
   or: git submodule [--quiet] summary [--cached|--files] [--summary-limit <n>] [commit] [--] [<path>...]
   or: git submodule [--quiet] foreach [--recursive] <command>
   or: git submodule [--quiet] sync [--recursive] [--] [<path>...]
make[1]: *** [libcuda_devxlib] Error 1
make[1]: Leaving directory `/lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/install'
make: *** [libcuda] Error 2

balston avatar Apr 22 '24 12:04 balston

I may have been able to fix this problem. Re-running the build to see if it works.
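
One possible workaround (not necessarily the fix that was applied here) is to initialise the devxlib submodule by hand before re-running the build, so the extlibs Makefile does not have to drive git submodule itself:

cd /lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e
git submodule update --init external/devxlib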

balston avatar Apr 22 '24 15:04 balston

The build now runs without errors on Myriad.

I'm now going to submit a job to run the test suite on Myriad:

qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 470426 3.21148 QE-7.3.1_G ccspapp      qw    04/23/2024 14:01:16                                    2

balston avatar Apr 23 '24 14:04 balston

I've done a test build of the CPU/MPI variant in my Scratch on Kathleen and run the pw tests on 4 cores:

export NSLOTS=4
make run-tests-pw NPROCS=$NSLOTS 2>&1 | tee ~/Scratch/Software/QuantumEspresso/run-tests.log 

and got:

All done. 246 out of 246 tests passed (1 skipped).

Now sorting out the build script.

balston avatar Apr 23 '24 14:04 balston

build script for CPU/MPI variant done and running from ccspapp on Kathleen.

 ./quantum-espresso-7.3.1+git_install 2>&1 | tee ~/Software/QuantumESPRESSO/quantum-espresso-7.3.1+git_install.log

balston avatar Apr 23 '24 15:04 balston

Submitted a longer GPU example job on Myriad - 4 A100 GPUs and 8 MPI procs

balston avatar Apr 24 '24 10:04 balston

My example with 4 A100 GPUs and 8 MPI procs works.

balston avatar Apr 24 '24 15:04 balston

I have informed the User who wanted the GPU version on Myriad.

balston avatar Apr 25 '24 14:04 balston

The request for the CPU-only variant was from IN06568900, also for Young.

The run of the default test suite for this variant on Kathleen has finished. I will now run the build script on Young.

balston avatar Apr 25 '24 14:04 balston

build of the CPU variant on Young has completed. Will run the tests tomorrow.

balston avatar Apr 25 '24 16:04 balston

CPU variant job to run default test suite submitted on Young:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1356143 0.00000 QE-7.3.1_C ccspapp      qw    04/26/2024 09:47:48                                    8

balston avatar Apr 26 '24 08:04 balston

CPU tests ran successfully, so producing the module file.

balston avatar Apr 29 '24 11:04 balston

module file done and pulled to Young and Kathleen. User wanting the CPU variant informed.

balston avatar Apr 30 '24 08:04 balston

build of CPU variant finished on Myriad late yesterday. Will now run test suite.

balston avatar Apr 30 '24 08:04 balston

CPU/MPI variant test suite job submitted on Myriad:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 571215 0.00000 QE-7.3.1_C ccspapp      qw    04/30/2024 10:18:43                                    4

balston avatar Apr 30 '24 09:04 balston