rcps-buildscripts

Install Request: Quantum Espresso 7.3 GPU and CPU variants

Open heatherkellyucl opened this issue 1 year ago • 35 comments

IN:06165073

Recently on the Quantum Espresso mailing list a group posted impressive performance with the GPU version of the software.

They used the exact same GPUs that are available on the Young cluster. Would it be possible for you to compile the GPU-enabled 7.2 version of the software and make it available via module load?

Spack 0.20 has 7.1 with cuda variant available. (Might be a straightforward update to get it to build 7.2, might not).
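
For reference, a minimal Spack sketch of that route might look like the lines below (the package and variant names are taken from the Spack 0.20 quantum-espresso package; cuda_arch=80 is my assumption to match the A100-class cards, and building 7.2 would need the package recipe updated first):

# Hypothetical starting point via Spack (not a tested command line)
spack install quantum-espresso@7.1 +mpi +cuda cuda_arch=80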

heatherkellyucl avatar Sep 01 '23 14:09 heatherkellyucl

Kai and I have been helping a user on Young [IN06562363] get a working GPU build of the latest Quantum ESPRESSO, and we also have a Myriad user wanting it [IN06570525]. Since we have had to build it ourselves to work out how to make it work, it makes sense to provide this as a central install on both clusters.

balston avatar Apr 16 '24 15:04 balston

The current latest version is 7.3.1, so I've updated the title.

balston avatar Apr 16 '24 15:04 balston

My build, done on Young, is currently running part of the test suite in an interactive session on a Young GPU node with 1 GPU and 4 MPI procs. I'm running the tests using:

module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9

# To allow the test suite to run
module load python3/recommended

cd /qe-7.3.1-GitHub/test-suite
make run-tests-pw NPROCS=4 2>&1  | tee ../../run-tests-pw.log

balston avatar Apr 16 '24 15:04 balston

I'm currently using the following to build QUANTUM Espresso 7.3.1 on Young. Note the build must be done on a GPU node and not on the login nodes:

module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9

cd ./qe-7.3.1-GitHub
./configure --prefix=XXX/quantum-espresso/7.3.1  --with-cuda=/shared/ucl/apps/nvhpc/2022_221/Linux_x86_64/22.1/cuda  --with-cuda-runtime=11.7 --with-cuda-cc=80 --enable-openmp --with-cuda-mpi=yes
make all
make install
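
A quick sanity check that the resulting pw.x is actually GPU-enabled, run on the GPU node (test.in below is a placeholder input file, not part of the build script):

# Minimal check, assuming the build tree's bin/ directory and a small test input
mpirun -np 1 ./bin/pw.x -in test.in | grep -i "GPU acceleration"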

balston avatar Apr 16 '24 16:04 balston

The test subset I was running has finally finished:

All done. ERROR: only 244 out of 246 tests passed (1 skipped).
Failed tests in:
        /lustre/scratch/ccaabaa/Software/QuantumEspresso/qe-7.3.1-GitHub/test-suite/pw_workflow_exx_nscf/
Skipped test in:
        /lustre/scratch/ccaabaa/Software/QuantumEspresso/qe-7.3.1-GitHub/test-suite/pw_metaGGA/
make: *** [run-tests-pw] Error 1

One test failed and needs to be investigated.

balston avatar Apr 16 '24 16:04 balston

I now have a build script for installing into

/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/

I'm now running this on Young.

balston avatar Apr 17 '24 14:04 balston

Build finished without errors so I have a job running as ccspapp to run the test suite on a GPU node using:


#$ -pe mpi 4
#$ -l gpu=1

module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9

# To allow the test suite to run
module load python3/recommended

export PATH=${ESPRESSO_ROOT}/bin:$PATH
cd $ESPRESSO_ROOT/test-suite
make run-tests NPROCS=$NSLOTS 2>&1 | tee /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log

Job is:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1343927 3.50000 QE-7.3.1_G ccspapp      r     04/17/2024 17:03:51 [email protected]     4

balston avatar Apr 17 '24 16:04 balston

Job running the test suite finished overnight. The following tests failed:

pw_workflow_exx_nscf - uspp-k-restart-1.in (arg(s): 1): **FAILED**.
Different sets of data extracted from benchmark and test.
    Data only in benchmark: ef1, n1, band, e1.

pw_workflow_exx_nscf - uspp-k-restart-2.in (arg(s): 2): **FAILED**.
Different sets of data extracted from benchmark and test.
    Data only in benchmark: ef1, n1, band, e1.

All done. ERROR: only 244 out of 246 tests passed (1 skipped).
Failed tests in:
        /lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/test-suite/pw_workflow_exx_nscf/
Skipped test in:
        /lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/test-suite/pw_metaGGA/
make: *** [run-tests-pw] Error 1

Unfortunately, the pw test failures stopped the other test sets from starting. Investigating...

balston avatar Apr 18 '24 08:04 balston

I'm getting:

     GPU acceleration is ACTIVE.  1 visible GPUs per MPI rank
     GPU-aware MPI enabled

     Message from routine print_cuda_info:
     High GPU oversubscription detected. Are you sure this is what you want?

for the failed tests, which suggests the failures come from oversubscribing a single GPU with 4 MPI ranks rather than from the build itself.

balston avatar Apr 18 '24 11:04 balston

I successfully ran the failed tests with 2 GPUs so modified the full test job to use 2 GPUs and resubmitted it.
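
The relevant change is just the GPU resource request in the test job header; a minimal sketch, assuming the MPI slot count stays at 4:

#$ -pe mpi 4
#$ -l gpu=2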

balston avatar Apr 18 '24 16:04 balston

So the job to run the test suite runs the following tests:

cd $ESPRESSO_ROOT/test-suite

# Run all the default set of tests - pw, cp, ph, epw, hp, tddfpt, kcw

make run-tests-pw NPROCS=$NSLOTS 2>&1 | tee /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-cp NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-ph NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-epw NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-hp NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-tddfpt NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-kcw NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log

The job took just over two hours to run. All the pw tests passed, but some of the other tests failed and will need to be investigated. The test log has been copied here:

/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/run-tests.log-18042024

I have submitted a longer example job running the pw.x command:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1345043 0.00000 QE-7.3.1_G ccaabaa      qw    04/19/2024 10:12:43                                    8

which requests 4 GPUs and 8 MPI processes.
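
For reference, the resource request for this example job looks roughly like the sketch below (the launcher line and input file name are placeholders; gerun is the usual UCL MPI wrapper, and mpirun from the HPC SDK would also work):

#$ -pe mpi 8
#$ -l gpu=4

# Placeholder input file; ESPRESSO_ROOT assumed set as in the test-suite job
export PATH=${ESPRESSO_ROOT}/bin:$PATH
gerun pw.x -in example.in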

balston avatar Apr 19 '24 09:04 balston

My example job has run successfully so I'm going to make a module for this version and make it available on Young.
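
Once the module is in place, usage should be along these lines (the module name is my assumption based on the install prefix):

module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9
# Assumed module name, derived from /shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/
module load quantum-espresso/7.3.1-GPU/nvidia-2022-22.9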

balston avatar Apr 19 '24 11:04 balston

The module file is done and I've submitted a job to test that the module is correctly set up. Will check on Monday.

balston avatar Apr 19 '24 16:04 balston

Test job worked with the module file so I've emailed the Young user (IN06562363) wanting this version.

Will now build the GPU version on Myriad.

balston avatar Apr 22 '24 11:04 balston

Running:

module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9

cd /shared/ucl/apps/build_scripts

./quantum-espresso-7.3.1+git+GPU_install 2>&1 | tee ~/Scratch/Software/QuantumEspresso/quantum-espresso-7.3.1+git+GPU_install.log

on a Myriad A100 GPU node as ccspapp.

balston avatar Apr 22 '24 11:04 balston

The Myriad build failed with:

make[1]: Leaving directory `/lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/UtilXlib'
cd install ; make -f extlibs_makefile libcuda
make[1]: Entering directory `/lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/install'
initializing external/devxlib submodule ...
usage: git submodule [--quiet] add [-b <branch>] [-f|--force] [--name <name>] [--reference <repository>] [--] <repository> [<path>]
   or: git submodule [--quiet] status [--cached] [--recursive] [--] [<path>...]
   or: git submodule [--quiet] init [--] [<path>...]
   or: git submodule [--quiet] deinit [-f|--force] [--] <path>...
   or: git submodule [--quiet] update [--init] [--remote] [-N|--no-fetch] [-f|--force] [--rebase] [--reference <repository>] [--merge] [--recursive] [--] [<path>...]
   or: git submodule [--quiet] summary [--cached|--files] [--summary-limit <n>] [commit] [--] [<path>...]
   or: git submodule [--quiet] foreach [--recursive] <command>
   or: git submodule [--quiet] sync [--recursive] [--] [<path>...]
make[1]: *** [libcuda_devxlib] Error 1
make[1]: Leaving directory `/lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/install'
make: *** [libcuda] Error 2

balston avatar Apr 22 '24 12:04 balston

I may have been able to fix this problem. Re-running the build to see if it works.
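
One possible workaround (not necessarily the fix that was applied here) is to initialise the devxlib submodule by hand before re-running the build, so the extlibs Makefile does not have to drive git submodule itself:

cd /lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e
git submodule update --init external/devxlib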

balston avatar Apr 22 '24 15:04 balston

The build now runs without errors on Myriad.

I'm now going to submit a job to run the test suite on Myriad:

qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 470426 3.21148 QE-7.3.1_G ccspapp      qw    04/23/2024 14:01:16                                    2

balston avatar Apr 23 '24 14:04 balston

I've done a test build of the CPU/MPI variant in my Scratch on Kathleen and run the pw tests on 4 cores:

export NSLOTS=4
make run-tests-pw NPROCS=$NSLOTS 2>&1 | tee ~/Scratch/Software/QuantumEspresso/run-tests.log 

and got:

All done. 246 out of 246 tests passed (1 skipped).

Now sorting out the build script.

balston avatar Apr 23 '24 14:04 balston

build script for CPU/MPI variant done and running from ccspapp on Kathleen.

 ./quantum-espresso-7.3.1+git_install 2>&1 | tee ~/Software/QuantumESPRESSO/quantum-espresso-7.3.1+git_install.log

balston avatar Apr 23 '24 15:04 balston

Submitted a longer GPU example job on Myriad - 4 A100 GPUs and 8 MPI procs

balston avatar Apr 24 '24 10:04 balston

My example with 4 A100 GPUs and 8 MPI procs works.

balston avatar Apr 24 '24 15:04 balston

I have informed the User who wanted the GPU version on Myriad.

balston avatar Apr 25 '24 14:04 balston

The request for the CPU-only variant was from IN06568900, also for Young.

The run of the default test suite for this variant on Kathleen has finished. I will now run the build script on Young.

balston avatar Apr 25 '24 14:04 balston

build of the CPU variant on Young has completed. Will run the tests tomorrow.

balston avatar Apr 25 '24 16:04 balston

CPU variant job to run default test suite submitted on Young:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1356143 0.00000 QE-7.3.1_C ccspapp      qw    04/26/2024 09:47:48                                    8

balston avatar Apr 26 '24 08:04 balston

CPU tests ran successfully, so producing the module file.

balston avatar Apr 29 '24 11:04 balston

module file done and pulled to Young and Kathleen. User wanting the CPU variant informed.

balston avatar Apr 30 '24 08:04 balston

build of CPU variant finished on Myriad late yesterday. Will now run test suite.

balston avatar Apr 30 '24 08:04 balston

CPU/MPI variant test suite job submitted on Myriad:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 571215 0.00000 QE-7.3.1_C ccspapp      qw    04/30/2024 10:18:43                                    4

balston avatar Apr 30 '24 09:04 balston