Install Request: GPU build of NAMD
EPSRC work.
Current version is 2.14. (Unless we are meant to be looking at the 3.0 alpha?)
https://www.ks.uiuc.edu/Research/namd/2.14/ug/
https://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD
They have a few prebuilt CUDA binaries.
Having a look at what the available binaries are:
Version 2.14 (2020-08-05) Platforms:
- Linux-x86_64-multicore-CUDA (NVIDIA CUDA acceleration)
- Linux-x86_64-netlrts-smp-CUDA (Multi-copy algorithms, single process per copy)
- Linux-x86_64-verbs-smp-CUDA (InfiniBand, no MPI needed, supports multi-copy algorithms)
Version 3.0 GPU-Resident Single-Node-Per-Replicate ALPHA Release (2020-11-16) Platforms:
- Linux-x86_64-multicore-CUDA-SingleNode (NVIDIA CUDA acceleration (single-node))
- Linux-x86_64-netlrts-smp-CUDA-SingleNode (NVIDIA CUDA acceleration, multi-copy algorithms, single process per copy)
NGC has a container for 3.0_alpha3-singlenode and suggests the ApoA1 benchmark to test: https://catalog.ngc.nvidia.com/orgs/hpc/containers/namd
From that, I think building 2.14 is probably the thing to do, and it is also worth checking whether the prebuilt 2.14 binary works, for comparison.
https://www.ks.uiuc.edu/Research/namd/alpha/3.0alpha/
What MD Simulations Are Well-Suited to NAMD 3.0 Alpha Versions?
This scheme is intended for small to medium systems (10 thousand to 1 million atoms). For larger simulations, you should stick to the regular integration scheme, e.g., as used in NAMD 2.x.
This scheme is intended for modern GPUs, and it might slow your simulation down if you are not running on a Volta, Turing, or Ampere GPU! If your GPU is older, we recommend that you stick to NAMD 2.x.
The single-node version of NAMD 3.0 has almost everything offloaded to the GPU, so large CPU core counts are NOT necessary to get good performance. We recommend running NAMD with a low +p count, maybe 2-4 depending on system size, especially if the user plans on running multiple replica simulations within a node.
Note: there's a mistake in the instructions on the NGC page: they've put underscores in the listed tags instead of hyphens.
So:
# Correct
export NAMD_TAG=3.0-alpha3-singlenode
# Incorrect
export NAMD_TAG=3.0_alpha3-singlenode
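For reference, a minimal sketch of pulling and running that container with Singularity on a GPU node. The registry path is taken from the NGC page linked above, but the run line (binary name namd3, process count, input path) is an assumption on my part, not something tested here:
# Sketch only: pull the NGC container (assumes the correct NAMD_TAG export above)
singularity pull namd_${NAMD_TAG}.sif docker://nvcr.io/hpc/namd:${NAMD_TAG}
# Run the ApoA1 benchmark inside the container with the GPU bound in (--nv);
# namd3 binary name and the apoa1 input path are assumptions.
singularity exec --nv namd_${NAMD_TAG}.sif namd3 +p4 +setcpuaffinity +devices 0 apoa1/apoa1_nve_cuda.namd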
The NAMD_2.14_Linux-x86_64-multicore-CUDA binary seems to have found the GPU and done something with it. It was run with 1 GPU and 36 cores as:
../NAMD_2.14_Linux-x86_64-multicore-CUDA/namd2 +p${NSLOTS} +setcpuaffinity ../apoa1/apoa1_nve_cuda.namd
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 36 threads (PEs)
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.10.2-0-g7bf00fa-namd-charm-6.10.2-build-2020-Aug-05-556
Warning> Randomization of virtual memory (ASLR) is turned on in the kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try running with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> Running on 1 hosts (2 sockets x 18 cores x 1 PUs = 36-way SMP)
Charm++> cpu topology info is gathered in 0.003 seconds.
Info: Built with CUDA version 10010
Did not find +devices i,j,k,... argument, using all
Pe 24 physical rank 24 will use CUDA device of pe 32
Pe 16 physical rank 16 will use CUDA device of pe 32
...
Pe 32 physical rank 32 binding to CUDA device 0 on node-l00a-004.myriad.ucl.ac.uk: 'A100-PCIE-40GB' Mem: 40536MB Rev: 8.0 PCI: 0:6:0
Info: NAMD 2.14 for Linux-x86_64-multicore-CUDA
...
Info: PME using 1 x 1 x 1 pencil grid for FFT and reciprocal sum.
Info: Startup phase 7 took 0.21894 s, 450.078 MB of memory in use
Info: Updated CUDA force table with 4096 elements.
Info: Updated CUDA LJ table with 83 x 83 elements.
Info: Startup phase 8 took 0.234264 s, 451.047 MB of memory in use
...
TIMING: 10000 CPU: 26.2844, 0.00221166/step Wall: 27.3432, 0.00223027/step, 0 hours remaining, 571.800781 MB of memory in use.
ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
ENERGY: 10000 2204.6046 11354.0607 5751.7660 208.6148 -306846.5028 19604.1665 0.0000 0.0000 35965.7762 -231757.5141 184.7014 -267723.2903 -231565.0984 183.9502 -1479.2899 -1450.3310 921491.4634 -1567.0594 -1567.0459
WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 10000
WRITING COORDINATES TO OUTPUT FILE AT STEP 10000
The last position output (seq=-2) takes 0.009 seconds, 580.863 MB of memory in use
WRITING VELOCITIES TO OUTPUT FILE AT STEP 10000
The last velocity output (seq=-2) takes 0.003 seconds, 581.035 MB of memory in use
====================================================
WallClock: 58.677982 CPUTime: 54.226124 Memory: 581.050781 MB
[Partition 0][Node 0] End of program
It also worked and allocated cores across the GPUs when run with 4 GPUs:
Pe 16 physical rank 16 binding to CUDA device 1 on node-l00a-001.myriad.ucl.ac.uk: 'A100-PCIE-40GB' Mem: 40536MB Rev: 8.0 PCI: 0:2f:0
Pe 32 physical rank 32 binding to CUDA device 3 on node-l00a-001.myriad.ucl.ac.uk: 'A100-PCIE-40GB' Mem: 40536MB Rev: 8.0 PCI: 0:d8:0
Pe 24 physical rank 24 binding to CUDA device 2 on node-l00a-001.myriad.ucl.ac.uk: 'A100-PCIE-40GB' Mem: 40536MB Rev: 8.0 PCI: 0:86:0
Pe 8 physical rank 8 binding to CUDA device 0 on node-l00a-001.myriad.ucl.ac.uk: 'A100-PCIE-40GB' Mem: 40536MB Rev: 8.0 PCI: 0:6:0
...
TIMING: 10000 CPU: 16.0314, 0.00159462/step Wall: 16.2274, 0.001611/step, 0 hours remaining, 1397.472656 MB of memory in use.
ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
ENERGY: 10000 2194.5660 11382.4649 5685.5128 189.6088 -306784.1328 19604.7429 0.0000 0.0000 35969.0949 -231758.1425 184.7185 -267727.2374 -231564.1706 183.9395 -1507.5271 -1479.5944 921491.4634 -1521.8678 -1521.8701
WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 10000
WRITING COORDINATES TO OUTPUT FILE AT STEP 10000
The last position output (seq=-2) takes 0.012 seconds, 1405.859 MB of memory in use
WRITING VELOCITIES TO OUTPUT FILE AT STEP 10000
The last velocity output (seq=-2) takes 0.007 seconds, 1406.047 MB of memory in use
====================================================
WallClock: 40.250298 CPUTime: 21.173227 Memory: 1406.062500 MB
[Partition 0][Node 0] End of program
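The startup message earlier ("Did not find +devices i,j,k,... argument, using all") means it grabs every visible GPU. If we ever need to pin a run to particular GPUs (e.g. several runs sharing a node), the device list can be passed explicitly; a quick sketch using the same binary, with the device indices as placeholders:
# Restrict this run to CUDA devices 0 and 1 only (indices are placeholders)
../NAMD_2.14_Linux-x86_64-multicore-CUDA/namd2 +p${NSLOTS} +setcpuaffinity +devices 0,1 ../apoa1/apoa1_nve_cuda.namd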
After discussion (I was partway through modifying our current buildscripts, which use the Intel compiler, into ones using GCC and CUDA): we should update the CUDA modules so they no longer require the GNU compiler to be loaded (they don't actually need it) and only depend on gcc-libs. It should be fine to build programs with the Intel compiler and the newer gcc-libs.
(NAMD still does a fair bit of CPU computation in its GPU version, so the host (non-CUDA) compiler is still important.)
It may be useful to test Intel + CUDA performance against GCC 10 + CUDA performance and so build both versions.
Hmm, the Intel 2018 module has symlinks in its intel64/lib directory to a bunch of libraries that are in release_mt, including libmpi.so and .a. Our Intel 2019 install does not have the symlinks, and so charm++, which directly does -lmpi, cannot find them. (It is doing build checks with icc rather than mpicxx).
Intel restructured the layout in 2019. ~~Will add to module paths for the very few things that look for them.~~ This is bad, at least for scalapack, so instead we can sort it out for this build process only.
Charm++ does not like the Intel 2019 + GCC 10 gcc-libs combination :(
/lustre/shared/ucl/apps/gcc/10.2.0-p95889/bin/../include/c++/10.2.0/bits/atomic_base.h(74): error: invalid redefinition of enum "std::memory_order" (declared at line 168 of "/lustre/shared/ucl/apps/intel/2019.Update5/compilers_and_libraries_2019.5.281/linux/compiler/include/stdatomic.h")
typedef enum memory_order
compilation aborted for DummyLB.C (code 2)
Fatal Error by charmc in directory /home/cceahke/namd/namd-2.14-cuda/NAMD_2.14_Source/charm-6.10.2/mpi-linux-x86_64-iccstatic/tmp
Command icpc -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX -I../bin/../include -D__CHARMC__=1 -DCMK_OPTIMIZE -I. -xHost -O2 -U_FORTIFY_SOURCE -c DummyLB.C -o DummyLB.o returned error code 2
charmc exiting...
Older gcc-libs underneath without those fancy C++ atomics ought to work.
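Roughly, that retry looks like this. The module names and extra build options here are assumptions for illustration, not the exact buildscript contents:
# Assumed: swap the GCC 10 gcc-libs for an older one before rebuilding Charm++
module unload -f gcc-libs
module load gcc-libs/4.9.2        # hypothetical older gcc-libs module name
cd NAMD_2.14_Source/charm-6.10.2
./build charm++ mpi-linux-x86_64 iccstatic --with-production -xHost -O2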
Charm++ builds, then NAMD configure complains:
ERROR: MPI-based Charm++ arch mpi-linux-x86_64-iccstatic is not compatible with CUDA NAMD.
ERROR: Non-SMP Charm++ arch mpi-linux-x86_64-iccstatic is not compatible with CUDA NAMD.
ERROR: CUDA builds require non-MPI SMP or multicore Charm++ arch for reasonable performance.
Consider ucx-smp or verbs-smp (InfiniBand), gni-smp (Cray), or multicore (single node).
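So a from-source CUDA build needs a multicore (or non-MPI SMP) Charm++ underneath. A rough sketch of that sequence for the Myriad multicore-CUDA item, assuming the stock 2.14 source layout and CUDA under $CUDA_HOME (FFTW/Tcl config options omitted):
# Sketch: multicore Charm++ plus CUDA-enabled NAMD, Intel compiler
cd NAMD_2.14_Source/charm-6.10.2
./build charm++ multicore-linux-x86_64 iccstatic --with-production
cd ..
./config Linux-x86_64-icc --charm-arch multicore-linux-x86_64-iccstatic --with-cuda --cuda-prefix ${CUDA_HOME}
cd Linux-x86_64-icc
make -j4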
An InfiniBand network is highly recommended when running CUDA-accelerated NAMD across multiple nodes. You will need either an ibverbs NAMD binary (available for download) or an MPI NAMD binary (must build Charm++ and NAMD as described above) to make use of the InfiniBand network. The use of SMP binaries is also recommended when running on multiple nodes, with one process per GPU and as many threads as available cores, reserving one core per process for the communication thread.
Then in https://www.ks.uiuc.edu/Research/namd/2.14/ug/node102.html
Intel Omni-Path networks are incompatible with the pre-built ibverbs NAMD binaries. Charm++ for verbs can be built with -with-qlogic to support Omni-Path, but the Charm++ MPI network layer performs better than the verbs layer. Hangs have been observed with Intel MPI but not with OpenMPI, so OpenMPI is preferred. See ``Compiling NAMD'' below for MPI build instructions. NAMD MPI binaries may be launched directly with mpiexec rather than via the provided charmrun script.
Compiling NAMD:
We provide complete and optimized binaries for all common platforms to which NAMD has been ported. It should not be necessary for you to compile NAMD unless you wish to add or modify features or to improve performance by using an MPI library that takes advantage of special networking hardware.
Directions for compiling NAMD are contained in the release notes, which are available from the NAMD web site http://www.ks.uiuc.edu/Research/namd/ and are included in all distributions.
https://www.ks.uiuc.edu/Research/namd/2.14/ug/node104.html:
Shared-Memory and Network-Based Parallelism (SMP Builds)
The Linux-x86_64-ibverbs-smp and Solaris-x86_64-smp released binaries are based on ``smp'' builds of Charm++ that can be used with multiple threads on either a single machine like a multicore build, or across a network. SMP builds combine multiple worker threads and an extra communication thread into a single process. Since one core per process is used for the communication thread SMP builds are typically slower than non-SMP builds. The advantage of SMP builds is that many data structures are shared among the threads, reducing the per-core memory footprint when scaling large simulations to large numbers of cores.
SMP builds launched with charmrun use ++n to specify the total number of processes (Charm++ "nodes") and ++ppn to specify the number of PEs (Charm++ worker threads) per process. Previous versions required the use of +p to specify the total number of PEs, but the new ++n option is now recommended. Thus, to run one process with one communication and three worker threads on each of four quad-core nodes one would specify: charmrun namd2 ++n 4 ++ppn 3
For MPI-based SMP builds one would specify any mpiexec options needed for the required number of processes and pass +ppn to the NAMD binary as: mpiexec -n 4 namd2 +ppn 3
When in doubt, check Compute Canada: https://docs.computecanada.ca/wiki/NAMD/en#Parallel_GPU_jobs
They use OFI GPU on their OmniPath interconnect machine and UCX GPU on Infiniband machines.
If we need to do that, we aren't going to be able to test an OPA GPU build before the hardware arrives. (Can test a CPU-only parallel OFI build on Young).
Top priority is now to get NAMD OFI CPU working on Young with gerun using charmrun - then we can make the OFI CUDA one work when the GPUs exist.
Notes on charmrun and SGE: https://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdOnGridEngine
This is quite handy, about building multiple versions and comparing them: https://docs.hpc.wvu.edu/text/609.CHARM++_NAMD.html
Note: in our Linux-x86_64-icc.arch we have -qopenmp-simd (and so does the link above, which are CPU versions). Compute Canada's builds set -qno-openmp-simd (https://github.com/ComputeCanada/easybuild-easyconfigs/tree/computecanada-main/easybuild/easyconfigs/n/NAMD). I don't know which we want.
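If we do want to compare, a no-simd build is just a one-line change to the arch file before running config (assuming the flag appears literally in the file, as it does in ours):
# Build variant without OpenMP SIMD: flip the flag in the Intel arch file
sed -i 's/-qopenmp-simd/-qno-openmp-simd/' arch/Linux-x86_64-icc.arch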
(Currently have a charm++ ofi-smp built on Young, building namd).
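For the record, a sketch of roughly what that ofi-smp build looks like; the Charm++ build options and the --charm-arch directory name here are assumptions based on the standard release-notes recipe (the production build also passes the Intel compiler options), not a copy of the buildscript:
cd NAMD_2.14_Source/charm-6.10.2
./build charm++ ofi-linux-x86_64 smp --with-production
cd ..
./config Linux-x86_64-icc --charm-arch ofi-linux-x86_64-smp
cd Linux-x86_64-icc
make -j4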
On Young:
- [x] NAMD 2.14 ofi-smp with openmp-simd
- [x] benchmarking jobs
- [x] NAMD 2.14 ofi-smp without openmp-simd
- [x] benchmarking jobs
On Myriad:
- [x] NAMD 2.14 multicore CUDA binary
- [ ] benchmarking jobs
- [ ] NAMD 2.14 multicore CUDA from source (not much point, really)
- [ ] benchmarking jobs
Extras
- [ ] NAMD 2.14 ucx versions on Myriad (should have, is less urgent)
Run the apoa1 benchmark twice in the same job to remove the initial FFT optimisation when comparing timings, and look at the second set of times.
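i.e. something like this in the job script, where only the second run's TIMING/WallClock lines are compared (a sketch using the same invocation as the benchmark jobs below):
# Run ApoA1 twice in the same job; the first run includes the initial FFTW
# plan optimisation, so only compare timings from the second run.
for run in 1 2; do
    charmrun +p${NSLOTS} namd2 apoa1.namd ++ppn2 +setcpuaffinity > apoa1_run${run}.log
done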
The smp version takes one extra thread per process for managing communication, reducing the number of usable cores, but it shares memory efficiently between threads, which allows computations on larger systems. That is why having both smp and non-smp versions is recommended.
Should probably also have ofi and ucx versions without smp for this reason.
Submitted a job with the first ofi-smp version.
Young, apoa1, namd_ofismp_nosimd_12, end of second run:
Running on 6 processors: namd2 apoa1.namd ++ppn2
charmrun> /bin/setarch x86_64 -R mpirun -np 6 namd2 apoa1.namd ++ppn2
Charm++>ofi> provider: psm2
Charm++>ofi> control progress: 2
Charm++>ofi> data progress: 2
Charm++>ofi> maximum inject message size: 64
Charm++>ofi> eager maximum message size: 65536 (maximum header size: 40)
Charm++>ofi> cq entries count: 8
Charm++>ofi> use inject: 1
Charm++>ofi> maximum rma size: 4294967295
Charm++>ofi> mr mode: 0x1
Charm++>ofi> use memory pool: 0
Charm++>ofi> use request cache: 0
Charm++>ofi> number of pre-allocated recvs: 8
Charm++>ofi> exchanging addresses over OFI
Charm++> Running in SMP mode: 6 processes, 2 worker threads (PEs) + 1 comm threads per process, 12 PEs total
Charm++> The comm. thread both sends and receives messages
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.10.2-0-g7bf00fa-namd-charm-6.10.2-build-2020-Aug-05-556
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (2 sockets x 20 cores x 1 PUs = 40-way SMP)
Charm++> cpu topology info is gathered in 0.004 seconds.
Info: Benchmark time: 12 CPUs 0.0453018 s/step 0.524327 days/ns 892.32 MB memory
TIMING: 500 CPU: 23.5762, 0.0448887/step Wall: 23.6368, 0.0450001/step, 0 hours remaining, 892.320312 MB of memory in use.
ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
ENERGY: 500 20974.8941 19756.6582 5724.4523 179.8271 -337741.4181 23251.1002 0.0000 0.0000 45359.0771 -222495.4091 165.0039 -267854.4862 -222061.0909 165.0039 -3197.5173 -2425.4144 921491.4634 -3197.5173 -2425.4144
WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 500
WRITING COORDINATES TO OUTPUT FILE AT STEP 500
The last position output (seq=-2) takes 0.006 seconds, 896.949 MB of memory in use
WRITING VELOCITIES TO OUTPUT FILE AT STEP 500
The last velocity output (seq=-2) takes 0.012 seconds, 896.949 MB of memory in use
====================================================
WallClock: 25.280815 CPUTime: 25.181173 Memory: 896.949219 MB
namd_ofismp_12
Running on 6 processors: namd2 apoa1.namd ++ppn2
charmrun> /bin/setarch x86_64 -R mpirun -np 6 namd2 apoa1.namd ++ppn2
Charm++>ofi> provider: psm2
Charm++>ofi> control progress: 2
Charm++>ofi> data progress: 2
Charm++>ofi> maximum inject message size: 64
Charm++>ofi> eager maximum message size: 65536 (maximum header size: 40)
Charm++>ofi> cq entries count: 8
Charm++>ofi> use inject: 1
Charm++>ofi> maximum rma size: 4294967295
Charm++>ofi> mr mode: 0x1
Charm++>ofi> use memory pool: 0
Charm++>ofi> use request cache: 0
Charm++>ofi> number of pre-allocated recvs: 8
Charm++>ofi> exchanging addresses over OFI
Charm++> Running in SMP mode: 6 processes, 2 worker threads (PEs) + 1 comm threads per process, 12 PEs total
Charm++> The comm. thread both sends and receives messages
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.10.2-0-g7bf00fa-namd-charm-6.10.2-build-2020-Aug-05-556
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (2 sockets x 20 cores x 1 PUs = 40-way SMP)
Charm++> cpu topology info is gathered in 0.004 seconds.
Info: Benchmark time: 12 CPUs 0.0471704 s/step 0.545953 days/ns 755.539 MB memory
TIMING: 500 CPU: 24.6084, 0.0468049/step Wall: 24.6676, 0.0468979/step, 0 hours remaining, 755.539062 MB of memory in use.
ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
ENERGY: 500 20974.8944 19756.6576 5724.4523 179.8271 -337741.4177 23251.0995 0.0000 0.0000 45359.0774 -222495.4094 165.0039 -267854.4868 -222061.0912 165.0039 -3197.5178 -2425.4147 921491.4634 -3197.5178 -2425.4147
WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 500
WRITING COORDINATES TO OUTPUT FILE AT STEP 500
The last position output (seq=-2) takes 0.026 seconds, 760.379 MB of memory in use
WRITING VELOCITIES TO OUTPUT FILE AT STEP 500
The last velocity output (seq=-2) takes 0.006 seconds, 760.379 MB of memory in use
====================================================
WallClock: 25.794182 CPUTime: 25.666487 Memory: 760.378906 MB
Trying to get them working multi-node, I am somewhat confused: the charmrun that NAMD has installed calls mpirun itself, so the examples that suggest doing this kind of thing and making a charm-suitable hostfile are not relevant:
# Convert SGE hostfile to charmrun hostfile.
nodefile=namd2.${JOB_ID}.nodelist
echo group main > $nodefile
awk '{ for (i=0;i<$2;++i) {print "host",$1} }' $PE_HOSTFILE >> $nodefile
charmrun ++remote-shell ssh ++nodelist $nodefile +p${NSLOTS} namd2 apoa1.namd ++ppn2 +setcpuaffinity
If you do those you get errors, because it passes the ++remote-shell (or ++nodelist if you remove that) straight through to mpirun, which thinks it is an executable it is meant to run...
Oh well. We have mpirun as the launcher.
Need to try getting qrsh to launch the correct number of processes on the nodes.
Or not, actually - https://dl.acm.org/doi/pdf/10.1145/3219104.3219134 has been more useful than the main docs in terms of how things are meant to be run. (There's a lot of info in the main docs, but ofi and especially ofi-cuda builds are fairly niche, so they aren't used as examples, and the segments given for each part separately don't necessarily fit together, like the "no mpi-smp for CUDA builds" part.)
Builds based on MPI, Cray GNI, OFI, and IBM PAMI are launched the same as are MPI programs on the machine (mpirun, mpiexec, ibrun, aprun, jsrun, etc.).
Which does explain why we got a charmrun that does not have all the usual options used in examples.
I think we're having a process mapping issue at the moment. (Charm's terminology is confusing).
I stuck some extra echoes into a charmrun-verbose. It is doing the processes vs threads division itself ($NSLOTS is 80 here).
pes orig: 80, ppn: 2
pes after: 40
Running on 40 processors: -machinefile /tmpdir/job/556150.undefined/machines namd2 apoa1.namd ++ppn2 +setcpuaffinity
mpirun_cmd: /shared/ucl/apps/intel/2019.Update5/impi/2019.5.281/intel64/bin/mpirun
charmrun> /bin/setarch x86_64 -R mpirun -np 40 -machinefile /tmpdir/job/556150.undefined/machines namd2 apoa1.namd ++ppn2 +setcpuaffinity
Charm++> Running in SMP mode: 40 processes, 2 worker threads (PEs) + 1 comm threads per process, 80 PEs total
Charm++> The comm. thread both sends and receives messages
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.10.2-0-g7bf00fa-namd-charm-6.10.2-build-2020-Aug-05-556
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> Running on 1 hosts (2 sockets x 20 cores x 1 PUs = 40-way SMP)
Charm++> cpu topology info is gathered in 0.291 seconds.
Charm++> Warning: the number of SMP threads (120) is greater than the number of physical cores (80), so threads will sleep while idling. Use +CmiSpinOnIdle or +CmiSleepOnIdle to control this directly.
WARNING: Multiple PEs assigned to same core, recommend adjusting processor affinity or passing +CmiSleepOnIdle to reduce interference.
Info: 1 NAMD 2.14 Linux-x86_64-ofi-smp 80 node-c12f-001 cceahke
Info: Running on 80 processors, 40 nodes, 1 physical nodes.
Charm++ uses an unconventional internal nomenclature that may appear in NAMD startup and error messages. A Charm++ “PE” (processing element) is a worker thread (typically a POSIX thread running on a dedicated hardware thread). A Charm++ “node” is a process (a set of PEs sharing a memory space). A Charm++ “physical node” is a host (running a set of Charm++ nodes that share network interfaces and GPUs, and which are assumed to communicate faster among themselves than with other hosts).
For smp builds on the above platforms [MPI, OFI etc], the number of worker threads (PEs) per process must be specified as +ppn threads, e.g., “mpiexec -n 4 namd2 +ppn 7 ...” would launch 4 processes, each with 7 worker threads plus a communication thread, thus using a total of 32 cores (or hardware threads). Care must be taken to specify to the platform launch system the total number of threads (worker plus communication) for each process so that sufficient cores are reserved and affinity set, otherwise all of the NAMD threads may end up sharing a single core. Note that the PAMI and multicore platforms lack a separate communication thread and may thus use all cores for computation. When running multi-copy algorithms with NAMD, if each replica is a single process then the communication thread will sleep when idle and thus also does not require a dedicated core.
https://github.com/UIUC-PPL/charm/issues/2059 is about hwloc issues when setting +pemap and +comap for jobs, but it does show some output where it is incorrectly running on 1 host vs correctly running on 2 hosts. So we need to set something to get this using 2 hosts.
(Was also wondering if it might do better automatically with $TMPDIR/machines.unique instead, since I loaded an Intel MPI here.)
I should check what the ofi-only version with no smp does on two nodes, then come back to this one.
Ahahah, the ofi-smp job that was already in the queue with machines.unique has worked:
Charm++> Running on 2 hosts (2 sockets x 20 cores x 1 PUs = 40-way SMP)
Info: 1 NAMD 2.14 Linux-x86_64-ofi-smp 80 node-c12i-001 cceahke
Info: Running on 80 processors, 40 nodes, 2 physical nodes.
and that was run as
charmrun -machinefile $TMPDIR/machines.unique +p${NSLOTS} namd2 apoa1.namd ++ppn2 +setcpuaffinity
Did some comparisons of options (4 repeats of each in one job, ignoring the results from the first where it does the FFTs). Yes, +setcpuaffinity for smp jobs! I wanted to make sure it was still helpful if you don't give any more explicit mappings.
80 cores (ofi-smp, charmrun +p${NSLOTS} namd2 apoa1.namd ++ppn2 +setcpuaffinity):
The last velocity output (seq=-2) takes 0.002 seconds, 789.320 MB of memory in use
WallClock: 104.270256 CPUTime: 86.148232 Memory: 789.320312 MB
The last velocity output (seq=-2) takes 0.002 seconds, 783.336 MB of memory in use
WallClock: 104.243652 CPUTime: 84.784187 Memory: 783.335938 MB
The last velocity output (seq=-2) takes 0.003 seconds, 783.262 MB of memory in use
WallClock: 105.437546 CPUTime: 85.179848 Memory: 783.261719 MB
80 cores (ofi-smp, charmrun +p${NSLOTS} namd2 apoa1.namd ++ppn2):
The last velocity output (seq=-2) takes 0.003 seconds, 789.391 MB of memory in use
WallClock: 148.292145 CPUTime: 101.689720 Memory: 789.390625 MB
The last velocity output (seq=-2) takes 0.003 seconds, 789.121 MB of memory in use
WallClock: 147.651932 CPUTime: 103.590492 Memory: 789.121094 MB
The last velocity output (seq=-2) takes 0.003 seconds, 792.082 MB of memory in use
WallClock: 152.005264 CPUTime: 101.689995 Memory: 792.082031 MB
The simd vs no-simd comparison is inconclusive, so we may as well leave -qopenmp-simd on.
Final (CPU) builds on Young:
- [x] ofi-smp
- [x] ofi
- [x] charmrun wrapper(s)
- [x] modulefiles
- [x] user docs
Testing the charmrun wrapper.
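For context, the wrapper is essentially a thin shim around NAMD's own charmrun that supplies the SGE-provided machine list, so users don't have to pass it themselves. A hypothetical sketch (not the installed script; the charmrun path is a placeholder):
#!/usr/bin/env bash
# Hypothetical wrapper: inject the per-job machine list, pass everything else through
exec /path/to/namd-install/charmrun -machinefile "${TMPDIR}/machines.unique" "$@"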
Install on:
- [x] Young
- [x] Kathleen
- [x] Thomas (for Michael)
Modules are:
For the ofi build:
module unload -f compilers mpi
module load compilers/intel/2019/update5
module load mpi/intel/2019/update5/intel
module load namd/2.14/ofi/intel-2019
For the ofi-smp build:
module unload -f compilers mpi
module load compilers/intel/2019/update5
module load mpi/intel/2019/update5/intel
module load namd/2.14/ofi-smp/intel-2019
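A minimal Young job-script fragment using the ofi-smp module, matching the invocation used in the benchmark jobs above (resource requests are placeholders):
#!/bin/bash -l
#$ -l h_rt=2:00:00        # placeholder runtime
#$ -pe mpi 80             # e.g. two 40-core nodes
#$ -cwd
module unload -f compilers mpi
module load compilers/intel/2019/update5
module load mpi/intel/2019/update5/intel
module load namd/2.14/ofi-smp/intel-2019
charmrun +p${NSLOTS} namd2 apoa1.namd ++ppn2 +setcpuaffinity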
Having an extra test job submitted using the central install on Young, just in case. (Worked).
A variety of job sizes is also in the queue, so we can get a rough benchmark of up to and including 5 nodes.