
Install Request: GPU build of NAMD

heatherkellyucl opened this issue 3 years ago • 37 comments

EPSRC work.

Current version is 2.14. (Unless we are meant to be looking at the 3.0 alpha?)

https://www.ks.uiuc.edu/Research/namd/2.14/ug/
https://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD

They have a few prebuilt CUDA binaries.

heatherkellyucl avatar Jan 17 '22 11:01 heatherkellyucl

Having a look at what the available binaries are:

Version 2.14 (2020-08-05) Platforms:

  • Linux-x86_64-multicore-CUDA (NVIDIA CUDA acceleration)
  • Linux-x86_64-netlrts-smp-CUDA (Multi-copy algorithms, single process per copy)
  • Linux-x86_64-verbs-smp-CUDA (InfiniBand, no MPI needed, supports multi-copy algorithms)

Version 3.0 GPU-Resident Single-Node-Per-Replicate ALPHA Release (2020-11-16) Platforms:

  • Linux-x86_64-multicore-CUDA-SingleNode (NVIDIA CUDA acceleration (single-node))
  • Linux-x86_64-netlrts-smp-CUDA-SingleNode (NVIDIA CUDA acceleration, multi-copy algorithms, single process per copy)

NGC has a container for 3.0_alpha3-singlenode and suggests the ApoA1 benchmark to test: https://catalog.ngc.nvidia.com/orgs/hpc/containers/namd

From that, I think building 2.14 is probably the thing to do, and maybe also checking whether the prebuilt 2.14 binary works, to compare against.

https://www.ks.uiuc.edu/Research/namd/alpha/3.0alpha/

What MD Simulations Are Well-Suited to NAMD 3.0 Alpha Versions?

This scheme is intended for small to medium systems (10 thousand to 1 million atoms). For larger simulations, you should stick to the regular integration scheme, e.g., as used in NAMD 2.x.

This scheme is intended for modern GPUs, and it might slow your simulation down if you are not running on a Volta, Turing, or Ampere GPU! If your GPU is older, we recommend that you stick to NAMD 2.x.

The single-node version of NAMD 3.0 has almost everything offloaded to the GPU, so large CPU core counts are NOT necessary to get good performance. We recommend running NAMD with a low +p count, maybe 2-4 depending on system size, especially if the user plans on running multiple replica simulations within a node.
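
For reference, a minimal sketch of what a run following that advice might look like (the namd3 binary name and the apoa1 input file are assumptions, borrowed from the NGC benchmark suggestion above):

# Sketch only: single-GPU, low +p run of the 3.0 alpha multicore-CUDA build
# (binary and input names are assumptions).
./namd3 +p4 +devices 0 apoa1_nve_cuda.namd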

heatherkellyucl avatar Jan 27 '22 14:01 heatherkellyucl

Note: there's a mistake in the instructions on the NGC. They've put underscores in the listed tags instead of hyphens.

So:

# Correct
export NAMD_TAG=3.0-alpha3-singlenode

# Incorrect
export NAMD_TAG=3.0_alpha3-singlenode
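
For completeness, a sketch of pulling and running that container with the corrected tag (the nvcr.io registry path follows the NGC catalogue convention, and the Singularity invocation is an assumption about how we would run it here):

# Sketch: pull the NGC image with the corrected tag and run the suggested
# ApoA1 benchmark under Singularity (registry path and file layout assumed).
export NAMD_TAG=3.0-alpha3-singlenode
singularity pull namd_${NAMD_TAG}.sif docker://nvcr.io/hpc/namd:${NAMD_TAG}
singularity exec --nv namd_${NAMD_TAG}.sif namd3 +p4 +devices 0 apoa1/apoa1_nve_cuda.namd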

ikirker avatar Jan 28 '22 15:01 ikirker

NAMD_2.14_Linux-x86_64-multicore-CUDA binary seems to have found the GPU and done something with it.

It was run with 1 GPU and 36 cores as:

../NAMD_2.14_Linux-x86_64-multicore-CUDA/namd2 +p${NSLOTS} +setcpuaffinity ../apoa1/apoa1_nve_cuda.namd
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 36 threads (PEs)
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.10.2-0-g7bf00fa-namd-charm-6.10.2-build-2020-Aug-05-556
Warning> Randomization of virtual memory (ASLR) is turned on in the kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try running with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled. 
Charm++> Running on 1 hosts (2 sockets x 18 cores x 1 PUs = 36-way SMP)
Charm++> cpu topology info is gathered in 0.003 seconds.
Info: Built with CUDA version 10010
Did not find +devices i,j,k,... argument, using all
Pe 24 physical rank 24 will use CUDA device of pe 32
Pe 16 physical rank 16 will use CUDA device of pe 32
...
Pe 32 physical rank 32 binding to CUDA device 0 on node-l00a-004.myriad.ucl.ac.uk: 'A100-PCIE-40GB'  Mem: 40536MB  Rev: 8.0  PCI: 0:6:0
Info: NAMD 2.14 for Linux-x86_64-multicore-CUDA
...
Info: PME using 1 x 1 x 1 pencil grid for FFT and reciprocal sum.
Info: Startup phase 7 took 0.21894 s, 450.078 MB of memory in use
Info: Updated CUDA force table with 4096 elements.
Info: Updated CUDA LJ table with 83 x 83 elements.
Info: Startup phase 8 took 0.234264 s, 451.047 MB of memory in use
...
TIMING: 10000  CPU: 26.2844, 0.00221166/step  Wall: 27.3432, 0.00223027/step, 0 hours remaining, 571.800781 MB of memory in use.
ETITLE:      TS           BOND          ANGLE          DIHED          IMPRP               ELECT            VDW       BOUNDARY           MISC        KINETIC               TOTAL           TEMP      POTENTIAL         TOTAL3        TEMPAVG            PRESSURE      GPRESSURE         VOLUME       PRESSAVG      GPRESSAVG

ENERGY:   10000      2204.6046     11354.0607      5751.7660       208.6148        -306846.5028     19604.1665         0.0000         0.0000     35965.7762        -231757.5141       184.7014   -267723.2903   -231565.0984       183.9502          -1479.2899     -1450.3310    921491.4634     -1567.0594     -1567.0459

WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 10000
WRITING COORDINATES TO OUTPUT FILE AT STEP 10000
The last position output (seq=-2) takes 0.009 seconds, 580.863 MB of memory in use
WRITING VELOCITIES TO OUTPUT FILE AT STEP 10000
The last velocity output (seq=-2) takes 0.003 seconds, 581.035 MB of memory in use
====================================================

WallClock: 58.677982  CPUTime: 54.226124  Memory: 581.050781 MB
[Partition 0][Node 0] End of program

heatherkellyucl avatar Jan 31 '22 09:01 heatherkellyucl

Also worked and allocated cores correctly with 4 GPUs:

Pe 16 physical rank 16 binding to CUDA device 1 on node-l00a-001.myriad.ucl.ac.uk: 'A100-PCIE-40GB'  Mem: 40536MB  Rev: 8.0  PCI: 0:2f:0
Pe 32 physical rank 32 binding to CUDA device 3 on node-l00a-001.myriad.ucl.ac.uk: 'A100-PCIE-40GB'  Mem: 40536MB  Rev: 8.0  PCI: 0:d8:0
Pe 24 physical rank 24 binding to CUDA device 2 on node-l00a-001.myriad.ucl.ac.uk: 'A100-PCIE-40GB'  Mem: 40536MB  Rev: 8.0  PCI: 0:86:0
Pe 8 physical rank 8 binding to CUDA device 0 on node-l00a-001.myriad.ucl.ac.uk: 'A100-PCIE-40GB'  Mem: 40536MB  Rev: 8.0  PCI: 0:6:0
...
TIMING: 10000  CPU: 16.0314, 0.00159462/step  Wall: 16.2274, 0.001611/step, 0 hours remaining, 1397.472656 MB of memory in use.
ETITLE:      TS           BOND          ANGLE          DIHED          IMPRP               ELECT            VDW       BOUNDARY           MISC        KINETIC               TOTAL           TEMP      POTENTIAL         TOTAL3        TEMPAVG            PRESSURE      GPRESSURE         VOLUME       PRESSAVG      GPRESSAVG

ENERGY:   10000      2194.5660     11382.4649      5685.5128       189.6088        -306784.1328     19604.7429         0.0000         0.0000     35969.0949        -231758.1425       184.7185   -267727.2374   -231564.1706       183.9395          -1507.5271     -1479.5944    921491.4634     -1521.8678     -1521.8701

WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 10000
WRITING COORDINATES TO OUTPUT FILE AT STEP 10000
The last position output (seq=-2) takes 0.012 seconds, 1405.859 MB of memory in use
WRITING VELOCITIES TO OUTPUT FILE AT STEP 10000
The last velocity output (seq=-2) takes 0.007 seconds, 1406.047 MB of memory in use
====================================================

WallClock: 40.250298  CPUTime: 21.173227  Memory: 1406.062500 MB
[Partition 0][Node 0] End of program

heatherkellyucl avatar Jan 31 '22 10:01 heatherkellyucl

After discussion (I was partway through modifying our current Intel-based buildscripts into ones using GCC and CUDA): we should update the CUDA modules so they no longer require the GNU compiler to be loaded (because they don't actually need it) and only depend on gcc-libs. It should be fine to build programs with the Intel compiler against the newer gcc-libs.

(NAMD still does a fair amount of CPU computation in its GPU version, so the choice of non-CUDA host compiler still matters.)

It may be useful to test Intel + CUDA performance against GCC 10 + CUDA performance and so build both versions.

heatherkellyucl avatar Feb 02 '22 09:02 heatherkellyucl

Hmm, the Intel 2018 module has symlinks in its intel64/lib directory to a bunch of libraries that live in release_mt, including libmpi.so and libmpi.a.

Our Intel 2019 install does not have those symlinks, so Charm++, which links directly with -lmpi, cannot find them. (It does its build checks with icc rather than mpicxx.)

Intel restructured the layout in 2019. ~~Will add to module paths for the very few things that look for them.~~ That would be bad, at least for scalapack - we can sort it for this build process only.
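
A minimal sketch of the "sort it for this build process only" idea, assuming the release_mt libraries sit under the Intel 2019 MPI install in the usual place (the exact path is an assumption):

# Sketch only: expose Intel MPI's release_mt libraries to charmc for this
# build, without changing the module (IMPI_MT and its path are assumptions).
IMPI_MT=/shared/ucl/apps/intel/2019.Update5/impi/2019.5.281/intel64/lib/release_mt
export LIBRARY_PATH=$IMPI_MT${LIBRARY_PATH:+:$LIBRARY_PATH}
export LD_LIBRARY_PATH=$IMPI_MT${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}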

heatherkellyucl avatar Feb 02 '22 12:02 heatherkellyucl

Charm++ does not like that combo :(

/lustre/shared/ucl/apps/gcc/10.2.0-p95889/bin/../include/c++/10.2.0/bits/atomic_base.h(74): error: invalid redefinition of enum "std::memory_order" (declared at line 168 of "/lustre/shared/ucl/apps/intel/2019.Update5/compilers_and_libraries_2019.5.281/linux/compiler/include/stdatomic.h")
    typedef enum memory_order  

compilation aborted for DummyLB.C (code 2)
Fatal Error by charmc in directory /home/cceahke/namd/namd-2.14-cuda/NAMD_2.14_Source/charm-6.10.2/mpi-linux-x86_64-iccstatic/tmp
   Command icpc -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX -I../bin/../include -D__CHARMC__=1 -DCMK_OPTIMIZE -I. -xHost -O2 -U_FORTIFY_SOURCE -c DummyLB.C -o DummyLB.o returned error code 2
charmc exiting...

Building against an older gcc-libs underneath, without those newer C++ atomics headers, ought to work.
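
A sketch of what that might look like before rebuilding Charm++ (the gcc-libs version here is an assumption; any pre-C++11-atomics baseline would do):

# Sketch: drop back to an older gcc-libs so icpc does not pick up GCC 10's
# conflicting std::memory_order definitions (version is an assumption).
module unload -f gcc-libs
module load gcc-libs/4.9.2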

heatherkellyucl avatar Feb 02 '22 13:02 heatherkellyucl

Charm++ builds, then NAMD configure complains:

ERROR: MPI-based Charm++ arch mpi-linux-x86_64-iccstatic is not compatible with CUDA NAMD.
ERROR: Non-SMP Charm++ arch mpi-linux-x86_64-iccstatic is not compatible with CUDA NAMD.
ERROR: CUDA builds require non-MPI SMP or multicore Charm++ arch for reasonable performance.

Consider ucx-smp or verbs-smp (InfiniBand), gni-smp (Cray), or multicore (single node).
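
So the Charm++ target needs changing. A sketch of the multicore (single-node) build that the configure message points at (the base command follows the release-notes pattern; the icc option spelling is an assumption):

# Sketch: non-MPI, single-node Charm++ for a CUDA NAMD build
# (compiler option spelling is an assumption).
cd NAMD_2.14_Source/charm-6.10.2
./build charm++ multicore-linux-x86_64 icc --with-production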

heatherkellyucl avatar Feb 02 '22 14:02 heatherkellyucl

An InfiniBand network is highly recommended when running CUDA-accelerated NAMD across multiple nodes. You will need either an ibverbs NAMD binary (available for download) or an MPI NAMD binary (must build Charm++ and NAMD as described above) to make use of the InfiniBand network. The use of SMP binaries is also recommended when running on multiple nodes, with one process per GPU and as many threads as available cores, reserving one core per process for the communication thread.

Then in https://www.ks.uiuc.edu/Research/namd/2.14/ug/node102.html

Intel Omni-Path networks are incompatible with the pre-built ibverbs NAMD binaries. Charm++ for verbs can be built with -with-qlogic to support Omni-Path, but the Charm++ MPI network layer performs better than the verbs layer. Hangs have been observed with Intel MPI but not with OpenMPI, so OpenMPI is preferred. See ``Compiling NAMD'' below for MPI build instructions. NAMD MPI binaries may be launched directly with mpiexec rather than via the provided charmrun script.

Compiling NAMD:

We provide complete and optimized binaries for all common platforms to which NAMD has been ported. It should not be necessary for you to compile NAMD unless you wish to add or modify features or to improve performance by using an MPI library that takes advantage of special networking hardware.

Directions for compiling NAMD are contained in the release notes, which are available from the NAMD web site http://www.ks.uiuc.edu/Research/namd/ and are included in all distributions.

https://www.ks.uiuc.edu/Research/namd/2.14/ug/node104.html:

Shared-Memory and Network-Based Parallelism (SMP Builds)

The Linux-x86_64-ibverbs-smp and Solaris-x86_64-smp released binaries are based on ``smp'' builds of Charm++ that can be used with multiple threads on either a single machine like a multicore build, or across a network. SMP builds combine multiple worker threads and an extra communication thread into a single process. Since one core per process is used for the communication thread SMP builds are typically slower than non-SMP builds. The advantage of SMP builds is that many data structures are shared among the threads, reducing the per-core memory footprint when scaling large simulations to large numbers of cores.

SMP builds launched with charmrun use ++n to specify the total number of processes (Charm++ "nodes") and ++ppn to specify the number of PEs (Charm++ worker threads) per process. Previous versions required the use of +p to specify the total number of PEs, but the new ++n option is now recommended. Thus, to run one process with one communication and three worker threads on each of four quad-core nodes one would specify: charmrun namd2 ++n 4 ++ppn 3

For MPI-based SMP builds one would specify any mpiexec options needed for the required number of processes and pass +ppn to the NAMD binary as: mpiexec -n 4 namd2 +ppn 3
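
Translating that accounting into our SGE terms: each process needs its +ppn worker threads plus one communication thread, so the process count should be NSLOTS divided by (ppn + 1). A sketch, with illustrative variable names:

# Sketch: launch an MPI-based SMP build under SGE, reserving one core per
# process for the communication thread (variable names are illustrative).
PPN=3                              # worker threads (PEs) per process
NPROCS=$(( NSLOTS / (PPN + 1) ))   # one extra core per process for comms
mpirun -np "$NPROCS" namd2 +ppn "$PPN" apoa1.namd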

heatherkellyucl avatar Feb 02 '22 14:02 heatherkellyucl

When in doubt, check Compute Canada: https://docs.computecanada.ca/wiki/NAMD/en#Parallel_GPU_jobs

They use an OFI GPU build on their Omni-Path interconnect machine and UCX GPU builds on their InfiniBand machines.

If we need to do that, we aren't going to be able to test an OPA GPU build before the hardware arrives. (Can test a CPU-only parallel OFI build on Young).

heatherkellyucl avatar Feb 02 '22 16:02 heatherkellyucl

Top priority is now to get NAMD OFI CPU working on Young with gerun using charmrun - then we can make the OFI CUDA one work when the GPUs exist.
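
For the record, a sketch of the Charm++ build this implies, following the release-notes pattern (the exact option spellings and ordering are assumptions):

# Sketch: OFI SMP Charm++ for the CPU-only NAMD on Young
# (option names and ordering are assumptions).
cd NAMD_2.14_Source/charm-6.10.2
./build charm++ ofi-linux-x86_64 icc smp --with-production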

heatherkellyucl avatar Feb 02 '22 16:02 heatherkellyucl

Notes on charmrun and SGE: https://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdOnGridEngine

heatherkellyucl avatar Feb 03 '22 11:02 heatherkellyucl

This is quite handy, about building multiple versions and comparing them: https://docs.hpc.wvu.edu/text/609.CHARM++_NAMD.html

Note: in our Linux-x86_64-icc.arch we have -qopenmp-simd (and so does the link above; those are CPU versions). Compute Canada's builds set -qno-openmp-simd (https://github.com/ComputeCanada/easybuild-easyconfigs/tree/computecanada-main/easybuild/easyconfigs/n/NAMD). I don't know which we want.
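
If we do want a build without it, a sketch of the change (this assumes the flag appears literally in the arch file, as it does in ours):

# Sketch: make a no-simd variant of the Intel arch file before running ./config
sed -i 's/-qopenmp-simd/-qno-openmp-simd/' arch/Linux-x86_64-icc.arch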

(Currently have a charm++ ofi-smp built on Young, building namd).

heatherkellyucl avatar Feb 03 '22 15:02 heatherkellyucl

On Young:

  • [x] NAMD 2.14 ofi-smp with openmp-simd
  • [x] benchmarking jobs
  • [x] NAMD 2.14 ofi-smp without openmp-simd
  • [x] benchmarking jobs

On Myriad:

  • [x] NAMD 2.14 multicore CUDA binary
  • [ ] benchmarking jobs
  • [ ] NAMD 2.14 multicore CUDA from source (not much point, really)
  • [ ] benchmarking jobs

Extras

  • [ ] NAMD 2.14 ucx versions on Myriad (should have, is less urgent)

Run the apoa1 benchmark twice in the same job to remove the initial FFT optimisation when comparing timings, and look at the second set of times.

The smp version takes one extra thread per process for managing communication, reducing the number of usable cores, but it shares memory efficiently, allowing computations on larger systems. That is why having both smp and non-smp versions is recommended.

Should probably also have ofi and ucx versions without smp for this reason.
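
A sketch of the benchmark job body implementing the run-it-twice approach (commands match the ones used elsewhere in this thread; the log handling is illustrative):

# Sketch: run apoa1 twice in one job and only compare the second run's
# timings, so the initial FFT optimisation is excluded.
charmrun +p"$NSLOTS" namd2 apoa1.namd ++ppn2 > run1.log
charmrun +p"$NSLOTS" namd2 apoa1.namd ++ppn2 > run2.log
grep WallClock run2.log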

heatherkellyucl avatar Feb 04 '22 10:02 heatherkellyucl

Submitted a job with the first ofi-smp version.

heatherkellyucl avatar Feb 04 '22 16:02 heatherkellyucl

Young, apoa1, namd_ofismp_nosimd_12, end of second run

Running on 6 processors:  namd2 apoa1.namd ++ppn2 
charmrun>  /bin/setarch x86_64 -R  mpirun -np 6  namd2 apoa1.namd ++ppn2 
Charm++>ofi> provider: psm2
Charm++>ofi> control progress: 2
Charm++>ofi> data progress: 2
Charm++>ofi> maximum inject message size: 64
Charm++>ofi> eager maximum message size: 65536 (maximum header size: 40)
Charm++>ofi> cq entries count: 8
Charm++>ofi> use inject: 1
Charm++>ofi> maximum rma size: 4294967295
Charm++>ofi> mr mode: 0x1
Charm++>ofi> use memory pool: 0
Charm++>ofi> use request cache: 0
Charm++>ofi> number of pre-allocated recvs: 8
Charm++>ofi> exchanging addresses over OFI
Charm++> Running in SMP mode: 6 processes, 2 worker threads (PEs) + 1 comm threads per process, 12 PEs total
Charm++> The comm. thread both sends and receives messages
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.10.2-0-g7bf00fa-namd-charm-6.10.2-build-2020-Aug-05-556
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (2 sockets x 20 cores x 1 PUs = 40-way SMP)
Charm++> cpu topology info is gathered in 0.004 seconds.

Info: Benchmark time: 12 CPUs 0.0453018 s/step 0.524327 days/ns 892.32 MB memory
TIMING: 500  CPU: 23.5762, 0.0448887/step  Wall: 23.6368, 0.0450001/step, 0 hours remaining, 892.320312 MB of memory in use.
ETITLE:      TS           BOND          ANGLE          DIHED          IMPRP               ELECT            VDW       BOUNDARY           MISC        KINETIC               TOTAL           TEMP      POTENTIAL         TOTAL3        TEMPAVG            PRESSURE      GPRESSURE         VOLUME       PRESSAVG      GPRESSAVG

ENERGY:     500     20974.8941     19756.6582      5724.4523       179.8271        -337741.4181     23251.1002         0.0000         0.0000     45359.0771        -222495.4091       165.0039   -267854.4862   -222061.0909       165.0039          -3197.5173     -2425.4144    921491.4634     -3197.5173     -2425.4144

WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 500
WRITING COORDINATES TO OUTPUT FILE AT STEP 500
The last position output (seq=-2) takes 0.006 seconds, 896.949 MB of memory in use
WRITING VELOCITIES TO OUTPUT FILE AT STEP 500
The last velocity output (seq=-2) takes 0.012 seconds, 896.949 MB of memory in use
====================================================

WallClock: 25.280815  CPUTime: 25.181173  Memory: 896.949219 MB

namd_ofismp_12

Running on 6 processors:  namd2 apoa1.namd ++ppn2 
charmrun>  /bin/setarch x86_64 -R  mpirun -np 6  namd2 apoa1.namd ++ppn2 
Charm++>ofi> provider: psm2
Charm++>ofi> control progress: 2
Charm++>ofi> data progress: 2
Charm++>ofi> maximum inject message size: 64
Charm++>ofi> eager maximum message size: 65536 (maximum header size: 40)
Charm++>ofi> cq entries count: 8
Charm++>ofi> use inject: 1
Charm++>ofi> maximum rma size: 4294967295
Charm++>ofi> mr mode: 0x1
Charm++>ofi> use memory pool: 0
Charm++>ofi> use request cache: 0
Charm++>ofi> number of pre-allocated recvs: 8
Charm++>ofi> exchanging addresses over OFI
Charm++> Running in SMP mode: 6 processes, 2 worker threads (PEs) + 1 comm threads per process, 12 PEs total
Charm++> The comm. thread both sends and receives messages
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.10.2-0-g7bf00fa-namd-charm-6.10.2-build-2020-Aug-05-556
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (2 sockets x 20 cores x 1 PUs = 40-way SMP)
Charm++> cpu topology info is gathered in 0.004 seconds.

Info: Benchmark time: 12 CPUs 0.0471704 s/step 0.545953 days/ns 755.539 MB memory
TIMING: 500  CPU: 24.6084, 0.0468049/step  Wall: 24.6676, 0.0468979/step, 0 hours remaining, 755.539062 MB of memory in use.
ETITLE:      TS           BOND          ANGLE          DIHED          IMPRP               ELECT            VDW       BOUNDARY           MISC        KINETIC               TOTAL           TEMP      POTENTIAL         TOTAL3        TEMPAVG            PRESSURE      GPRESSURE         VOLUME       PRESSAVG      GPRESSAVG

ENERGY:     500     20974.8944     19756.6576      5724.4523       179.8271        -337741.4177     23251.0995         0.0000         0.0000     45359.0774        -222495.4094       165.0039   -267854.4868   -222061.0912       165.0039          -3197.5178     -2425.4147    921491.4634     -3197.5178     -2425.4147

WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 500
WRITING COORDINATES TO OUTPUT FILE AT STEP 500
The last position output (seq=-2) takes 0.026 seconds, 760.379 MB of memory in use
WRITING VELOCITIES TO OUTPUT FILE AT STEP 500
The last velocity output (seq=-2) takes 0.006 seconds, 760.379 MB of memory in use
====================================================

WallClock: 25.794182  CPUTime: 25.666487  Memory: 760.378906 MB

heatherkellyucl avatar Feb 08 '22 17:02 heatherkellyucl

Trying to get them working multi-node, I am somewhat confused: the charmrun that NAMD has installed calls mpirun itself, so the examples that suggest making a Charm-suitable hostfile like the following are not relevant:

# Convert SGE hostfile to charmrun hostfile.
nodefile=namd2.${JOB_ID}.nodelist
echo group main > $nodefile
awk '{ for (i=0;i<$2;++i) {print "host",$1} }' $PE_HOSTFILE >> $nodefile

charmrun ++remote-shell ssh ++nodelist $nodefile +p${NSLOTS} namd2 apoa1.namd ++ppn2 +setcpuaffinity

If you do that, you get errors because charmrun passes ++remote-shell (or ++nodelist, if you remove the former) straight through to mpirun, which treats it as the executable it is meant to run...

Oh well. We have mpirun as the launcher.

heatherkellyucl avatar Feb 15 '22 15:02 heatherkellyucl

Need to try getting qrsh to launch the correct number of processes on the nodes.

heatherkellyucl avatar Feb 16 '22 09:02 heatherkellyucl

Or not, actually: https://dl.acm.org/doi/pdf/10.1145/3219104.3219134 has been more useful than the main docs in terms of how things are meant to be run. There is a lot of information in the main docs, but ofi and especially ofi-cuda builds are fairly niche, so they aren't used as examples, and the segments given for each topic separately don't necessarily fit together (like the "no MPI-SMP for CUDA builds" part).

Builds based on MPI, Cray GNI, OFI, and IBM PAMI are launched the same as are MPI programs on the machine (mpirun, mpiexec, ibrun, aprun, jsrun, etc.).

Which does explain why we got a charmrun that does not have all the usual options used in examples.

heatherkellyucl avatar Feb 16 '22 11:02 heatherkellyucl

I think we're having a process mapping issue at the moment. (Charm's terminology is confusing).

I stuck some extra echoes into charmrun to make it more verbose. It is doing the processes vs threads division itself ($NSLOTS is 80 here).

pes orig: 80, ppn: 2
pes after: 40

Running on 40 processors:  -machinefile /tmpdir/job/556150.undefined/machines  namd2 apoa1.namd ++ppn2 +setcpuaffinity 
mpirun_cmd: /shared/ucl/apps/intel/2019.Update5/impi/2019.5.281/intel64/bin/mpirun

charmrun>  /bin/setarch x86_64 -R  mpirun -np 40  -machinefile /tmpdir/job/556150.undefined/machines  namd2 apoa1.namd ++ppn2 +setcpuaffinity 

Charm++> Running in SMP mode: 40 processes, 2 worker threads (PEs) + 1 comm threads per process, 80 PEs total
Charm++> The comm. thread both sends and receives messages
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.10.2-0-g7bf00fa-namd-charm-6.10.2-build-2020-Aug-05-556
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled. 
Charm++> Running on 1 hosts (2 sockets x 20 cores x 1 PUs = 40-way SMP)
Charm++> cpu topology info is gathered in 0.291 seconds.

Charm++> Warning: the number of SMP threads (120) is greater than the number of physical cores (80), so threads will sleep while idling. Use +CmiSpinOnIdle or +CmiSleepOnIdle to control this directly.

WARNING: Multiple PEs assigned to same core, recommend adjusting processor affinity or passing +CmiSleepOnIdle to reduce interference.

Info: 1 NAMD  2.14  Linux-x86_64-ofi-smp  80    node-c12f-001  cceahke
Info: Running on 80 processors, 40 nodes, 1 physical nodes.

Charm++ uses an unconventional internal nomenclature that may appear in NAMD startup and error messages. A Charm++ “PE” (processing element) is a worker thread (typically a POSIX thread running on a dedicated hardware thread). A Charm++ “node” is a process (a set of PEs sharing a memory space). A Charm++ “physical node” is a host (running a set of Charm++ nodes that share network interfaces and GPUs, and which are assumed to communicate faster among themselves than with other hosts).

For smp builds on the above platforms [MPI, OFI etc], the number of worker threads (PEs) per process must be specified as +ppn threads, e.g., “mpiexec -n 4 namd2 +ppn 7 ...” would launch 4 processes, each with 7 worker threads plus a communication thread, thus using a total of 32 cores (or hardware threads). Care must be taken to specify to the platform launch system the total number of threads (worker plus communication) for each process so that sufficient cores are reserved and affinity set, otherwise all of the NAMD threads may end up sharing a single core. Note that the PAMI and multicore platforms lack a separate communication thread and may thus use all cores for computation. When running multi-copy algorithms with NAMD, if each replica is a single process then the communication thread will sleep when idle and thus also does not require a dedicated core.

https://github.com/UIUC-PPL/charm/issues/2059 is about hwloc issues when setting +pemap and +comap for jobs, but does show some where it is incorrectly running on 1 host vs correctly running on 2 hosts. So we need to set something to get this using 2 hosts.

heatherkellyucl avatar Feb 16 '22 12:02 heatherkellyucl

(Was also wondering if it might do better automatically with the $TMPDIR/machines.unique instead since I loaded an Intel MPI here).

heatherkellyucl avatar Feb 16 '22 12:02 heatherkellyucl

I should check what the ofi-only version with no smp does on two nodes, then come back to this one.

heatherkellyucl avatar Feb 16 '22 13:02 heatherkellyucl

Ahahah, the ofi-smp job that was already in the queue using machines.unique has worked:

Charm++> Running on 2 hosts (2 sockets x 20 cores x 1 PUs = 40-way SMP)

Info: 1 NAMD  2.14  Linux-x86_64-ofi-smp  80    node-c12i-001  cceahke
Info: Running on 80 processors, 40 nodes, 2 physical nodes.

and that was

charmrun -machinefile $TMPDIR/machines.unique +p${NSLOTS} namd2 apoa1.namd ++ppn2 +setcpuaffinity
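
For reference, a sketch of the full job body that corresponds to (the scheduler directives are assumptions; the module names match the install recorded later in this thread):

#!/bin/bash -l
# Sketch only: two-node ofi-smp apoa1 job (PE size, wallclock and directives
# are assumptions; module names match the eventual install).
#$ -pe mpi 80
#$ -l h_rt=2:00:00
#$ -cwd
module unload -f compilers mpi
module load compilers/intel/2019/update5
module load mpi/intel/2019/update5/intel
module load namd/2.14/ofi-smp/intel-2019
charmrun -machinefile "$TMPDIR/machines.unique" +p"$NSLOTS" namd2 apoa1.namd ++ppn2 +setcpuaffinity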

heatherkellyucl avatar Feb 16 '22 15:02 heatherkellyucl

Did some comparisons of options (4 repeats of each in one job, ignoring the results from the first repeat, where it does the FFT optimisation). Yes, use +setcpuaffinity for smp jobs! I wanted to make sure it was still helpful even if you don't give any more explicit mappings.

80 cores (ofi-smp, charmrun +p${NSLOTS} namd2 apoa1.namd ++ppn2 +setcpuaffinity):

The last velocity output (seq=-2) takes 0.002 seconds, 789.320 MB of memory in use
WallClock: 104.270256  CPUTime: 86.148232  Memory: 789.320312 MB

The last velocity output (seq=-2) takes 0.002 seconds, 783.336 MB of memory in use
WallClock: 104.243652  CPUTime: 84.784187  Memory: 783.335938 MB

The last velocity output (seq=-2) takes 0.003 seconds, 783.262 MB of memory in use
WallClock: 105.437546  CPUTime: 85.179848  Memory: 783.261719 MB

80 cores (ofi-smp, charmrun +p${NSLOTS} namd2 apoa1.namd ++ppn2):

The last velocity output (seq=-2) takes 0.003 seconds, 789.391 MB of memory in use
WallClock: 148.292145  CPUTime: 101.689720  Memory: 789.390625 MB

The last velocity output (seq=-2) takes 0.003 seconds, 789.121 MB of memory in use
WallClock: 147.651932  CPUTime: 103.590492  Memory: 789.121094 MB

The last velocity output (seq=-2) takes 0.003 seconds, 792.082 MB of memory in use
WallClock: 152.005264  CPUTime: 101.689995  Memory: 792.082031 MB

heatherkellyucl avatar Feb 18 '22 09:02 heatherkellyucl

The simd vs nosimd comparison is inconclusive, so we may as well leave it on.

heatherkellyucl avatar Feb 18 '22 10:02 heatherkellyucl

Final (CPU) builds on Young:

  • [x] ofi-smp
  • [x] ofi
  • [x] charmrun wrapper(s)
  • [x] modulefiles
  • [x] user docs

heatherkellyucl avatar Feb 18 '22 10:02 heatherkellyucl

Testing the charmrun wrapper.
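
A sketch of the sort of thing the wrapper needs to do (this is a guess at its contents, not the actual wrapper):

#!/bin/bash
# Sketch only: assumed wrapper behaviour, not the real script. Pass the SGE
# machine file and slot count through to NAMD's bundled charmrun, forwarding
# everything else (namd2, input file, ++ppn, +setcpuaffinity) untouched.
exec charmrun -machinefile "$TMPDIR/machines.unique" +p"$NSLOTS" "$@"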

heatherkellyucl avatar Feb 21 '22 16:02 heatherkellyucl

Install on:

  • [x] Young
  • [x] Kathleen
  • [x] Thomas (for Michael)

Modules are:

module unload -f compilers mpi
module load compilers/intel/2019/update5
module load mpi/intel/2019/update5/intel
module load namd/2.14/ofi/intel-2019

or, for the SMP version:

module unload -f compilers mpi
module load compilers/intel/2019/update5
module load mpi/intel/2019/update5/intel
module load namd/2.14/ofi-smp/intel-2019

heatherkellyucl avatar Feb 22 '22 11:02 heatherkellyucl

Having an extra test job submitted using the central install on Young, just in case. (Worked).

heatherkellyucl avatar Feb 22 '22 11:02 heatherkellyucl

A variety of job sizes are also in the queue, so we can get a rough benchmark of up to and including 5 nodes.

heatherkellyucl avatar Feb 23 '22 15:02 heatherkellyucl