pair_allegro
"Non-numeric pressure" error (pressure = -nan) when attempting NPT in LAMMPS
I'm trying to run NPT calculations in LAMMPS using pair_allegro, but the Allegro model I trained predicts -nan for the system pressure, so the NPT run fails. If I run under NVE or NVT with the press property included in the LAMMPS thermo logging, I see output like this:
Step Time Temp Press
0 0 450 -nan
1 0.001 450.54175 -nan
2 0.002 444.03951 -nan
3 0.003 440.5265 -nan
If I attempt an NPT calculation, like this:
# Run MD in the NPT ensemble, with a Nosé-Hoover thermostat starting at 450.0 K.
units metal
fix mynose all npt &
temp 450.0 450.0 0.05 &
tchain 3 &
iso 1.0 1.0 0.3
run 1000
I immediately get this error:
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.0005
ERROR: Non-numeric pressure - simulation unstable (src/fix_nh.cpp:1049)
I am training on an ASE dataset with stress information stored in units of energy / length^3:
In [11]: from ase.io import read
In [12]: traj = read("./trainval.traj", index=0)
In [13]: traj.get_stress()
Out[13]:
array([ 0.00203341, 0.00251201, 0.0009121 , -0.00069527, 0.00031445,
-0.00031556])
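For reference, ASE's get_stress() returns the six-component Voigt vector (xx, yy, zz, yz, xz, xy) in eV/Å^3. A quick sanity check of the labels might look like this (a sketch, assuming a recent ASE version that provides the ase.stress helpers):
import numpy as np
from ase.io import read
from ase.stress import voigt_6_to_full_3x3_stress

atoms = read("./trainval.traj", index=0)
s = atoms.get_stress()  # Voigt order: xx, yy, zz, yz, xz, xy, in eV/A^3
print(voigt_6_to_full_3x3_stress(s))  # full symmetric 3x3 tensor
print(np.isnan(s).any())              # flag any non-numeric labels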
I can also confirm that when using nequip-evaluate for inference with this same deployed model, I do get predicted stresses in the output file:
$ nequip-evaluate --train-dir <my training dir> --test-indexes test-indices.json --model test-64bit-deployed-model.pth --output output.xyz
$ head -n 3 output.xyz
440
Lattice="16.44719886779785 0.0 0.0 0.0 16.44719886779785 0.0 0.0 0.0 16.44719886779785" Properties=species:S:1:pos:R:3:energies:R:1:forces:R:3 original_dataset_index=1 energy=-2975.787905704823 stress="0.0008169945453196768 -0.00012941664369788713 0.0009707167252817042 -0.00012941664369788713 0.0004451946509430406 -0.0010155815638519358 0.0009707167252817042 -0.0010155815638519358 -0.00377564117326529" pbc="F F F"
O -2.82277608 -6.28169346 9.18250656 -6.48192610 0.01213487 -0.06158887 -0.59497979
This problem occurs regardless of whether I use default_dtype: float32 in the config (with pair_style allegro3232 in the LAMMPS script) or default_dtype: float64 (with pair_style allegro). I am using NequIP 0.6.1, mir-allegro 0.2.0, and PyTorch 1.11.0 (CUDA 11.3, cuDNN 8.2.0). I compiled LAMMPS 02Aug23 against pair_allegro commit 20538c9, the commit that introduced stress support. Details of my LAMMPS compilation, an example training config (apart from the default_dtype setting), and an example NVT LAMMPS input script are here.
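As a side note, the precision a model was deployed with is recorded in its TorchScript metadata, so the pairing with pair_style allegro vs. allegro3232 can be double-checked from Python (a sketch, assuming the NequIP 0.6.x deploy API; the path is a placeholder):
from nequip.scripts.deploy import load_deployed_model

model, metadata = load_deployed_model("test-64bit-deployed-model.pth", device="cpu")
print(metadata.get("default_dtype"))  # float32 -> allegro3232, float64 -> allegro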
I am forcing deletion of any overlapping atoms prior to the NPT run, and I see no indication when running under NVT or NVE that atoms are too close together, have very high forces, or are otherwise making the simulation unstable. If I switch my LAMMPS input to pair_style lj/cut, I do observe pressures in the thermo output.
Is there something obvious I'm missing about how to get pair_allegro to pass the stress predictions from my Allegro models to LAMMPS?
Thanks!
Hi @samueldyoung29ctr ,
Hm, odd. If you start a simulation in LAMMPS from exactly a frame that shows up in your training data, or the one that shows a stress tensor in nequip-evaluate, do you get the expected result at least on that first frame?
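For example, one way to export such a frame for LAMMPS (a sketch, assuming a recent ASE lammps-data writer; file names are placeholders):
from ase.io import read, write

# Dump the first training frame so the LAMMPS run starts from a
# configuration with a known reference stress.
frame = read("./trainval.traj", index=0)
write("frame0.data", frame, format="lammps-data", units="metal", atom_style="atomic")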
All of the starting geometries I launch from LAMMPS appear to show -nan pressure, even on the first step. I tried printing the virial tensor elements directly as they come from the torch model output, just after they are passed to LAMMPS.
Diagnostic printing of virial tensor elements
pair_allegro.cpp:
if (vflag) {
torch::Tensor v_tensor = output.at("virial").toTensor().cpu();
auto v = v_tensor.accessor<outputtype, 3>();
// Convert from 3x3 symmetric tensor format, which NequIP outputs, to the flattened form LAMMPS expects
// First [0] index on v is batch
virial[0] = v[0][0][0];
virial[1] = v[0][1][1];
virial[2] = v[0][2][2];
virial[3] = v[0][0][1];
virial[4] = v[0][0][2];
virial[5] = v[0][1][2];
+ std::cout << "\tVirial Voigt vector: " << std::to_string(virial[0]) << ", " << std::to_string(virial[1]) << ", " << std::to_string(virial[2]) << ", " << std::to_string(virial[3]) << ", " << std::to_string(virial[4]) << ", " << std::to_string(virial[5]) << ".\n";
}
pair_allegro_kokkos.cpp:
if(vflag){
torch::Tensor v_tensor = output.at("virial").toTensor().cpu();
auto v = v_tensor.accessor<outputtype, 3>();
// Convert from 3x3 symmetric tensor format, which NequIP outputs, to the flattened form LAMMPS expects
// First [0] index on v is batch
this->virial[0] = v[0][0][0];
this->virial[1] = v[0][1][1];
this->virial[2] = v[0][2][2];
this->virial[3] = v[0][0][1];
this->virial[4] = v[0][0][2];
this->virial[5] = v[0][1][2];
+ std::cout << "\tVirial Voigt vector: " << std::to_string(this->virial[0]) << ", " << std::to_string(this->virial[1]) << ", " << std::to_string(this->virial[2]) << ", " << std::to_string(this->virial[3]) << ", " << std::to_string(this->virial[4]) << ", " << std::to_string(this->virial[5]) << ".\n";
}
All virial components are -nan in the Kokkos-accelerated version, leading to the -nan pressure.
Outputs from running 10 steps of NVE
using pair_allegro:
Per MPI rank memory allocation (min/avg/max) = 4.907 | 4.907 | 4.907 Mbytes
Step Time Temp Press PotEng KinEng TotEng E_pair E_bond Econserve Fmax
0 0 450 7123.5399 -23812.81 204.6899 -23608.121 -23812.81 0 -23608.121 4.6981153
Virial Voigt vector: 29.758970, 45.364822, 7.063734, 3.145753, -8.978357, -9.596899.
1 0.0005 451.08768 7333.0305 -23813.316 205.18465 -23608.132 -23813.316 0 -23608.132 5.067274
Virial Voigt vector: 39.200824, 45.856633, 13.364390, 5.394424, -21.287504, -7.662407.
2 0.001 451.81109 7584.5196 -23813.664 205.5137 -23608.15 -23813.664 0 -23608.15 4.8237129
Virial Voigt vector: 49.247942, 45.048843, 23.481787, 7.260359, -31.039119, -4.989960.
3 0.0015 451.6869 7871.0145 -23813.617 205.45722 -23608.159 -23813.617 0 -23608.159 4.4277271
Virial Voigt vector: 56.195390, 42.077853, 37.043569, 7.770809, -36.851748, -2.044238.
4 0.002 450.46312 8115.5436 -23813.056 204.90056 -23608.155 -23813.056 0 -23608.155 4.4915488
Virial Voigt vector: 59.136834, 37.968314, 54.470379, 6.436822, -38.168621, 0.849517.
5 0.0025 448.37065 8329.2582 -23812.081 203.94876 -23608.133 -23812.081 0 -23608.133 4.3433504
Virial Voigt vector: 57.057529, 36.101378, 73.077232, 3.659967, -35.406512, 3.224243.
6 0.003 446.1616 8517.6018 -23811.049 202.94394 -23608.105 -23811.049 0 -23608.105 4.5180411
Virial Voigt vector: 48.109272, 36.321093, 88.279174, 0.041631, -27.643993, 4.904825.
7 0.0035 444.84745 8596.1772 -23810.426 202.34618 -23608.08 -23810.426 0 -23608.08 5.0060566
Virial Voigt vector: 33.680779, 37.114159, 98.805940, -2.907780, -15.261765, 3.495411.
8 0.004 445.1429 8553.8979 -23810.565 202.48057 -23608.085 -23810.565 0 -23608.085 5.2285217
Virial Voigt vector: 19.102289, 39.672924, 101.937126, -5.476006, -0.462937, 0.214057.
9 0.0045 446.96257 8446.2134 -23811.421 203.30828 -23608.113 -23811.421 0 -23608.113 5.5230971
Virial Voigt vector: 9.956778, 44.490638, 97.406353, -8.867623, 14.901005, -3.847867.
10 0.005 449.42526 8347.684 -23812.577 204.42847 -23608.148 -23812.577 0 -23608.148 5.7109114
Loop time of 4.88647 on 1 procs for 10 steps with 3520 atoms
and I can do NPT.
Output when running pair_allegro/kk:
Per MPI rank memory allocation (min/avg/max) = 3.84 | 3.84 | 3.84 Mbytes
Step Time Temp Press PotEng KinEng TotEng E_pair E_bond Econserve Fmax
0 0 450 -nan -23812.81 204.6899 -23608.121 -23812.81 0 -23608.121 4.6981153
Virial Voigt vector: -nan, -nan, -nan, -nan, -nan, -nan.
1 0.0005 451.08768 -nan -23813.316 205.18465 -23608.132 -23813.316 0 -23608.132 5.067274
Virial Voigt vector: -nan, -nan, -nan, -nan, -nan, -nan.
2 0.001 451.81109 -nan -23813.664 205.5137 -23608.15 -23813.664 0 -23608.15 4.8237129
Virial Voigt vector: -nan, -nan, -nan, -nan, -nan, -nan.
3 0.0015 451.6869 -nan -23813.617 205.45722 -23608.159 -23813.617 0 -23608.159 4.4277271
Virial Voigt vector: -nan, -nan, -nan, -nan, -nan, -nan.
4 0.002 450.46312 -nan -23813.056 204.90056 -23608.155 -23813.056 0 -23608.155 4.4915488
Virial Voigt vector: -nan, -nan, -nan, -nan, -nan, -nan.
5 0.0025 448.37065 -nan -23812.081 203.94877 -23608.133 -23812.081 0 -23608.133 4.3433504
Virial Voigt vector: -nan, -nan, -nan, -nan, -nan, -nan.
6 0.003 446.1616 -nan -23811.049 202.94394 -23608.105 -23811.049 0 -23608.105 4.5180411
Virial Voigt vector: -nan, -nan, -nan, -nan, -nan, -nan.
7 0.0035 444.84745 -nan -23810.426 202.34618 -23608.08 -23810.426 0 -23608.08 5.0060566
Virial Voigt vector: -nan, -nan, -nan, -nan, -nan, -nan.
8 0.004 445.1429 -nan -23810.565 202.48057 -23608.085 -23810.565 0 -23608.085 5.2285217
Virial Voigt vector: -nan, -nan, -nan, -nan, -nan, -nan.
9 0.0045 446.96257 -nan -23811.421 203.30828 -23608.113 -23811.421 0 -23608.113 5.5230971
Virial Voigt vector: -nan, -nan, -nan, -nan, -nan, -nan.
10 0.005 449.42526 -nan -23812.577 204.42847 -23608.148 -23812.577 0 -23608.148 5.7109114
Loop time of 4.88749 on 1 procs for 10 steps with 3520 atoms
and the same error message when attempting NPT:
Setting up Verlet run ...
Unit style : metal
Current step : 10
Time step : 0.0005
Virial Voigt vector: -nan, -nan, -nan, -nan, -nan, -nan.
ERROR: Non-numeric pressure - simulation unstable (src/fix_nh.cpp:1049)
Looks like the workaround is to use the non-Kokkos pair_allegro for now. I am building and linking against cxx11 libtorch 2.0.0-cpuonly, as this is what NERSC's build uses. My build of LAMMPS is CPU-only due to resource constraints on the cluster I use. I do see in pair_allegro.cpp that an additional compute_custom_tensor is packed into the model input that doesn't exist in the Kokkos version; perhaps this has something to do with enabling virial model outputs?
https://github.com/mir-group/pair_allegro/blob/20538c9fd308bd0d066a716805f6f085a979c741/pair_allegro.cpp#L451-L455
Happy to do some more testing if you'd like.
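One more check that could isolate the model from pair_allegro entirely: evaluate the deployed model through ASE and inspect the predicted stress for NaNs (a sketch, assuming the NequIP 0.6.x ASE calculator; paths and the species mapping are placeholders):
import numpy as np
from ase.io import read
from nequip.ase import NequIPCalculator

atoms = read("./trainval.traj", index=0)
atoms.calc = NequIPCalculator.from_deployed_model(
    "test-64bit-deployed-model.pth",
    species_to_type_name={"H": "H", "O": "O"},
    device="cpu",
)
s = atoms.get_stress()  # eV/A^3, Voigt order
print(s, np.isnan(s).any())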
My build of LAMMPS is CPU-only due to resource constraints on the cluster I use.
Aha. I don't think we've ever actually tested Kokkos pair_allegro on CPU, nor am I sure we'd expect it to have any benefit over the "plain" OpenMP pair_allegro on CPU. @anjohan thoughts?
Still, I guess we would have expected it to work...
compute_custom_tensor should not be relevant for virials.
Also, just to clarify, we should be training Allegro models on stresses in units of energy / length^3, right? E.g., for LAMMPS metal units, we should train Allegro on stresses in units of eV/ang^3, not units of bar?
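(For metal units the two differ by a large factor, which is easy to check with ASE's unit system; a sketch:)
from ase import units

# ASE and NequIP work in eV and Angstrom, so stress labels are eV/A^3;
# LAMMPS metal units report pressure in bar.
print(1.0 / units.bar)  # ~1.6022e6, i.e. 1 eV/A^3 is about 1.6022e6 bar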
Update: I got around to compiling LAMMPS with CUDA support, but I am still seeing this issue when using Kokkos on the GPUs (NVIDIA A100-SXM4-40GB, with CUDA 12.4 drivers installed).
- LAMMPS version: 02Aug23 (commit 46265e36c in official LAMMPS repo)
- pair_allegro version: commit 20538c9 (#43)
- Build environment:
- CUDA 11.6 (CUDA 12.4 installation is broken on my cluster)
- GCC 10 compilers
- AMD AOCL 4.0, providing non-Kokkos FFTW3, compiled by GCC
- NVIDIA cuFFT (from CUDA 11.6), providing Kokkos FFTs
- Penguin TrueHPC OpenMPI 4.1.4, compiled by GCC
- Intel MKL 2024.1.0 (not used for FFTs, but just to provide bindings to allow compilation to succeed)
- NVIDIA cuDNN 9.2.0.82_cuda11
- NVIDIA libtorch 1.13.1+cu116-cxx11-abi (static, with deps)
My CMake config looks like this:
CMake config
cmake ../cmake \
-D LAMMPS_EXCEPTIONS=ON \
-D BUILD_SHARED_LIBS=ON \
-D BUILD_MPI=yes \
-D BUILD_OMP=yes \
-C ../cmake/presets/gcc.cmake \
-C ../cmake/presets/kokkos-cuda.cmake \
-D PKG_KOKKOS=yes \
-D Kokkos_ARCH_ZEN3=yes \
-D Kokkos_ARCH_PASCAL60=no \
-D Kokkos_ARCH_AMPERE80=yes \
-D Kokkos_ENABLE_CUDA=yes \
-D Kokkos_ENABLE_OPENMP=yes \
-D CUFFT_LIBRARY=$CUDA_HOME/lib64/libcufft.so \
-D CUDA_INCLUDE_DIRS=$CUDA_HOME/include \
-D CUDA_CUDART_LIBRARY=$CUDA_HOME/lib64/libcudart.so \
-D CAFFE2_USE_CUDNN=ON \
-D BUILD_TOOLS=no \
-D FFT=FFTW3 \
-D FFT_KOKKOS=CUFFT \
-D FFTW3_INCLUDE_DIR=$AOCL_ROOT/include \
-D FFTW3_LIBRARY=$AOCL_LIB/libfftw3.so \
-D FFTW3_OMP_LIBRARY=$AOCL_LIB/libfftw3_omp.so \
-D CMAKE_INSTALL_PREFIX="$LAMMPS_ROOT" \
-D PKG_MANYBODY=yes \
-D PKG_MOLECULE=yes \
-D PKG_KSPACE=yes \
-D PKG_REPLICA=yes \
-D PKG_ASPHERE=yes \
-D PKG_RIGID=yes \
-D PKG_MPIIO=yes \
-D PKG_COMPRESS=yes \
-D PKG_H5MD=no \
-D PKG_OPENMP=yes \
-D CMAKE_POSITION_INDEPENDENT_CODE=yes \
-D CMAKE_EXE_FLAGS="-dynamic" \
-D CMAKE_VERBOSE_MAKEFILE=TRUE
I am building a non-debug version of LAMMPS because the build fails when I try to enable debugging symbols, no matter whether I use GCC or NVHPC compilers.
The Allegro model I am using was trained using NequIP 0.6.1, mir-allegro 0.2.0, and PyTorch 1.11.0 (py3.10_cuda11.3_cudnn8.2.0_0). It was also trained on an NVIDIA A100-SXM4-40GB GPU.
Allegro training config
BesselBasis_trainable: true
PolynomialCutoff_p: 48
append: true
ase_args:
format: traj
avg_num_neighbors: auto
batch_size: 1
chemical_symbols:
- H
- O
dataset: ase
dataset_file_name: <path to train+val dataset as .traj file>
dataset_seed: 123456
default_dtype: float64
early_stopping_lower_bounds:
LR: 1.0e-05
early_stopping_patiences:
validation_loss: 100
early_stopping_upper_bounds:
cumulative_wall: 604800.0
edge_eng_mlp_initialization: uniform
edge_eng_mlp_latent_dimensions:
- 32
edge_eng_mlp_nonlinearity: null
ema_decay: 0.999
ema_use_num_updates: true
embed_initial_edge: true
env_embed_mlp_initialization: uniform
env_embed_mlp_latent_dimensions: []
env_embed_mlp_nonlinearity: null
env_embed_multiplicity: 64
l_max: 2
latent_mlp_initialization: uniform
latent_mlp_latent_dimensions:
- 64
- 64
- 64
- 64
latent_mlp_nonlinearity: silu
latent_resnet: true
learning_rate: 0.001
loss_coeffs:
forces:
- 1
- PerSpeciesL1Loss
stress: 5000
total_energy:
- 20.0
- PerAtomL1Loss
lr_scheduler_kwargs:
cooldown: 0
eps: 1.0e-08
factor: 0.9
min_lr: 0
mode: min
patience: 400
threshold: 0.0001
threshold_mode: rel
verbose: false
lr_scheduler_name: ReduceLROnPlateau
max_epochs: 50000
metrics_components:
- - forces
- mae
- PerSpecies: true
report_per_component: false
- - forces
- rmse
- PerSpecies: true
report_per_component: false
- - forces
- rmse
- - forces
- mae
- - total_energy
- mae
- PerAtom: true
- - total_energy
- mae
- PerAtom: true
- - total_energy
- rmse
- PerAtom: true
- - stress
- mae
- - stress
- rmse
metrics_key: validation_loss
model_builders:
- allegro.model.Allegro
- PerSpeciesRescale
- StressForceOutput
- RescaleEnergyEtc
n_train: 2250
n_val: 250
num_layers: 2
optimizer_kwargs:
amsgrad: false
betas: !!python/tuple
- 0.9
- 0.999
eps: 1.0e-08
weight_decay: 0.0
optimizer_name: Adam
parity: o3_full
r_max: 4.0
root: results/wateronly
run_name: hpo-2f39852b9d648fa732723543b02d3ca4c3581ddc
seed: 123456
shuffle: true
train_val_split: random
two_body_latent_mlp_initialization: uniform
two_body_latent_mlp_latent_dimensions:
- 32
- 64
two_body_latent_mlp_nonlinearity: silu
use_ema: true
verbose: debug
wandb: true
wandb_project: <project name>
After deploying the best model to TorchScript format, I attempted to use it in a LAMMPS NPT simulation. The input geometry is a water-only system.
geom-thermalized-298.15K.data
LAMMPS data file via write_data, version 2 Aug 2023, timestep = 10000, units = metal
5184 atoms
2 atom types
0 38.7494136435 xlo xhi
0 38.7494136435 ylo yhi
0 38.7494136435 zlo zhi
Masses
1 1.008
2 15.999
Atoms # atomic
4014 1 2.016863143251884 1.935372548383521 4.290501263898246 1 0 0
3979 2 2.5980102214720313 2.127124762222952 3.5158172054292423 1 0 0
3980 1 2.426726184113488 1.330444453601437 2.933484352485481 1 0 0
4235 1 0.4427546772570891 1.8248251034116987 0.3139859162106747 1 0 1
306 2 4.479768747024378 3.0437024573024676 1.0988955165290661 0 0 1
18 1 3.7138355554631155 2.7497739123432603 0.5598095055376706 0 0 0
10 2 5.838633127133152 1.38484923725688 2.427887300431939 0 0 0
1003 1 5.019716962999001 1.8773596976102618 2.24263102951684 0 1 0
57 1 5.1020184395333015 0.2780756937364116 4.697073480526178 0 0 0
58 1 4.189152573156522 1.4148146498389342 5.047283101683429 0 0 0
11 1 6.320533570258894 2.2046199287416948 2.4275733453782147 0 0 0
...
There are no atom overlaps in this geometry, and the LAMMPS input script attempts to do NPT.
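A quick way to double-check for close contacts outside LAMMPS (a sketch, assuming ASE's lammps-data reader; the 0.5 Å threshold is arbitrary):
from ase.io import read
from ase.neighborlist import neighbor_list

atoms = read("geom-thermalized-298.15K.data", format="lammps-data",
             units="metal", atom_style="atomic")
i, j, d = neighbor_list("ijd", atoms, cutoff=0.5)
print("pairs closer than 0.5 A:", len(i) // 2)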
input.lammps
# LAMMPS script for our MD systems to validate Allegro potentials
# System-wide settings
units metal
dimension 3
atom_style atomic
boundary p p p
# System geometry
# initial_frame.data will be written into the working directory where this
# script is located.
read_data ./geom-thermalized-298.15K.data
# Simulation settings
mass 1 1.008
mass 2 15.999
pair_style allegro
pair_coeff * * ./hpo-2f39852b9d648fa732723543b02d3ca4c3581ddc.pth H O
# PART B - MOLECULAR DYNAMICS
delete_atoms overlap 0.1 all all
# Logging
thermo 1
thermo_style custom step time temp press pe ke etotal epair ebond econserve fmax
# Try to rebuild neighbor lists more often
neigh_modify every 1 delay 0 check yes binsize 10.0
# Also try to specify larger cutoff for ghost atoms to avoid losing atoms.
comm_modify mode single cutoff 10.0 vel yes
# Try specifying initial velocities for all atoms
velocity all create 298.15 3127835 dist gaussian
# Run MD in the NPT ensemble, with a Nosé-Hoover thermostat starting at 298.15 K and a barostat starting at 1.01325 bar.
fix mynose all npt &
temp 298.15 298.15 0.011 &
tchain 3 &
iso 1.01325 1.01325 0.03
# Be sure to dump the MD trajectory
dump mdtraj all atom 40 mdtraj.lammpstrj
dump mdforces all custom 40 mdforces.lammpstrj id x y z vx vy vz fx fy fz
timestep 0.0005
# Set up binary restart dumps every 1000 steps in case something goes wrong.
restart 1000 step-*.restart
# Normal run, with a single balance first
balance 1.0 shift xyz 100 1.0
run 20000
undump mdtraj
undump mdforces
# Finally, write out the final geometry of the system
write_data geom-equilibrated-1atm.data
I invoke LAMMPS with Kokkos:
srun --cpu-bind=cores --gpu-bind=none lmp -k on g 4 -sf kk -pk kokkos neigh full newton on -in input.lammps
This results in the following error:
LAMMPS stdout
Module cudnn/linux-x86_64-9.2.0.82_cuda11 is loading...
Module libtorch/1.13.1+cu116-cxx11-abi is loading...
Module lammps-tpc/2Aug23 is loading...
Loading lammps-tpc/2Aug23/gcc10-allegro-gpu
Loading requirement: slurm scl/gcc-toolset-10 amd/aocl/gcc/4.0 cuda/cuda-11.6
penguin/openmpi/4.1.4/gcc intel/tbb/2021.12 intel/compiler-rt/2024.1.0
intel/mkl/2024.1 intel/oneapi-2024.1.0-mkl
cudnn/linux-x86_64-9.2.0.82_cuda11 libtorch/1.13.1+cu116-cxx11-abi
LAMMPS (2 Aug 2023 - Update 3)
KOKKOS mode with Kokkos version 3.7.2 is enabled (src/KOKKOS/kokkos.cpp:108)
will use up to 4 GPU(s) per node
WARNING: Turning off GPU-aware MPI since it is not detected, use '-pk kokkos gpu/aware on' to override (src/KOKKOS/kokkos.cpp:316)
using 1 OpenMP thread(s) per MPI task
Reading data file ...
orthogonal box = (0 0 0) to (38.749414 38.749414 38.749414)
1 by 2 by 2 MPI processor grid
reading atoms ...
5184 atoms
reading velocities ...
5184 velocities
read_data CPU = 0.030 seconds
Allegro is using input precision f and output precision d
Allegro: Loading model from ./hpo-2f39852b9d648fa732723543b02d3ca4c3581ddc.pth
Allegro: Freezing TorchScript model...
Type mapping:
Allegro type | Allegro name | LAMMPS type | LAMMPS name
0 | H | 1 | H
1 | O | 2 | O
ti=0 tj=0 cut=4.00
ti=0 tj=1 cut=4.00
ti=1 tj=0 cut=4.00
ti=1 tj=1 cut=4.00
System init for delete_atoms ...
Neighbor list info ...
update: every = 1 steps, delay = 0 steps, check = yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 6
ghost atom cutoff = 6
binsize = 6, bins = 7 7 7
2 neighbor lists, perpetual/occasional/extra = 1 1 0
(1) command delete_atoms, occasional, copy from (2)
attributes: full, newton on
pair build: copy/kk/device
stencil: none
bin: none
(2) pair allegro/kk, perpetual
attributes: full, newton on, kokkos_device
pair build: full/bin/kk/device
stencil: full/bin/3d
bin: kk/device
Deleted 0 atoms, new total = 5184
Balancing ...
Neighbor list info ...
update: every = 1 steps, delay = 0 steps, check = yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 6
ghost atom cutoff = 10
binsize = 10, bins = 4 4 4
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair allegro/kk, perpetual
attributes: full, newton on, kokkos_device
pair build: full/bin/kk/device
stencil: full/bin/3d
bin: kk/device
rebalancing time: 0.004 seconds
iteration count = 26
initial/final maximal load/proc = 1369 1329
initial/final imbalance factor = 1.0563272 1.025463
x cuts: 0 1
y cuts: 0 0.4954834 1
z cuts: 0 0.49346924 1
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.0005
ERROR: Non-numeric pressure - simulation unstable (src/fix_nh.cpp:1049)
Last command: run 20000
If I instead change to NVT like this:
fix mynose all nvt &
temp 298.15 298.15 0.011 &
tchain 3 &
# iso 1.01325 1.01325 0.03
and again run with Kokkos, the run starts, but with nan as the pressure:
LAMMPS stdout with NVT, using Kokkos
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.0005
Per MPI rank memory allocation (min/avg/max) = 3.543 | 3.546 | 3.549 Mbytes
Step Time Temp Press PotEng ...
0 0 298.15 nan -29784.777 ...
1 0.0005 296.93805 nan -29783.946 ...
2 0.001 296.03497 nan -29783.33 ...
3 0.0015 295.55724 nan -29783.005 ...
4 0.002 295.38153 nan -29782.879 ...
5 0.0025 295.32922 nan -29782.824 ...
If I take this same job, still using the GPU build of LAMMPS, and run it as a CPU-only job without Kokkos:
srun --cpu-bind=cores --gpu-bind=none lmp -in input.lammps
then pressures are calculated (although they are quite high):
LAMMPS stdout with NVT, no Kokkos
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.0005
Per MPI rank memory allocation (min/avg/max) = 3.539 | 3.545 | 3.551 Mbytes
Step Time Temp Press PotEng ...
0 0 298.15 4406.7395 -29784.805 ...
1 0.0005 299.15125 4408.8153 -29785.494 ...
2 0.001 300.23769 4425.8546 -29786.253 ...
3 0.0015 300.77332 4476.0844 -29786.639 ...
4 0.002 300.32951 4566.4221 -29786.356 ...
5 0.0025 298.96391 4682.4999 -29785.437 ...
6 0.003 297.23743 4794.6713 -29784.27 ...
Any advice on what to try? I have been using LAMMPS 02Aug23 since the folks at NERSC have used that version for their LAMMPS+pair_allegro installation, but is there a different LAMMPS release you recommend using?
The admins on my cluster are also going to fix CUDA 12.4, so I should be able to build against more recent CUDA and related libraries in the next few weeks.
Thanks!
Update 13 Sep 2024: The problem persists, with nan pressures, even when compiling the latest LAMMPS development branch (i.e., Git commit 2995cb7 from git clone --depth=1 https://github.com/lammps/lammps).
Update: I also tried running calculations using the LAMMPS + Kokkos + pair_allegro installation available on NERSC Perlmutter (i.e., the nersc/lammps_allegro:23.08 Shifter image, hash 3bd7ce2e78, built on 03 Oct 2024). This time, I am getting nan values for pressures (as opposed to the -nan values on my other cluster).
NERSC told me that this image was built using the following Dockerfile:
Dockerfile for NERSC `nersc/lammps_allegro:23.08` image
FROM nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu22.04
WORKDIR /opt
ENV DEBIAN_FRONTEND noninteractive
RUN \
apt-get update && \
apt-get install --yes \
build-essential \
autoconf \
cmake \
flex \
bison \
zlib1g-dev \
fftw-dev \
fftw3 \
apbs \
libicu-dev \
libbz2-dev \
libboost-all-dev \
libgmp-dev \
bc \
libblas-dev \
liblapack-dev \
libfftw3-dev \
automake \
lsb-core \
libxc-dev \
git \
unzip \
clang \
llvm \
gcc \
g++ \
libgsl-dev \
libhdf5-serial-dev \
cmake \
intel-mkl-full \
vim \
python3 \
python3-pip \
mlocate \
wget && \
apt-get clean all
ARG mpich=4.1.1
ARG mpich_prefix=mpich-$mpich
RUN \
wget https://www.mpich.org/static/downloads/$mpich/$mpich_prefix.tar.gz && \
tar xvzf $mpich_prefix.tar.gz && \
cd $mpich_prefix && \
./configure FFLAGS=-fallow-argument-mismatch FCFLAGS=-fallow-argument-mismatch && \
make -j 16 && \
make install && \
make clean && \
cd .. && \
rm -rf $mpich_prefix
RUN /sbin/ldconfig
ENV MPI_PATH=/opt/mpich/install
ENV PATH=$PATH:/opt/mpich/install/bin
ENV PATH=$PATH:/opt/mpich/install/include
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mpich/install/lib
#RUN which mpicc
#RUN env MPICC=/opt/mpich/install/bin/mpicc python3 -m pip install mpi4py
# Install miniconda
ENV installer=Miniconda3-py39_4.12.0-Linux-x86_64.sh
RUN wget https://repo.anaconda.com/miniconda/$installer && \
/bin/bash $installer -b -p /opt/miniconda3 && \
rm -rf $installer
ENV PATH=/opt/miniconda3/bin:$PATH
RUN pip install numpy scipy matplotlib setuptools
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
RUN pip install mkl mkl-devel mkl-static mkl-include
RUN pip install ninja
RUN pip install wandb
#Installing lammps
WORKDIR /opt
RUN cd /opt
RUN git clone -b stable_2Aug2023_update2 --depth 1 https://github.com/lammps/lammps.git lammps
RUN git clone -b multicut https://github.com/mir-group/pair_allegro.git pair_allegro
RUN cd /opt/pair_allegro && \
./patch_lammps.sh /opt/lammps
RUN apt-get install --yes clang-format xxd
RUN wget https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.0.0%2Bcu118.zip
RUN unzip libtorch-cxx11-abi-shared-with-deps-2.0.0+cu118.zip
RUN rm -rf libtorch-cxx11-abi-shared-with-deps-2.0.0+cu118.zip
RUN mv libtorch libtorch-gpu
ENV PATH=$PATH:/opt/libtorch-gpu/bin
ENV PATH=$PATH:/opt/libtorch-gpu/include
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/libtorch-gpu/lib
ENV TORCH_CUDA_ARCH_LIST="8.0 8.6 8.9 9.0"
ENV CONDA_PREFIX="/opt/miniconda3"
ENV PATH=$PATH:/opt/lammps/build/plumed_build-prefix/bin
ENV PATH=$PATH:/opt/lammps/build/plumed_build-prefix/include
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/lammps/build/plumed_build-prefix/lib
ENV PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/opt/lammps/build/plumed_build-prefix/lib/pkgconfig
ENV PLUMED_KERNEL=/opt/lammps/build/plumed_build-prefix/lib/libplumedKernel.so
WORKDIR /opt/lammps
RUN mkdir build
WORKDIR /opt/lammps/build
RUN cmake -DMKL_INCLUDE_DIR=$CONDA_PREFIX/include -DMKL_LIBRARY=$CONDA_PREFIX/lib -D CMAKE_BUILD_TYPE=Release \
-D CMAKE_PREFIX_PATH=/opt/libtorch-gpu \
-D CMAKE_INSTALL_PREFIX=/opt/lammps/install -D CMAKE_CXX_STANDARD=17 -D CMAKE_CXX_STANDARD_REQUIRED=ON \
-D BUILD_MPI=ON -D CMAKE_CXX_COMPILER=/opt/lammps/lib/kokkos/bin/nvcc_wrapper -D BUILD_SHARED_LIBS=ON \
-D PKG_MANYBODY=ON -D PKG_MOLECULE=ON -D PKG_KSPACE=ON -D PKG_REPLICA=ON -D PKG_REAXFF=ON -D PKG_QEQ=ON \
-D PKG_PHONON=ON -D PKG_ELECTRODE=yes -D PKG_PLUMED=yes -D DOWNLOAD_PLUMED=yes -D PLUMED_MODE=shared \
-D BUILD_SHARED_LIBS=ON -D PKG_KOKKOS=yes -D Kokkos_ARCH_AMPERE80=ON -D Kokkos_ENABLE_CUDA=yes \
-D CMAKE_PREFIX_PATH=/opt/libtorch-gpu ../cmake
RUN make -j 4
RUN make install
ENV PATH=/opt/lammps/install/bin:$PATH
ENV PATH=/opt/lammps/install/lib:$PATH
ENV PATH=/opt/lammps/install/include:$PATH
ENV LD_LIBRARY_PATH=/opt/lammps/install/lib:$LD_LIBRARY_PATH
I am again finding that pressures are not correctly calculated when LAMMPS + pair_allegro is invoked with Kokkos flags.
Slurm script to use containerized LAMMPS on NERSC with Kokkos
#!/bin/bash
#SBATCH --image docker:nersc/lammps_allegro:23.08
#SBATCH --job-name=pressure-test
#SBATCH --account=mXXXX
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --constraint=gpu
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=none
#SBATCH --time=5:00
#SBATCH --error=vt_lammps%j.err
#SBATCH --output=vt_lammps%j.out
#SBATCH [email protected]
#SBATCH --mail-type=ALL
#
#SBATCH --open-mode=append
# OpenMP parallelization
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
# Ensure that stack size is unlimited, or you may get a segfault error when
# attempting to run a MPI job.
ulimit -s unlimited
ulimit -S unlimited
ulimit -H unlimited
# Run LAMMPS
exe="lmp"
input='-k on g 4 -sf kk -pk kokkos newton on neigh full -in input.lammps'
srun --cpu-bind=cores --gpu-bind=none --module mpich,gpu shifter $exe $input
Under NVT conditions when running with Kokkos, the thermo output shows nan pressures like this:
Step Time Temp Press PotEng
0 0 298.15 nan -29472.012
1 0.0005 297.55398 nan -29471.599
2 0.001 298.33164 nan -29472.145
3 0.0015 300.09509 nan -29473.349
4 0.002 302.16364 nan -29474.772
and trying to do NPT with Kokkos results in the same error about non-numeric pressure:
Neighbor list info ...
update: every = 1 steps, delay = 0 steps, check = yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 6
ghost atom cutoff = 10
binsize = 10, bins = 4 4 4
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair allegro/kk, perpetual
attributes: full, newton on, kokkos_device
pair build: full/bin/kk/device
stencil: full/bin/3d
bin: kk/device
rebalancing time: 0.002 seconds
iteration count = 26
initial/final maximal load/proc = 1369 1329
initial/final imbalance factor = 1.0563272 1.025463
x cuts: 0 1
y cuts: 0 0.4954834 1
z cuts: 0 0.49346924 1
run 20000
ERROR: Non-numeric pressure - simulation unstable (src/fix_nh.cpp:1049)
Last command: run 20000
However, when running LAMMPS without Kokkos, system pressures are calculated and the NPT simulation completes without issue. It appears that GPU-accelerated (Kokkos) NPT runs are not currently possible with this image on Perlmutter.
Slurm script to use containerized LAMMPS on NERSC without Kokkos
#!/bin/bash
#SBATCH --image docker:nersc/lammps_allegro:23.08
#SBATCH --job-name=pressure-test
#SBATCH --account=mXXXX
#SBATCH --qos=regular
#SBATCH --nodes=1
#SBATCH --constraint=cpu
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --time=2:00:00
#SBATCH --error=vt_lammps%j.err
#SBATCH --output=vt_lammps%j.out
#SBATCH [email protected]
#SBATCH --mail-type=ALL
#
#SBATCH --open-mode=append
# OpenMP parallelization
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
# Ensure that stack size is unlimited, or you may get a segfault error when
# attempting to run a MPI job.
ulimit -s unlimited
ulimit -S unlimited
ulimit -H unlimited
# Run LAMMPS
exe="lmp"
input='-in input.lammps'
srun --cpu-bind=cores --gpu-bind=none --module mpich shifter $exe $input
It appears that you used Perlmutter GPU nodes for the scaling experiments in the SC'23 conference paper and A100 GPUs for the Allegro scaling tests in the Nat. Comm. paper. Were the LAMMPS + pair_allegro builds for those experiments done before mir-allegro and pair_allegro gained support for computing the virial tensor? Or was there a specific build configuration you used on Perlmutter to get virial tensor computation, and thus NPT support?
Thanks!
Sorry for taking so long to get to this. The recently released version changes "everything", so this problem is hopefully no longer relevant - reopen if you run into it again.