pair_allegro
allegro_pair style and empty partitions
Hello all,
Very happy with the tools - thank you for maintaining them and integrating them with LAMMPS. I am running some vacuum simulations, and while increasing the number of GPUs (and MPI ranks) I ran into the following issue:
terminate called after throwing an instance of 'std::runtime_error'
what(): The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/nequip/nn/_grad_output.py", line 32, in forward
_6 = torch.append(wrt_tensors, data[k])
func0 = self.func
data0 = (func0).forward(data, )
~~~~~~~~~~~~~~ <--- HERE
of = self.of
_7 = [torch.sum(data0[of])]
File "code/__torch__/nequip/nn/_graph_mixin.py", line 30, in AD_logsumexp_backward
input1 = (radial_basis).forward(input0, )
input2 = (spharm).forward(input1, )
input3 = (allegro).forward(input2, )
~~~~~~~~~~~~~~~~ <--- HERE
input4 = (edge_eng).forward(input3, )
input5 = (edge_eng_sum).forward(input4, )
File "code/__torch__/allegro/nn/_allegro.py", line 168, in mean_0
_n_scalar_outs = self._n_scalar_outs
_38 = torch.slice(_37, 2, None, _n_scalar_outs[0])
scalars = torch.reshape(_38, [(torch.size(features1))[0], -1])
~~~~~~~~~~~~~ <--- HERE
features2 = (_04).forward(features1, )
_39 = annotate(List[Optional[Tensor]], [active_edges0])
Traceback of TorchScript, original code (most recent call last):
File "/home/bepnickc/miniconda3/envs/allegro_and_lammps/lib/python3.9/site-packages/nequip/nn/_grad_output.py", line 84, in forward
wrt_tensors.append(data[k])
# run func
data = self.func(data)
~~~~~~~~~ <--- HERE
# Get grads
grads = torch.autograd.grad(
File "/home/bepnickc/miniconda3/envs/allegro_and_lammps/lib/python3.9/site-packages/nequip/nn/_graph_mixin.py", line 366, in AD_logsumexp_backward
def forward(self, input: AtomicDataDict.Type) -> AtomicDataDict.Type:
for module in self:
input = module(input)
~~~~~~ <--- HERE
return input
File "/home/bepnickc/miniconda3/envs/allegro_and_lammps/lib/python3.9/site-packages/allegro/nn/_allegro.py", line 585, in mean_0
# features has shape [z][mul][k]
# we know scalars are first
scalars = features[:, :, : self._n_scalar_outs[layer_index]].reshape(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
features.shape[0], -1
)
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous
It seems to be related to the way LAMMPS partitions the simulation box as the number of processes increases:
https://docs.lammps.org/Developer_par_part.html
If the rank/partition running the model contains no atoms, the network potential cannot accept the resulting zero-size tensor. Is this the intended behaviour of Allegro models in LAMMPS?
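For illustration, the same error can be triggered outside LAMMPS by feeding the slice-and-flatten pattern from the traceback a zero-edge tensor (a minimal sketch; the shapes and channel count are placeholders, not values from the model):

```python
import torch

# Hypothetical per-edge feature tensor with shape [n_edges, multiplicity, channels];
# a rank whose subdomain contains no atoms contributes n_edges == 0.
features = torch.zeros(0, 64, 9)
n_scalar_outs = 3  # placeholder number of scalar channels

try:
    # Same pattern as the line flagged in the traceback (allegro/nn/_allegro.py).
    scalars = features[:, :, :n_scalar_outs].reshape(features.shape[0], -1)
except RuntimeError as err:
    print(err)  # cannot reshape tensor of 0 elements into shape [0, -1] ...
```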
Happy to run some more tests/provide more information if needed.
Hi,
This is certainly not the intended behavior and is a bug that we'll fix. I think it should already work if using Kokkos.
You may, however, also want to avoid having completely empty domains in your simulations. LAMMPS can resize domains using balance (statically) or fix balance (periodically); see the LAMMPS documentation. If you're brave, you can try this in combination with comm_style tiled.
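For reference, a minimal sketch of those commands in a LAMMPS input (the thresholds, intervals, and fix ID are placeholder values; see the balance and fix balance documentation for details):

```
# One-time rebalance of the domain cuts along x, y, and z
balance 1.1 shift xyz 10 1.0

# Periodic rebalancing every 1000 steps during the run
fix loadbal all balance 1000 1.1 shift xyz 10 1.0

# For non-rectangular subdomains, recursive coordinate bisection with tiled comm:
# comm_style tiled
# fix loadbal all balance 1000 1.1 rcb
```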
Thanks for the response! Of course, it was just something I noticed that happened rarely in a vacuum simulation :) . I will check out those LAMMPS options.
I'm experiencing this behavior with the latest multicut branch of pair_allegro on CPU LAMMPS 2 August 2023, Update 3. I did a couple of epochs of training on the aspirin dataset (https://github.com/mir-group/allegro/blob/main/configs/example.yaml) and was trying to test that force field on a geometry built from the first frame of the aspirin dataset.
Even when running with a single MPI process (which I would think would prevent empty domains), I'm still getting this "cannot reshape tensor" error. Any ideas on what might be happening?
Packages used to build pair_allegro
- slurm (pmi)
- scl/gcc-toolset-10 (GCC 10 compilers)
- intel/mkl/2024.1 (to be able to compile against libtorch, but not used for FFTs)
- intel/compiler-rt/2024.1.0 (also seem to be needed to use libtorch)
- mpich/gnu/3.3.2 (known working MPI implementation)
- libtorch/2.0.0+cpu-cxx11-abi (known working version of libtorch on NERSC Perlmutter's lammps_allegro image)
- amd/aocl/gcc/4.0 (Using AOCL-provided FFTW3 libraries for FFTs, built with GCC)
cmake command to build LAMMPS after patching with pair_allegro "multicut" branch
cmake ../cmake \
-D CMAKE_BUILD_TYPE=Debug \
-D LAMMPS_EXCEPTIONS=ON \
-D BUILD_SHARED_LIBS=ON \
-D BUILD_MPI=yes \
-D BUILD_OMP=yes \
-C ../cmake/presets/kokkos-openmp.cmake \
-D PKG_KOKKOS=yes \
-D Kokkos_ARCH_ZEN3=yes \
-D BUILD_TOOLS=no \
-D FFT=FFTW3 \
-D FFT_KOKKOS=FFT3W \
-D FFTW3_INCLUDE_DIR=$AOCL_ROOT/include \
-D FFTW3_LIBRARY=$AOCL_LIB/libfftw3.so \
-D FFTW3_OMP_LIBRARY=$AOCL_LIB/libfftw3_omp.so \
-D CMAKE_INSTALL_PREFIX="$LAMMPS_ROOT" \
-D PKG_MANYBODY=yes \
-D PKG_MOLECULE=yes \
-D PKG_KSPACE=yes \
-D PKG_REPLICA=yes \
-D PKG_ASPHERE=yes \
-D PKG_RIGID=yes \
-D PKG_MPIIO=yes \
-D PKG_COMPRESS=yes \
-D PKG_H5MD=no \
-D PKG_OPENMP=yes \
-D CMAKE_POSITION_INDEPENDENT_CODE=yes \
-D CMAKE_EXE_FLAGS="-dynamic" \
-D FFT_FFTW_THREADS=on
$LAMMPS_ROOT is the custom prefix where I run make install for LAMMPS after compilation. $AOCL_ROOT and $AOCL_LIB are the installation and library locations of AOCL 4.0 on my system.
Input geometry (aspirin-with-topo.data)
LAMMPS data file. CGCMM style. atom_style full generated by VMD/TopoTools v1.8 on Wed Jul 10 16:22:01 -0400 2024
21 atoms
21 bonds
0 angles
0 dihedrals
0 impropers
3 atom types
1 bond types
0 angle types
0 dihedral types
0 improper types
-5.163345 4.836655 xlo xhi
-5.155887 4.844113 ylo yhi
-5.014088 4.985912 zlo zhi
# Pair Coeffs
#
# 1 C
# 2 H
# 3 O
# Bond Coeffs
#
# 1
Masses
1 12.010700 # C
2 1.007940 # H
3 15.999400 # O
Atoms # full
1 1 1 0.000000 2.134489 -0.984361 -0.195218 # C
2 1 1 0.000000 0.762644 0.959414 -1.679929 # C
3 1 1 0.000000 2.660345 -0.407926 -1.307304 # C
4 1 1 0.000000 1.910317 0.393966 -2.147020 # C
5 1 1 0.000000 -3.030190 1.495405 0.719662 # C
6 1 1 0.000000 0.849425 -0.550871 0.284375 # C
7 1 1 0.000000 0.238447 0.473506 -0.404422 # C
8 1 3 0.000000 0.897896 -2.276432 1.730061 # O
9 1 3 0.000000 -2.383452 0.417779 -1.462857 # O
10 1 3 0.000000 -0.476201 -0.529087 2.339259 # O
11 1 1 0.000000 0.392992 -1.190237 1.537982 # C
12 1 1 0.000000 -2.122985 0.951760 -0.397705 # C
13 1 3 0.000000 -0.804666 1.286246 0.110509 # O
14 1 2 0.000000 -0.493803 -1.186792 3.095974 # H
15 1 2 0.000000 2.554735 -1.802497 0.392131 # H
16 1 2 0.000000 0.330690 1.855711 -2.345264 # H
17 1 2 0.000000 3.803794 -0.493719 -1.456203 # H
18 1 2 0.000000 2.231141 0.557186 -3.124150 # H
19 1 2 0.000000 -2.708930 2.484658 0.926928 # H
20 1 2 0.000000 -4.130483 1.482167 0.431266 # H
21 1 2 0.000000 -2.874148 1.003209 1.699485 # H
Bonds
1 1 1 3
2 1 1 15
3 1 1 6
4 1 2 4
5 1 2 16
6 1 2 7
7 1 3 4
8 1 3 17
9 1 4 18
10 1 5 12
11 1 5 20
12 1 5 19
13 1 5 21
14 1 6 7
15 1 6 11
16 1 7 13
17 1 8 11
18 1 9 12
19 1 10 11
20 1 10 14
21 1 12 13
LAMMPS input file (input.lammps)
# PART A - ENERGY MINIMIZATION
# 1) Initialization
units metal
dimension 3
atom_style full
boundary p p p
# 2) System definition
read_data aspirin-with-topo.data
# 3) Simulation settings
pair_style allegro3232 # Accidentally trained with float32 dtypes
pair_coeff * * aspirin-model_11-jul-2024.pth C H O
# 4) Visualization
thermo 1
thermo_style custom step time temp pe ke etotal epair ebond econserve fmax
# Also want to dump the CG minimization trajectory
dump mintraj all atom 1 minimization.lammpstrj
# 5) Run
minimize 1.0e-4 1.0e-6 1000 10000
undump mintraj
# PART B - MOLECULAR DYNAMICS
delete_atoms overlap 0.1 all all
# Logging
thermo 10
# Try to rebuild neighbor lists more often
neigh_modify every 1 delay 0 check yes
# Run MD
fix mynve all nve
fix mylgv all langevin 1.0 1.0 0.1 1530917
# Be sure to dump the MD trajectory
dump mdtraj all atom 1 mdtraj.lammpstrj
dump mdforces all custom 1 mdforces.lammpstrj x y z vx vy vz fx fy fz
timestep 0.5
run 1000
undump mdtraj
Command to run LAMMPS
srun lmp -k on -sf kk -pk kokkos neigh full -in input.lammps
LAMMPS output
LAMMPS (2 Aug 2023 - Update 3)
KOKKOS mode with Kokkos version 3.7.2 is enabled (src/KOKKOS/kokkos.cpp:108)
using 1 OpenMP thread(s) per MPI task
package kokkos
package kokkos neigh full
# PART A - ENERGY MINIMIZATION
# 1) Initialization
units metal
dimension 3
atom_style full
boundary p p p
# 2) System definition
read_data aspirin-with-topo.data
Reading data file ...
orthogonal box = (-5.163345 -5.155887 -5.014088) to (4.836655 4.844113 4.985912)
1 by 1 by 1 MPI processor grid
reading atoms ...
21 atoms
scanning bonds ...
4 = max bonds/atom
reading bonds ...
21 bonds
Finding 1-2 1-3 1-4 neighbors ...
special bond factors lj: 0 0 0
special bond factors coul: 0 0 0
4 = max # of 1-2 neighbors
6 = max # of 1-3 neighbors
13 = max # of 1-4 neighbors
15 = max # of special neighbors
special bonds CPU = 0.004 seconds
read_data CPU = 0.018 seconds
# 3) Simulation settings
pair_style allegro3232
pair_coeff * * aspirin-model_11-jul-2024.pth C H O
# 4) Visualization
thermo 1
thermo_style custom step time temp pe ke etotal epair ebond econserve fmax
# Also want to dump the CG minimization trajectory
dump mintraj all atom 1 minimization.lammpstrj
# 5) Run
minimize 1.0e-4 1.0e-6 1000 10000
WARNING: Using a manybody potential with bonds/angles/dihedrals and special_bond exclusions (src/pair.cpp:242)
WARNING: Bonds are defined but no bond style is set (src/force.cpp:193)
WARNING: Likewise 1-2 special neighbor interactions != 1.0 (src/force.cpp:195)
Neighbor list info ...
update: every = 1 steps, delay = 0 steps, check = yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 8
ghost atom cutoff = 8
binsize = 4, bins = 3 3 3
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair allegro3232/kk, perpetual
attributes: full, newton on, kokkos_device
pair build: full/bin/kk/device
stencil: full/bin/3d
bin: kk/device
Per MPI rank memory allocation (min/avg/max) = 8.928 | 8.928 | 8.928 Mbytes
Step Time Temp PotEng KinEng TotEng E_pair E_bond Econserve Fmax
0 0 0 -405675.22 0 -405675.22 -405675.22 0 -405675.22 13.409355
1 0.001 0 -405679.39 0 -405679.39 -405679.39 0 -405679.39 6.1880095
Loop time of 4.11094 on 1 procs for 1 steps with 21 atoms
96.3% CPU use with 1 MPI tasks x 1 OpenMP threads
Minimization stats:
Stopping criterion = energy tolerance
Energy initial, next-to-last, final =
-405675.21875 -405675.21875 -405679.388671875
Force two-norm initial, final = 27.207462 16.492158
Force max component initial, final = 13.409355 6.1880095
Final line search alpha, max atom move = 0.0074574801 0.046146958
Iterations, force evaluations = 1 1
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 4.1102 | 4.1102 | 4.1102 | 0.0 | 99.98
Bond | 7.06e-06 | 7.06e-06 | 7.06e-06 | 0.0 | 0.00
Neigh | 0 | 0 | 0 | 0.0 | 0.00
Comm | 4.4729e-05 | 4.4729e-05 | 4.4729e-05 | 0.0 | 0.00
Output | 0 | 0 | 0 | 0.0 | 0.00
Modify | 0 | 0 | 0 | 0.0 | 0.00
Other | | 0.0006423 | | | 0.02
Nlocal: 21 ave 21 max 21 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost: 510 ave 510 max 510 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs: 0 ave 0 max 0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs: 596 ave 596 max 596 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Total # of neighbors = 596
Ave neighs/atom = 28.380952
Ave special neighs/atom = 8.5714286
Neighbor list builds = 0
Dangerous builds = 0
LAMMPS stderr
Exception: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/nequip/nn/_graph_model.py", line 29, in forward
pass
model = self.model
return (model).forward(new_data, )
~~~~~~~~~~~~~~ <--- HERE
File "code/__torch__/nequip/nn/_rescale.py", line 21, in batch_norm
data: Dict[str, Tensor]) -> Dict[str, Tensor]:
model = self.model
data0 = (model).forward(data, )
~~~~~~~~~~~~~~ <--- HERE
training = self.training
if training:
File "code/__torch__/nequip/nn/_grad_output.py", line 71, in layer_norm
pass
func = self.func
data0 = (func).forward(data, )
~~~~~~~~~~~~~ <--- HERE
_17 = [torch.sum(data0["total_energy"])]
_18 = [pos, data0["_displacement"]]
File "code/__torch__/nequip/nn/_graph_mixin.py", line 28, in AD_sum_backward
input1 = (radial_basis).forward(input0, )
input2 = (spharm).forward(input1, )
input3 = (allegro).forward(input2, )
~~~~~~~~~~~~~~~~ <--- HERE
input4 = (edge_eng).forward(input3, )
input5 = (edge_eng_sum).forward(input4, )
File "code/__torch__/allegro/nn/_allegro.py", line 168, in mean_0
_n_scalar_outs = self._n_scalar_outs
_38 = torch.slice(_37, 2, None, _n_scalar_outs[0])
scalars = torch.reshape(_38, [(torch.size(features1))[0], -1])
~~~~~~~~~~~~~ <--- HERE
features2 = (_04).forward(features1, )
_39 = annotate(List[Optional[Tensor]], [active_edges0])
Traceback of TorchScript, original code (most recent call last):
File "<path_to_nequip_conda_env>/lib/python3.10/site-packages/nequip/nn/_graph_model.py", line 112, in forward
new_data[k] = v
# run the model
data = self.model(new_data)
~~~~~~~~~~ <--- HERE
return data
File "<path_to_nequip_conda_env>/lib/python3.10/site-packages/nequip/nn/_rescale.py", line 144, in batch_norm
def forward(self, data: AtomicDataDict.Type) -> AtomicDataDict.Type:
data = self.model(data)
~~~~~~~~~~ <--- HERE
if self.training:
# no scaling, but still need to promote for consistent dtype behavior
File "<path_to_nequip_conda_env>/lib/python3.10/site-packages/nequip/nn/_grad_output.py", line 305, in layer_norm
# Call model and get gradients
data = self.func(data)
~~~~~~~~~ <--- HERE
grads = torch.autograd.grad(
File "<path_to_nequip_conda_env>/lib/python3.10/site-packages/nequip/nn/_graph_mixin.py", line 366, in AD_sum_backward
def forward(self, input: AtomicDataDict.Type) -> AtomicDataDict.Type:
for module in self:
input = module(input)
~~~~~~ <--- HERE
return input
File "<path_to_nequip_conda_env>/lib/python3.10/site-packages/allegro/nn/_allegro.py", line 585, in mean_0
# features has shape [z][mul][k]
# we know scalars are first
scalars = features[:, :, : self._n_scalar_outs[layer_index]].reshape(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
features.shape[0], -1
)
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
slurmstepd: error: *** STEP 3194616.0 ON <nodeid> CANCELLED AT 2024-07-11T20:47:53 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: <nodeid>: task 0: Killed
Hi @samueldyoung29ctr,
This should be fixed now, but on main: we've since cleaned up and merged everything down. If you can confirm, I will close this issue. Thanks!
@Linux-cpp-lisp, thanks for the tip. I think I actually had a bad timestep and/or starting geometry. After fixing those, my simple Allegro model appears to evaluate with pair_allegro compiled from both the multicut and main branches.
Great! Glad it's resolved. (You can probably stay on main then, since it is the latest.)
Hi @Linux-cpp-lisp, I'm running into this error again and am looking for some more guidance. This time, I tried a simpler system of water in a box. I trained an Allegro model on this dataset (dataset_1593.xyz, which I converted to ASE .traj format) from Cheng et al., "Ab initio thermodynamics of liquid and solid water". I used a ~90/10 train/validation split, l_max = 2, and altered some of the default numbers of MLP layers. This dataset has some water molecules outside of the simulation box, so I trained both with and without first wrapping the atoms into the box.
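For reference, a minimal ASE sketch of that kind of conversion and (optional) wrapping step (file names here are placeholders, not the exact commands used):

```python
from ase.io import read, write

# Placeholder file names; dataset_1593.xyz is the Cheng et al. water dataset.
frames = read("dataset_1593.xyz", index=":")

# Optionally wrap all atoms back into the periodic box before training.
for atoms in frames:
    atoms.wrap()

write("data.traj", frames)
```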
NequIP configuration for unwrapped dataset.
The configuration for the wrapped dataset differs only in the dataset_file_name and run_name, which reflect the wrapped version of the dataset I used.
BesselBasis_trainable: false
PolynomialCutoff_p: 48
append: true
ase_args:
format: traj
avg_num_neighbors: auto
batch_size: 1
chemical_symbols:
- H
- O
dataset: ase
dataset_file_name: ./data.traj
dataset_seed: 123456
default_dtype: float32
early_stopping_lower_bounds:
LR: 1.0e-05
early_stopping_patiences:
validation_loss: 1000
early_stopping_upper_bounds:
cumulative_wall: 604800.0
edge_eng_mlp_initialization: uniform
edge_eng_mlp_latent_dimensions:
- 32
edge_eng_mlp_nonlinearity: null
ema_decay: 0.99
ema_use_num_updates: true
embed_initial_edge: true
env_embed_mlp_initialization: uniform
env_embed_mlp_latent_dimensions: []
env_embed_mlp_nonlinearity: null
env_embed_multiplicity: 64
l_max: 2
latent_mlp_initialization: uniform
latent_mlp_latent_dimensions:
- 64
- 64
- 64
- 64
latent_mlp_nonlinearity: silu
latent_resnet: true
learning_rate: 0.001
loss_coeffs:
forces: 1.0
total_energy:
- 1.0
- PerAtomMSELoss
lr_scheduler_factor: 0.5
lr_scheduler_name: ReduceLROnPlateau
lr_scheduler_patience: 25
max_epochs: 1000000
metrics_key: validation_loss
model_builders:
- allegro.model.Allegro
- PerSpeciesRescale
- StressForceOutput
- RescaleEnergyEtc
n_train: 1434
n_val: 159
num_layers: 1
optimizer_name: Adam
optimizer_params:
amsgrad: false
betas: !!python/tuple
- 0.9
- 0.999
eps: 1.0e-08
weight_decay: 0.0
parity: o3_full
r_max: 4.0
root: <root name>
run_name: <run name>
seed: 123456
shuffle: true
train_val_split: random
two_body_latent_mlp_initialization: uniform
two_body_latent_mlp_latent_dimensions:
- 32
- 64
two_body_latent_mlp_nonlinearity: silu
use_ema: true
verbose: debug
wandb: true
wandb_project: <project name>
The Conda environment I am using to train has NequIP 0.6.0 and Allegro symlinked as a development package at commit 22f673c.
Conda environment used to train Allegro model
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_kmp_llvm conda-forge
appdirs 1.4.4 pyh9f0ad1d_0 conda-forge
ase 3.23.0 pyhd8ed1ab_0 conda-forge
asttokens 2.4.1 pyhd8ed1ab_0 conda-forge
blinker 1.8.2 pyhd8ed1ab_0 conda-forge
brotli 1.1.0 hd590300_1 conda-forge
brotli-bin 1.1.0 hd590300_1 conda-forge
brotli-python 1.1.0 py310hc6cd4ac_1 conda-forge
bzip2 1.0.8 h4bc722e_7 conda-forge
ca-certificates 2024.7.4 hbcca054_0 conda-forge
certifi 2024.7.4 pyhd8ed1ab_0 conda-forge
cffi 1.16.0 py310h2fee648_0 conda-forge
charset-normalizer 3.3.2 pyhd8ed1ab_0 conda-forge
click 8.1.7 unix_pyh707e725_0 conda-forge
colorama 0.4.6 pyhd8ed1ab_0 conda-forge
comm 0.2.2 pyhd8ed1ab_0 conda-forge
contourpy 1.2.1 py310hd41b1e2_0 conda-forge
cuda-version 11.8 h70ddcb2_3 conda-forge
cudatoolkit 11.8.0 h4ba93d1_13 conda-forge
cudnn 8.9.7.29 hbc23b4c_3 conda-forge
cycler 0.12.1 pyhd8ed1ab_0 conda-forge
debugpy 1.8.2 py310h76e45a6_0 conda-forge
decorator 5.1.1 pyhd8ed1ab_0 conda-forge
docker-pycreds 0.4.0 py_0 conda-forge
e3nn 0.4.4 pyhd8ed1ab_1 conda-forge
exceptiongroup 1.2.0 pyhd8ed1ab_2 conda-forge
executing 2.0.1 pyhd8ed1ab_0 conda-forge
flask 3.0.3 pyhd8ed1ab_0 conda-forge
fonttools 4.53.0 py310hc51659f_0 conda-forge
freetype 2.12.1 h267a509_2 conda-forge
gitdb 4.0.11 pyhd8ed1ab_0 conda-forge
gitpython 3.1.43 pyhd8ed1ab_0 conda-forge
gmp 6.3.0 hac33072_2 conda-forge
gmpy2 2.1.5 py310hc7909c9_1 conda-forge
h2 4.1.0 pyhd8ed1ab_0 conda-forge
hpack 4.0.0 pyh9f0ad1d_0 conda-forge
hyperframe 6.0.1 pyhd8ed1ab_0 conda-forge
icu 73.2 h59595ed_0 conda-forge
idna 3.7 pyhd8ed1ab_0 conda-forge
importlib-metadata 8.0.0 pyha770c72_0 conda-forge
importlib_metadata 8.0.0 hd8ed1ab_0 conda-forge
ipykernel 6.29.5 pyh3099207_0 conda-forge
ipython 8.26.0 pyh707e725_0 conda-forge
itsdangerous 2.2.0 pyhd8ed1ab_0 conda-forge
jedi 0.19.1 pyhd8ed1ab_0 conda-forge
jinja2 3.1.4 pyhd8ed1ab_0 conda-forge
joblib 1.4.2 pyhd8ed1ab_0 conda-forge
jupyter_client 8.6.2 pyhd8ed1ab_0 conda-forge
jupyter_core 5.7.2 py310hff52083_0 conda-forge
keyutils 1.6.1 h166bdaf_0 conda-forge
kiwisolver 1.4.5 py310hd41b1e2_1 conda-forge
krb5 1.21.3 h659f571_0 conda-forge
lcms2 2.16 hb7c19ff_0 conda-forge
ld_impl_linux-64 2.40 hf3520f5_7 conda-forge
lerc 4.0.0 h27087fc_0 conda-forge
libblas 3.9.0 16_linux64_mkl conda-forge
libbrotlicommon 1.1.0 hd590300_1 conda-forge
libbrotlidec 1.1.0 hd590300_1 conda-forge
libbrotlienc 1.1.0 hd590300_1 conda-forge
libcblas 3.9.0 16_linux64_mkl conda-forge
libdeflate 1.20 hd590300_0 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 14.1.0 h77fa898_0 conda-forge
libgfortran-ng 14.1.0 h69a702a_0 conda-forge
libgfortran5 14.1.0 hc5f4f2c_0 conda-forge
libhwloc 2.11.0 default_h5622ce7_1000 conda-forge
libiconv 1.17 hd590300_2 conda-forge
libjpeg-turbo 3.0.0 hd590300_1 conda-forge
liblapack 3.9.0 16_linux64_mkl conda-forge
libnsl 2.0.1 hd590300_0 conda-forge
libopenblas 0.3.27 pthreads_hac2b453_1 conda-forge
libpng 1.6.43 h2797004_0 conda-forge
libprotobuf 3.20.3 h3eb15da_0 conda-forge
libsodium 1.0.18 h36c2ea0_1 conda-forge
libsqlite 3.46.0 hde9e2c9_0 conda-forge
libstdcxx-ng 14.1.0 hc0a3c3a_0 conda-forge
libtiff 4.6.0 h1dd3fc0_3 conda-forge
libuuid 2.38.1 h0b41bf4_0 conda-forge
libwebp-base 1.4.0 hd590300_0 conda-forge
libxcb 1.16 hd590300_0 conda-forge
libxcrypt 4.4.36 hd590300_1 conda-forge
libxml2 2.12.7 h4c95cb1_3 conda-forge
libzlib 1.3.1 h4ab18f5_1 conda-forge
llvm-openmp 18.1.8 hf5423f3_0 conda-forge
magma 2.5.4 hc72dce7_4 conda-forge
markupsafe 2.1.5 py310h2372a71_0 conda-forge
matplotlib-base 3.8.4 py310hef631a5_2 conda-forge
matplotlib-inline 0.1.7 pyhd8ed1ab_0 conda-forge
mir-allegro 0.2.0 dev_0 <develop>
mkl 2022.2.1 h6508926_16999 conda-forge
mpc 1.3.1 hfe3b2da_0 conda-forge
mpfr 4.2.1 h9458935_1 conda-forge
mpmath 1.3.0 pyhd8ed1ab_0 conda-forge
munkres 1.1.4 pyh9f0ad1d_0 conda-forge
nccl 2.22.3.1 hee583db_0 conda-forge
ncurses 6.5 h59595ed_0 conda-forge
nequip 0.6.0 dev_0 <develop>
nest-asyncio 1.6.0 pyhd8ed1ab_0 conda-forge
ninja 1.12.1 h297d8ca_0 conda-forge
numpy 1.26.4 py310hb13e2d6_0 conda-forge
openjpeg 2.5.2 h488ebb8_0 conda-forge
openssl 3.3.1 h4ab18f5_1 conda-forge
opt-einsum 3.3.0 hd8ed1ab_2 conda-forge
opt_einsum 3.3.0 pyhc1e730c_2 conda-forge
opt_einsum_fx 0.1.4 pyhd8ed1ab_0 conda-forge
packaging 24.1 pyhd8ed1ab_0 conda-forge
parso 0.8.4 pyhd8ed1ab_0 conda-forge
pathtools 0.1.2 py_1 conda-forge
pexpect 4.9.0 pyhd8ed1ab_0 conda-forge
pickleshare 0.7.5 py_1003 conda-forge
pillow 10.4.0 py310hebfe307_0 conda-forge
pip 24.0 pyhd8ed1ab_0 conda-forge
platformdirs 4.2.2 pyhd8ed1ab_0 conda-forge
prompt-toolkit 3.0.47 pyha770c72_0 conda-forge
protobuf 3.20.3 py310heca2aa9_1 conda-forge
psutil 6.0.0 py310hc51659f_0 conda-forge
pthread-stubs 0.4 h36c2ea0_1001 conda-forge
ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge
pure_eval 0.2.2 pyhd8ed1ab_0 conda-forge
pycparser 2.22 pyhd8ed1ab_0 conda-forge
pygments 2.18.0 pyhd8ed1ab_0 conda-forge
pyparsing 3.1.2 pyhd8ed1ab_0 conda-forge
pysocks 1.7.1 py310hff52083_5 conda-forge
python 3.10.14 hd12c33a_0_cpython conda-forge
python-dateutil 2.9.0post0 py310h06a4308_2
python_abi 3.10 4_cp310 conda-forge
pytorch 1.11.0 cuda112py310h51fe464_202 conda-forge
pyyaml 6.0.1 py310h2372a71_1 conda-forge
pyzmq 26.0.3 py310h6883aea_0 conda-forge
readline 8.2 h8228510_1 conda-forge
requests 2.32.3 pyhd8ed1ab_0 conda-forge
scikit-learn 1.0.1 py310h1246948_3 conda-forge
scipy 1.14.0 py310h93e2701_1 conda-forge
sentry-sdk 2.10.0 pyhd8ed1ab_0 conda-forge
setproctitle 1.3.3 py310h2372a71_0 conda-forge
setuptools 70.1.1 pyhd8ed1ab_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
sleef 3.5.1 h9b69904_2 conda-forge
smmap 5.0.0 pyhd8ed1ab_0 conda-forge
stack_data 0.6.2 pyhd8ed1ab_0 conda-forge
sympy 1.12.1 pypyh2585a3b_103 conda-forge
tbb 2021.12.0 h434a139_2 conda-forge
threadpoolctl 3.5.0 pyhc1e730c_0 conda-forge
tk 8.6.13 noxft_h4845f30_101 conda-forge
torch-ema 0.3 pyhd8ed1ab_0 conda-forge
torch-runstats 0.2.0 pyhd8ed1ab_0 conda-forge
tornado 6.4.1 py310hc51659f_0 conda-forge
tqdm 4.66.4 pyhd8ed1ab_0 conda-forge
traitlets 5.14.3 pyhd8ed1ab_0 conda-forge
typing-extensions 4.12.2 hd8ed1ab_0 conda-forge
typing_extensions 4.12.2 pyha770c72_0 conda-forge
tzdata 2024a h0c530f3_0 conda-forge
unicodedata2 15.1.0 py310h2372a71_0 conda-forge
urllib3 2.2.2 pyhd8ed1ab_1 conda-forge
wandb 0.16.6 pyhd8ed1ab_0 conda-forge
wcwidth 0.2.13 pyhd8ed1ab_0 conda-forge
werkzeug 3.0.3 pyhd8ed1ab_0 conda-forge
wheel 0.43.0 pyhd8ed1ab_1 conda-forge
xorg-libxau 1.0.11 hd590300_0 conda-forge
xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge
xz 5.2.6 h166bdaf_0 conda-forge
yaml 0.2.5 h7f98852_2 conda-forge
zeromq 4.3.5 h75354e8_4 conda-forge
zipp 3.19.2 pyhd8ed1ab_0 conda-forge
zstandard 0.23.0 py310h64cae3c_0 conda-forge
zstd 1.5.6 ha6fb4c9_0 conda-forge
Training was done on Nvidia A100 GPUs. The training converges quickly since this dataset doesn't have very large forces on the atoms. I then deployed the models to standalone format:
nequip-deploy build --train-dir "<path to train dir of model for unwrapped data>" model-unwrapped_08-aug-2024.pth
nequip-deploy build --train-dir "<path to train dir of model for wrapped data>" model-wrapped_08-aug-2024.pth
I then used the standalone model files to run LAMMPS jobs on a different cluster where I have more compute time. I compiled LAMMPS for CPU with pair_allegro and Kokkos (a combination which apparently is not available yet on NERSC). I used the 2 August 2023 version of LAMMPS and, based on your previous advice, patched it with pair_allegro commit 20538c9, which is the current main and one commit later than the patch (89e3ce1) that was supposed to fix empty domains in non-Kokkos simulations. I compiled LAMMPS with GCC 10 compilers, FFTW3 from AMD AOCL 4.0 (built with GCC), GNU MPICH 3.3.2, and also exposed Intel oneAPI 2024.1 MKL libraries to satisfy the CMake configuration step. I linked against libtorch 2.0.0 (CPU, CXX11 ABI) based on the advice of NERSC consultants.
My CMake generation command
cmake ../cmake \
-D CMAKE_BUILD_TYPE=Debug \
-D LAMMPS_EXCEPTIONS=ON \
-D BUILD_SHARED_LIBS=ON \
-D BUILD_MPI=yes \
-D BUILD_OMP=yes \
-C ../cmake/presets/kokkos-openmp.cmake \
-D PKG_KOKKOS=yes \
-D Kokkos_ARCH_ZEN3=yes \
-D BUILD_TOOLS=no \
-D FFT=FFTW3 \
-D FFT_KOKKOS=FFT3W \
-D FFTW3_INCLUDE_DIR=$AOCL_ROOT/include \
-D FFTW3_LIBRARY=$AOCL_LIB/libfftw3.so \
-D FFTW3_OMP_LIBRARY=$AOCL_LIB/libfftw3_omp.so \
-D CMAKE_INSTALL_PREFIX="$LAMMPS_ROOT" \
-D PKG_MANYBODY=yes \
-D PKG_MOLECULE=yes \
-D PKG_KSPACE=yes \
-D PKG_REPLICA=yes \
-D PKG_ASPHERE=yes \
-D PKG_RIGID=yes \
-D PKG_MPIIO=yes \
-D PKG_COMPRESS=yes \
-D PKG_H5MD=no \
-D PKG_OPENMP=yes \
-D CMAKE_POSITION_INDEPENDENT_CODE=yes \
-D CMAKE_EXE_FLAGS="-dynamic" \
-D FFT_FFTW_THREADS=on
I set up several LAMMPS jobs using four randomly selected frames of that dataset_1593.xyz dataset as initial geometries, together with my two trained Allegro models (model-wrapped_08-aug-2024.pth and model-unwrapped_08-aug-2024.pth).
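For reference, a frame can be exported to a LAMMPS data file with ASE roughly like this (a sketch; the frame index and file names are placeholders):

```python
from ase.io import read, write

frames = read("dataset_1593.xyz", index=":")
frame = frames[596]  # e.g. frame 596, as in the example geometry below

# "atomic" atom_style; species order must match the pair_coeff line (H O).
write(
    "initial_frame.data",
    frame,
    format="lammps-data",
    specorder=["H", "O"],
    atom_style="atomic",
)
```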
Example LAMMPS input script (input.lammps)
# PART A - ENERGY MINIMIZATION
# 1) Initialization
units metal
dimension 3
atom_style atomic
boundary p p p
# 2) System definition
# initial_frame.data will be written into the working directory where this
# script is located.
read_data initial_frame.data
# 3) Simulation settings
# pair_style lj/cut 2.5
mass 1 2.016
mass 2 15.999
pair_style allegro3232
pair_coeff * * ../../h2o-behlerdataset-allegro-train-layerhpo-fe76_07-aug-2024/model-unwrapped_08-aug-2024.pth H O
# Or "model-wrapped_08-aug-2024.pth" if using frames from the wrapped dataset.
# 4) Visualization
thermo 1
thermo_style custom step time temp pe ke etotal epair ebond econserve fmax
# Also want to dump the CG minimization trajectory
dump mintraj all atom 1 minimization.lammpstrj
# 5) Run CG minimization, doing a single static balance first to print out subdomain cut locations.
balance 1.0 shift xyz 100 1.0
minimize 1.0e-8 1.0e-8 1000 1000000000
undump mintraj
reset_timestep 0 time 0.0
# PART B - MOLECULAR DYNAMICS
delete_atoms overlap 0.1 all all
# Logging
thermo 1
# Try to rebuild neighbor lists more often
neigh_modify every 1 delay 0 check yes binsize 10.0
# Also try to specify larger cutoff for ghost atoms to avoid losing atoms.
comm_modify mode single cutoff 10.0 vel yes
# Try specifying initial velocities for all atoms
velocity all create 298.0 4928459 dist gaussian
# Run MD in the NVT ensemble, with a Nosé-Hoover thermostat starting at 298.0 K.
fix mynose all nvt &
temp 298.0 298.0 0.011 &
# Be sure to dump the MD trajectory
dump mdtraj all atom 1 mdtraj.lammpstrj
dump mdforces all custom 40 mdforces.lammpstrj id x y z vx vy vz fx fy fz
timestep 0.0005
# Normal run, with a single balance first
balance 1.0 shift xyz 100 1.0
run 20000
undump mdtraj
undump mdforces
Example input geometry (frame 596 of the original dataset)
model-unwrapped_08-aug-2024.pth-frame596-kokkos/initial_frame.data (written by ASE)
192 atoms
2 atom types
0.0 23.14461 xlo xhi
0.0 23.14461 ylo yhi
0.0 23.14461 zlo zhi
Atoms
1 2 5.9091500000000003 9.6715599999999995 3.23007
2 1 6.0198799999999997 11.3194 2.4536199999999999
3 1 5.4012099999999998 9.8348099999999992 4.9311199999999999
4 2 22.080200000000001 1.6895899999999999 13.5389
5 1 21.2834 0.52559500000000003 14.782999999999999
6 1 20.552199999999999 2.33968 12.6881
7 2 11.1053 14.1869 6.9613100000000001
8 1 10.151300000000001 13.8668 5.6778700000000004
9 1 12.3499 12.8649 6.86449
10 2 16.049700000000001 19.7988 10.1698
11 1 17.095400000000001 18.6266 9.2250499999999995
12 1 16.745999999999999 19.773900000000001 11.8696
13 2 7.2639100000000001 14.3575 22.238099999999999
14 1 6.4026899999999998 14.9102 20.775099999999998
15 1 8.9776199999999999 14.8261 21.689800000000002
16 2 14.0701 11.291600000000001 13.6586
17 1 13.208399999999999 12.4938 12.5162
18 1 13.355499999999999 11.535299999999999 15.2211
19 2 2.2179700000000002 6.5658799999999999 11.150600000000001
20 1 2.0281199999999999 7.61151 12.1075
21 1 2.2052399999999999 5.03423 12.0829
22 2 23.1968 17.582100000000001 12.8947
23 1 1.1530100000000001 16.1249 13.3102
24 1 1.0125599999999999 18.971399999999999 12.323700000000001
25 2 13.076599999999999 14.747400000000001 0.96582699999999999
26 1 14.6065 15.859400000000001 1.2677700000000001
27 1 13.7464 13.523099999999999 2.1084399999999999
28 2 22.289999999999999 23.165099999999999 7.9897999999999998
29 1 22.927099999999999 0.161326 9.6950800000000008
30 1 0.51338300000000003 22.870999999999999 6.6900599999999999
31 2 18.6097 13.7941 11.648199999999999
32 1 18.463799999999999 15.2698 10.5924
33 1 17.077100000000002 13.6417 12.535
34 2 8.8639799999999997 22.649000000000001 3.5996299999999999
35 1 10.118399999999999 21.665299999999998 4.2162199999999999
36 1 9.3582199999999993 0.085460999999999995 1.9719199999999999
37 2 2.05017 16.757200000000001 4.8366499999999997
38 1 1.44848 15.0244 5.2370700000000001
39 1 3.0690900000000001 17.333600000000001 6.0353300000000001
40 2 22.596299999999999 10.3612 14.7997
41 1 21.734500000000001 8.8241099999999992 14.8378
42 1 21.906500000000001 11.1448 13.2875
43 2 4.3652499999999996 18.384399999999999 19.2408
44 1 3.2898800000000001 16.997699999999998 18.443200000000001
45 1 4.6726099999999997 19.560600000000001 17.828199999999999
46 2 5.5318699999999996 9.4371899999999993 7.85025
47 1 6.7868599999999999 8.3062900000000006 8.1544899999999991
48 1 3.9088400000000001 8.6284299999999998 8.6374399999999998
49 2 20.0229 4.7849899999999996 20.125499999999999
50 1 21.5364 5.3525200000000002 19.273299999999999
51 1 18.720700000000001 5.8096899999999998 19.124700000000001
52 2 13.950200000000001 22.8736 21.118300000000001
53 1 14.283899999999999 21.484200000000001 22.181999999999999
54 1 12.941000000000001 22.204699999999999 19.863
55 2 20.478899999999999 6.6597200000000001 7.2229599999999996
56 1 20.589700000000001 8.1250999999999998 8.3511699999999998
57 1 19.509 7.0342799999999999 5.7630600000000003
58 2 11.3894 3.5159799999999999 0.69348500000000002
59 1 12.741899999999999 3.6259199999999998 1.8147
60 1 12.0502 2.2654100000000001 22.613800000000001
61 2 4.29528 8.6966599999999996 18.260999999999999
62 1 5.0197099999999999 7.1608200000000002 19.0062
63 1 5.0883900000000004 9.8124300000000009 17.133199999999999
64 2 19.012699999999999 15.0678 21.8294
65 1 20.098199999999999 13.677 21.0869
66 1 19.755400000000002 16.159800000000001 22.837800000000001
67 2 18.5093 18.2121 14.6373
68 1 18.847100000000001 19.651299999999999 15.6972
69 1 20.258400000000002 17.533999999999999 14.4245
70 2 20.737200000000001 6.5509399999999998 1.7896799999999999
71 1 20.214300000000001 6.2219499999999996 0.10664999999999999
72 1 20.418800000000001 5.1312300000000004 2.89337
73 2 11.9665 15.8748 12.234299999999999
74 1 11.057 15.8642 10.595700000000001
75 1 12.898199999999999 17.384 12.1503
76 2 7.3605999999999998 15.648300000000001 3.7366000000000001
77 1 6.3703000000000003 15.7403 2.1144099999999999
78 1 6.0848300000000002 15.904400000000001 5.0577399999999999
79 2 11.4945 12.6968 18.411100000000001
80 1 11.9612 11.419499999999999 19.6524
81 1 12.6492 14.269299999999999 18.424700000000001
82 2 6.2639100000000001 4.2544399999999998 21.3217
83 1 4.6653799999999999 3.6882799999999998 22.102699999999999
84 1 7.4899199999999997 4.5268499999999996 22.556100000000001
85 2 1.97488 2.7933599999999998 0.38653500000000002
86 1 1.5985100000000001 2.8812000000000002 2.2288800000000002
87 1 0.46284500000000001 3.76661 22.937000000000001
88 2 15.752800000000001 1.73661 16.740600000000001
89 1 15.033899999999999 2.6949800000000002 15.4696
90 1 14.998799999999999 0.69597399999999998 18.0061
91 2 1.0868500000000001 4.8048799999999998 17.550000000000001
92 1 1.2302599999999999 6.5176400000000001 17.189800000000002
93 1 0.442882 3.9655900000000002 16.058299999999999
94 2 4.2452399999999999 23.068000000000001 6.1464999999999996
95 1 4.5604500000000003 22.587299999999999 4.4141199999999996
96 1 4.2884799999999998 1.8068200000000001 6.0832800000000002
97 2 8.8760899999999996 20.350000000000001 21.2438
98 1 8.4095899999999997 18.556999999999999 21.336300000000001
99 1 9.3341899999999995 20.768999999999998 -0.27107100000000001
100 2 1.2605500000000001 11.5184 3.75576
101 1 -0.0060600000000000003 11.8523 2.5569799999999998
102 1 2.7166100000000002 10.385400000000001 2.84429
103 2 21.119399999999999 18.853000000000002 1.2561500000000001
104 1 22.4404 18.017399999999999 2.3485900000000002
105 1 22.2773 19.497399999999999 0.028347000000000001
106 2 13.842000000000001 10.01 4.5879599999999998
107 1 13.560499999999999 9.4886400000000002 2.7970199999999998
108 1 15.4298 10.9476 4.7816700000000001
109 2 20.032299999999999 18.190999999999999 7.6835699999999996
110 1 20.9389 20.223199999999999 7.5477999999999996
111 1 21.233899999999998 17.8385 9.0412300000000005
112 2 9.7168899999999994 6.8385300000000004 8.7584900000000001
113 1 9.2112200000000009 5.1589 7.9785599999999999
114 1 9.6004199999999997 6.3682999999999996 10.559699999999999
115 2 7.6963800000000004 12.7331 14.7662
116 1 7.3576100000000002 13.795999999999999 13.268700000000001
117 1 9.4222699999999993 12.6839 15.1022
118 2 20.122800000000002 22.088200000000001 17.957899999999999
119 1 19.1126 0.304678 18.636299999999999
120 1 21.934100000000001 21.968599999999999 18.249700000000001
121 2 12.8415 19.258099999999999 5.8808800000000003
122 1 13.239100000000001 19.8367 7.6356799999999998
123 1 11.4961 18.447900000000001 6.2407700000000004
124 2 14.9815 16.554600000000001 18.867599999999999
125 1 15.806100000000001 16.940799999999999 17.265499999999999
126 1 17.521599999999999 15.5898 21.355899999999998
127 2 0.834036 11.5718 21.333100000000002
128 1 1.8740600000000001 12.8818 22.0655
129 1 1.97116 10.260199999999999 20.686599999999999
130 2 9.2284900000000007 7.4919900000000004 14.4091
131 1 9.9167799999999993 7.5134299999999996 15.6699
132 1 8.8360599999999998 9.2631099999999993 14.0289
133 2 11.2926 9.4404199999999996 23.383500000000002
134 1 11.2235 7.7517800000000001 23.410699999999999
135 1 9.5579999999999998 9.8135600000000007 0.48632599999999998
136 2 17.805299999999999 7.6026400000000001 14.9404
137 1 17.738900000000001 8.6731300000000005 13.872199999999999
138 1 18.512699999999999 6.5323500000000001 13.7552
139 2 13.8725 6.8589799999999999 18.3369
140 1 14.7568 7.7961799999999997 16.747599999999998
141 1 14.2013 4.9680799999999996 18.401
142 2 23.0228 11.8605 8.8937100000000004
143 1 23.116099999999999 11.4161 7.14086
144 1 21.6477 13.0159 8.9233600000000006
145 2 8.7789699999999993 1.75627 9.8980899999999998
146 1 10.4735 1.3146899999999999 10.7338
147 1 8.6431199999999997 0.58794000000000002 8.4552099999999992
148 2 16.2333 10.4991 21.402000000000001
149 1 15.495200000000001 9.1909200000000002 20.715499999999999
150 1 14.9786 11.777100000000001 21.8293
151 2 19.203800000000001 2.22892 4.6112099999999998
152 1 18.985499999999998 1.41039 3.15361
153 1 19.7928 1.4284600000000001 6.0571200000000003
154 2 5.6809399999999997 3.0451199999999998 13.1189
155 1 7.1078200000000002 2.3069199999999999 12.010300000000001
156 1 6.3725800000000001 4.65273 13.7621
157 2 4.4194199999999997 19.215199999999999 10.3683
158 1 5.1798700000000002 20.2973 9.0558099999999992
159 1 5.5698600000000003 17.738399999999999 10.7018
160 2 14.1355 1.4335500000000001 10.208299999999999
161 1 13.9558 2.3745500000000002 8.6041399999999992
162 1 15.8993 23.871400000000001 10.3802
163 2 2.6869399999999999 4.59931 5.1904700000000004
164 1 3.9838100000000001 5.8780299999999999 4.4879899999999999
165 1 1.6192 5.6871299999999998 6.0968900000000001
166 2 18.5716 4.4131600000000004 11.1807
167 1 18.793900000000001 4.3240100000000004 9.4553799999999999
168 1 16.422000000000001 3.7141500000000001 11.2049
169 2 14.390499999999999 4.1494999999999997 5.31778
170 1 16.127500000000001 4.6077500000000002 4.9381399999999998
171 1 13.5379 5.9167800000000002 5.6488399999999999
172 2 11.933199999999999 20.552900000000001 14.781700000000001
173 1 13.100099999999999 20.525500000000001 16.201599999999999
174 1 10.1973 20.4255 15.465400000000001
175 2 18.6493 12.014699999999999 3.00224
176 1 19.017399999999999 12.237500000000001 1.32315
177 1 19.223800000000001 10.126899999999999 2.9812400000000001
178 2 6.4923799999999998 14.4697 10.1922
179 1 5.3334700000000002 14.067399999999999 8.8764900000000004
180 1 7.8080600000000002 13.923400000000001 9.6611799999999999
181 2 6.55314 21.3005 15.272399999999999
182 1 6.0889600000000002 20.022300000000001 13.979799999999999
183 1 5.2370400000000004 22.694900000000001 15.271599999999999
184 2 16.850000000000001 21.171299999999999 2.4119799999999998
185 1 17.9634 19.734100000000002 2.57178
186 1 15.6921 20.852900000000002 3.6812200000000002
187 2 1.12944 21.249300000000002 21.300999999999998
188 1 2.2810199999999998 19.896999999999998 20.537099999999999
189 1 1.8933500000000001 22.6784 22.1388
190 2 2.6025700000000001 14.353199999999999 16.236699999999999
191 1 4.3327799999999996 13.993 16.651299999999999
192 1 1.8016700000000001 12.796900000000001 15.3421
I ran these 8 jobs both with and without Kokkos, like this:
Example job script
#!/bin/bash
#SBATCH --job-name=model-unwrapped_08-aug-2024.pth-frame607
#SBATCH --account=...
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=128
#SBATCH --exclusive
#SBATCH --time=10:00
#SBATCH --error=vt_lammps%j.err
#SBATCH --output=vt_lammps%j.out
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#
#SBATCH --open-mode=append
# OpenMP parallelization
export OMP_NUM_THREADS=1
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
# By default, prefer the GCC10 build of LAMMPS + pair_allegro
module load lammps-tpc/2Aug23/gcc10-allegro-cpu
# Ensure that stack size is unlimited, or you may get a segfault error when
# attempting to run a MPI job.
ulimit -s unlimited
ulimit -S unlimited
ulimit -H unlimited
srun lmp -in input.lammps
# or "srun lmp -k on -sf kk -pk kokkos neigh full -in input.lammps", if running with Kokkos.
Nearly all cases result in the Torch reshape error after the simulation has proceeded for some number of steps. (The one case that did not instead ended in a segfault.)
| Job | Last MD or minimization step | Torch reshape error? |
|---|---|---|
| model-wrapped_08-aug-2024.pth-frame607-kokkos | 321 | True |
| model-wrapped_08-aug-2024.pth-frame607 | 1741 | True |
| model-wrapped_08-aug-2024.pth-frame596-kokkos | 31 | True |
| model-wrapped_08-aug-2024.pth-frame596 | 454 | True |
| model-wrapped_08-aug-2024.pth-frame1351-kokkos | 270 | True |
| model-wrapped_08-aug-2024.pth-frame1351 | 616 | True |
| model-wrapped_08-aug-2024.pth-frame1252-kokkos | 179 | True |
| model-wrapped_08-aug-2024.pth-frame1252 | 1497 | True |
| model-unwrapped_08-aug-2024.pth-frame607-kokkos | 330 | True |
| model-unwrapped_08-aug-2024.pth-frame607 | 271 | True |
| model-unwrapped_08-aug-2024.pth-frame596-kokkos | 31 | True |
| model-unwrapped_08-aug-2024.pth-frame596 | 1744 | True |
| model-unwrapped_08-aug-2024.pth-frame1351-kokkos | 290 | False |
| model-unwrapped_08-aug-2024.pth-frame1351 | 754 | True |
| model-unwrapped_08-aug-2024.pth-frame1252-kokkos | 182 | True |
| model-unwrapped_08-aug-2024.pth-frame1252 | 1734 | True |
Additionally, I examined the domain decomposition chosen for one typical job, using the X, Y, and Z cut locations printed to the screen by the LAMMPS balance command to count how many atoms were in each domain at each frame of the simulation. While the precision of these cut points undoubtedly introduces some rounding error, I was surprised to find that 32 of the 128 domains were empty at the first frame of the MD simulation. So it may be more complex than a domain simply going empty at a later point in the simulation.
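A sketch of this kind of per-domain counting (the cut locations, grid shape, and file name below are placeholders for the values actually printed by balance):

```python
import numpy as np
from ase.io import read

# Hypothetical interior cut locations (fractions of the box along each axis)
# for a 4 x 4 x 8 = 128-domain grid.
xcuts = [0.25, 0.50, 0.75]
ycuts = [0.25, 0.50, 0.75]
zcuts = [0.125, 0.25, 0.375, 0.50, 0.625, 0.75, 0.875]

atoms = read("mdtraj.lammpstrj", index=0, format="lammps-dump-text")
frac = atoms.get_scaled_positions(wrap=True)

# Bin each atom into its subdomain along each axis, then count occupancy.
ix = np.digitize(frac[:, 0], xcuts)
iy = np.digitize(frac[:, 1], ycuts)
iz = np.digitize(frac[:, 2], zcuts)
nx, ny, nz = len(xcuts) + 1, len(ycuts) + 1, len(zcuts) + 1
counts = np.bincount((ix * ny + iy) * nz + iz, minlength=nx * ny * nz)
print("empty domains:", int((counts == 0).sum()), "of", nx * ny * nz)
```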
However, my limited testing does support the idea that empty domains make the simulation more likely to crash with a Torch reshape error. On another system I'm researching with ~500 atoms (water-solvated transition metal atoms), I see an approximate inverse relationship between the number of steps my MD simulation completes before a Torch reshape crash and the number of domains:
- 64 MPI tasks: 845 steps
- 32 MPI tasks: 1644 steps
- 16 MPI tasks: O(5000) steps by the time the job times out at 10 min.
This holds even though using fewer domains tends, for some reason, to produce different results in the pre-MD conjugate gradient minimization: 16 MPI tasks yield a minimized geometry with atoms concentrated in one half of the cubic box, while larger numbers of MPI tasks yield a more uniformly distributed geometry. The fact that the 16-task job still does not produce a Torch reshape error even at O(5000) steps makes empty domains seem the more likely cause of this error.
I'm not sure what else to try. I've tried things like forcing a domain rebalance after each MD step and increasing the neighbor list and ghost atom communication cutoffs, but I'm still encountering Torch reshape errors for all but the smallest number of domains.
Do you have any guidance on what to try, or tests you'd like me to run?
Thanks!
Edit: running larger systems with the same domain decomposition seems to work. The water system above is 64 waters in a box. If I instead replicate it 3 times in each dimension (1728 waters), I can run 20k steps on 128 MPI tasks with no Torch reshape error, both with and without Kokkos.
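For completeness, that replication uses the standard LAMMPS replicate command, placed after read_data in the input script (a sketch):

```
read_data initial_frame.data
# Make a 3x3x3 supercell: 64 waters -> 1728 waters
replicate 3 3 3
```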