OPTIMADE icon indicating copy to clipboard operation
OPTIMADE copied to clipboard

Extend site_coordinate_span with space-filling octahedra

Open rartino opened this issue 3 months ago • 10 comments

Upon discussing with @d-beltran @sauliusg we find that the unit cells actually used in life sciences MD simulations is often non-parallelepiped, i.e., they use other space-filling octahedra.

This is actually important to be able to represent relevant MD data. We figure that it means we need to extend the newly added site_coordinate_span field in structures; so that it also gets "inherited" into the trajectory endpoint.

It may be enough if we can use that field to specify one of the parallelohedron and say that when doing this, it is meant to go along with an orthogonal unit cell that then describes the smallest enclosing rectangular cuboid. At least this seems to cover the possibilities in GROMACS, and match the way these cells are specified there.

rartino avatar Sep 17 '25 12:09 rartino

Meant to mention this one in the meeting, do you want to add these options before v1.3 @rartino ?

ml-evs avatar Nov 04 '25 17:11 ml-evs

I suggest we bring this up for discussion during the next web meet. If we are extending the values available for site_coordinate_span it seems smart to just have them all in the first release with that field.

rartino avatar Nov 05 '25 23:11 rartino

I am positive about extending the site_coordinate_span. If we just add extra possibilities then it should work just fine, in a backwards compatible way. I would include all possible parallelohedra since there is a small finite number of them. The unit cell the wold give translation vectors that translate this shape to fill the whole space in a periodic way.

Let's discuss during the next meeting.

sauliusg avatar Nov 06 '25 07:11 sauliusg

I started digging a bit to look back into the question if just specifying a space-filling parallelohedra is sufficient or if there are degrees of freedom that needs to be specified, and I found that GROMACS documentation states that these options are "special cases of a triclinic box.". And as I dig a bit deeper, I'm not sure these simulations work the way I understood it when we discussed this at the workshop.

I think these box settings are just shorthand settings for selecting a normal "triclinic cell" in the sense of a cell that can be described by three vectors (i.e., a general parallellepiped). For the parallelohedra commonly used in simulation, they specify lattice vectors with lengths and angles that, for us coming from crystallographic terminology, are the primitive cells of an FCC lattice (where the Wigner-Seitz cell is a rhombic dodecahedron) or a BCC lattice (truncated octahedron). If I've understood things right now, the simulations do not enforce periodic boundary conditions on the sides of these parallelohedra, they are in practice still carried out on the "triclinic cell", i.e., a parallellepiped, just like we do in solid state. Meaning: I'm starting to think the current definition of site_coordinate_span along with lattice_vectors is sufficient to represent what is being done here. However, we could add an extra field somewhere to clarify the symmetry of the lattice, so this doesn't need to be determined from lattice_vectors.

Maybe @d-beltran can help shine some light on this: does the above make sense to you, or am I on the wrong track now?

rartino avatar Nov 06 '25 08:11 rartino

I am afraid I lack the geometry background necessary to fully catch up with this discussion, but if I am not wrong we concluded in the workshop that rhombic dodecahedron and truncated octahedron boxes can be described with 3 vectors (thus falling in the definiton of triclinic boxes?) so the current specification may be enough to describe them. However they are not parallellepiped, am I right?

d-beltran avatar Nov 06 '25 10:11 d-beltran

I started digging a bit to look back into the question if just specifying a space-filling parallelohedra is sufficient or if there are degrees of freedom that needs to be specified, and I found that GROMACS documentation states that these options are "special cases of a triclinic box.". And as I dig a bit deeper, I'm not sure these simulations work the way I understood it when we discussed this at the workshop.

Yes, I think you are right – the proposed polyhedra are different fundamental domains for a triclinic (P1) lattice.

I think these box settings are just shorthand settings for selecting a normal "triclinic cell" in the sense of a cell that can be described by three vectors (i.e., a general parallellepiped). For the parallelohedra commonly used in simulation, they specify lattice vectors with lengths and angles that, for us coming from crystallographic terminology, are the primitive cells of an FCC lattice (where the Wigner-Seitz cell is a rhombic dodecahedron) or a BCC lattice (truncated octahedron).

I think I a sort of understand what you mean, but the FCC and BCC are cubuc lattices (with extra symmetry and angle-length constraints on unit cell vectors), and here we do not have these constraints – the lattice can be, in general, triclinic.

If I've understood things right now, the simulations do not enforce periodic boundary conditions on the sides of these parallelohedra,

From what I understood when reading the GROMACS manual, there are always periodic boundary conditions, and the "funny" shapes ( rhombic dodecahedron truncated octahedron) are chosen only to make the simulation cell closer to the molecule shape. Still need to do the math properly...

they are in practice still carried out on the "triclinic cell", i.e., a parallellepiped, just like we do in solid state.

Triclinic cell does not exclude periodic boundary conditions. Also, they (GROMACS) use rectangular parallelipiped for the simulation box, even though the unit cell defined by the translation vectors is not rectangular...

See https://manual.gromacs.org/current/reference-manual/algorithms/periodic-boundary-conditions.html .

In fact, any "honeycomb" lattice (i.e. the tesselation were vertices, edges and faces of adjacent polyhedra match exactly) can be used with the periodic boundary conditions, no?

Meaning: I'm starting to think the current definition of site_coordinate_span along with lattice_vectors is sufficient to represent what is being done here. However, we could add an extra field somewhere to clarify the symmetry of the lattice, so this doesn't need to be determined from lattice_vectors.

Well, in principle, yes; but maybe specifying the simulation box for the trajectories would still be useful?

UPD: actually, we probably need explicit formulation of the represented box, because the algorithm of how to cover the closest neighbours will depend on the box. Imagine a hexagonal honeycomb on the plane vs a square grid...

Maybe @d-beltran can help shine some light on this: does the above make sense to you, or am I on the wrong track now?

sauliusg avatar Nov 07 '25 13:11 sauliusg

@sauliusg

Triclinic cell does not exclude periodic boundary conditions.

A quick clarification: all these cases uses periodic boundary conditions. The question is what periodic boundaries are enforced: is it on the "triclinic" parallelepiped cell spanned by three lattice vectors, or on the rhombic dodecahedron (or truncated octahedron)? I think at least in GROMACS it is the former.

I think I a sort of understand what you mean, but the FCC and BCC are cubuc lattices (with extra symmetry and angle-length constraints on unit cell vectors), and here we do not have these constraints – the lattice can be, in general, triclinic.

They are cubic lattices, but their primitive unit cells are not cubic, formally I think they are rhombohedral. I strongly suspect the so called "triclinic cell" used when asking GROMACS to use a rhombic dodecahedron box is, in fact, that rhombohedral cell. I.e., if we look in the output files we will find that it says the simulation box was spanned by three lattice vectors that span a rhombohedron.

Maybe the simplest next step is to download one of the trajectories in MDDB marked rhombic dodecahedron or truncated octahedron and check the simulation box used, as well as if the coordinates are capped to stay within this rhombohedron or rather the dodecahedron/octahedron.

Well, in principle, yes; but maybe specifying the simulation box for the trajectories would still be useful?

My point is that lattice_vectors may still be sufficient for this, if these "special boxes" are only used as syntactic sugar to specify a cell spanned by three lattice vectors.

rartino avatar Nov 07 '25 14:11 rartino

Based on some info shared by @d-beltran, my conclusion is that this setting in, e.g., GROMACS is only for choosing appropriate lengths and angles for the usual parallelepiped simulation box such that the user "in post" can map positions into these shapes. Hence, there is no need to add these to site_coordinate_span and we can close this issue. But, I'll leave it open for now and we can discuss on the web meet on Friday.

rartino avatar Dec 04 '25 09:12 rartino

I think a similar analysis has been floated here before. To verify and contribute to the discussion, I also asked the MD specialists in FAIRmat, and am copying their answers below verbatim:

Nathan Daelman — 11/7/25, 6:00 PMFriday, November 7, 2025 at 6:00 PM @Bernadette Mohr @Joseph Rudzinski or anyone else: this has come up a few times now in the OPTIMADE discussions, but apparently GROMACS uses a particular alternative way for unit-cell representations, i.e. octahedrons, dodecahedrons. How do you handle this within NOMAD? Any insight on this choice? Typical lattice representations of crystal systems are either parallelepipeds or hexagonal. (edited)Friday, November 7, 2025 at 6:01 PM

Nathan Daelman — 11/7/25, 6:33 PMFriday, November 7, 2025 at 6:33 PM This is the OPTIMADE PR in question: https://github.com/Materials-Consortia/OPTIMADE/issues/552 Feel free to chime in

Bernadette Mohr — 11/10/25, 10:29 AMMonday, November 10, 2025 at 10:29 AM It is not just GROMACS, MD suites geared towards macromolecular simulations relatively commonly offer these box shapes. It is simply a matter of keeping enough padding around the simulated molecules, while still limiting the box sizes as much as possible to keep the computational load manageable.

As far as I am aware, NOMAD does not fully support these types of boxes yet.

Nathan Daelman — 11/10/25, 10:34 AMMonday, November 10, 2025 at 10:34 AM

It is simply a matter of keeping enough padding around the simulated molecules, while still limiting the box sizes as much as possible to keep the computational load manageable.

This was one of the possibilities raised too, but the GROMACS docs seemed to hint at other reasons too. The padding here is of course necessary to preserve translational symmetry, as with crystals.

[10:35 AM]Monday, November 10, 2025 at 10:35 AM

As far as I am aware, NOMAD does not fully support these types of boxes yet.

Part of the question is whether OPTIMADE should, or should at least acknowledge when these were used. So in NOMAD, do you convert them back to the more crystal-aligned boxes?

Bernadette Mohr — 11/10/25, 11:18 AMMonday, November 10, 2025 at 11:18 AM

So in NOMAD, do you convert them back to the more crystal-aligned boxes?

Yes, I think this is what is currently happening. @Joseph Rudzinski , please correct me if I'm not aware of something.

Part of the question is whether OPTIMADE should, or should at least acknowledge when these were used.

I would say the box shape is a relevant simulation parameter, and therefore, the information of the original box type should be preserved. On the other hand, this is mostly relevant for biomolecular simulations. As long as it affects only a small fraction of our user base, it is probably not on the top of our priority list to accommodate.

Nathan Daelman — 11/10/25, 11:29 AMMonday, November 10, 2025 at 11:29 AM Well, this was more so prompted for the OPTIMADE community at this point. Pls feel free to tell them this exact point in the issue I linked

Joseph Rudzinski — 11/10/25, 12:39 PMMonday, November 10, 2025 at 12:39 PM Gromacs (and other MD software) use non-parallelepipeds box shapes to minimize the amount of solvent that needs to be simulated when investigating large macromolecules, e.g., proteins. Dodecahedron, e.g., is the most widely accepted shape for large protein simulations.,

Our Gromacs and Lammps parsers rely on MDAnalysis to extract/convert the box info to lattice vectors (although we could also grab this info from the raw files if need be). However, I believe that in Gromacs you do not provide input to define an actual dodecahedron, but rather you supply triclinic vectors (or equivalent matrix) which is then converted to dodecahedron playing some games with the periodic images (don't quote me on that).,

In any case, we would have to look into this more if we want to support properly. Atm I don't think it really affects much, the visualizer prob. still works, and in pricipal the information is there, but we should (eventually) make it more explicit in the schema I guess

ndaelman-hu avatar Dec 08 '25 17:12 ndaelman-hu

@ndaelman-hu

Thanks for sharing this conversation.

However, I believe that in Gromacs you do not provide input to define an actual dodecahedron, but rather you supply triclinic vectors (or equivalent matrix) which is then converted to dodecahedron playing some games with the periodic images (don't quote me on that).,

I think this is what I was trying to say above and on the last web meet about GROMACS not "really" running the simulation in an actual dodecahedron cell. Next time I meet someone with deeper insights into GROMACS internals I'll ask about this. However, it is very relevant for this topic if there are other popular macromolecular MD codes that doesn't use this "trick" but runs the simulation in a dodecahedron with periodic boundaries on its surfaces. If the latter is somewhat common, it may still be meaningful for OPTIMADE to preserve the info.

@d-beltran Do you know what other of the more popular MD codes for macromolecular simulations that support dodecahedrons? I tried looking through the MDDB data I could find online (BioExcel COVID-19, MDposit) but have so far only seen GROMACS used for dodecahedron simulations.

rartino avatar Dec 09 '25 07:12 rartino