Add `bonds` for structure type entries

Open merkys opened this issue 1 year ago • 10 comments

In issue #426 I proposed adding more chemical properties to OPTIMADE structures. This PR implements my suggestion on representation of chemical connectivity between pairs of sites in OPTIMADE:

{
    "bonds": [ {"sites": [1, 2]} ]
}

I intentionally omit the bond types as this might be difficult to agree upon, whereas having just the connectivity is already beneficial.

Pinging people who have expressed their interest for comments: @eimrek @BobHanson @Austin243

Edit: I have introduced means to express connections with translation equivalents of the sites in `sites`.

merkys avatar Jun 06 '23 11:06 merkys

It has been pointed out that bonds might cross the unit cell boundary, and this should be reflected as well. I will update the PR to include this piece of information.

Edit: Added a way to describe translations.

merkys avatar Jun 07 '23 13:06 merkys

Pinging @d-beltran for comments on how this proposal suits macromolecules.

merkys avatar Jun 08 '23 07:06 merkys

Pinging @utf who participated in the discussions.

merkys avatar Jun 08 '23 07:06 merkys

Works for me :)

> I intentionally omit the bond types as this might be difficult to agree upon

The type of bond is not specified in most topology formats in our field, which I find very inconvenient. So I am happy to know this is provisional and the type will be specified in the future.

d-beltran avatar Jun 08 '23 12:06 d-beltran

@eimrek suggested keeping a translation vector for only one of the sites, since at least one site will stay in the unit cell or can be translated there. How about this:

{
    "sites": [63, 64],
    "translation_site": 1, // the second of the two sites is translated
    "translation_vector": [0, 0, 1]
}

The `translation_site` field is required because either of the two sites could be the translated one, and the order of indices in `sites` has to be respected. Or would a more structured form be better: `"translation": { "site": 1, "vector": [0, 0, 1] }`?
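To make the proposed semantics concrete, here is a minimal Python sketch of how a client might turn such a bond record into a Cartesian bond vector. It assumes 0-based site indices, fractional coordinates, and a lattice given as three row vectors; the function name `bond_vector` and the example data are hypothetical and not part of the proposal.

```python
def bond_vector(bond, frac_coords, lattice):
    """Return the Cartesian vector from the first to the second site of a bond.

    Assumptions (not part of the proposal): site indices are 0-based,
    `frac_coords` holds fractional coordinates, and `lattice` holds the
    three lattice vectors as rows.
    """
    i, j = bond["sites"]
    # One lattice shift per site of the pair; only one of them may be non-zero.
    shift = [[0, 0, 0], [0, 0, 0]]
    if "translation_vector" in bond:
        shift[bond["translation_site"]] = bond["translation_vector"]
    # Fractional difference, including the lattice translation of either site.
    delta = [
        (frac_coords[j][k] + shift[1][k]) - (frac_coords[i][k] + shift[0][k])
        for k in range(3)
    ]
    # Fractional to Cartesian: sum over lattice vectors weighted by delta.
    return [sum(delta[k] * lattice[k][c] for k in range(3)) for c in range(3)]


# Hypothetical example: a cubic 10 Å cell with a bond crossing the cell face along c.
example_lattice = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
example_coords = [[0.0, 0.0, 0.9], [0.0, 0.0, 0.1]]
example_bond = {"sites": [0, 1], "translation_site": 1, "translation_vector": [0, 0, 1]}
example_vector = bond_vector(example_bond, example_coords, example_lattice)
```

Without the translation, the same pair of sites would appear to be about 8 Å apart instead of about 2 Å, which is why the translated site has to be identified explicitly.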

merkys avatar Jun 08 '23 12:06 merkys

@merkys I would leave at least the option to provide both of the translations. In the COD there are multiple non-polymeric molecules that span several unit cells, so both translation vectors will be needed to correctly represent the complete molecule. I attach an example of such a molecule to this comment (1540421.cif.txt, remove the txt extension before viewing), but @sauliusg could probably provide an even more extreme example (I seem to recall a molecule that spans 5 unit cells).

vaitkus avatar Jun 08 '23 21:06 vaitkus

Workshop: We are happy to merge an explicit bond data structure, but we must consider the exact format so that the types of queries one wants to do can be performed (with the present filter language, preferably). Inheriting the current CIF framework for this should be seriously considered.

rartino avatar Jun 09 '23 07:06 rartino

> @merkys I would leave at least the option to provide both of the translations. In the COD there are multiple non-polymeric molecules that span several unit cells, so both translation vectors will be needed to correctly represent the complete molecule. I attach an example of such a molecule to this comment (1540421.cif.txt, remove the txt extension before viewing), but @sauliusg could probably provide an even more extreme example (I seem to recall a molecule that spans 5 unit cells).

It is always possible to back-translate one of the sites into the primary unit cell without losing the connectivity information. Having both non-zero translation vectors is a matter of convenience, I think, or does this retain some more information?
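As an illustration of the back-translation argument, the following Python sketch collapses a bond with one translation per site into the single-translation form discussed above. The two-translation input layout and the function name `to_single_translation` are assumptions made for this example only. Note that the sketch keeps only the relative translation between the two sites; the absolute cell in which the pair was originally placed is discarded.

```python
def to_single_translation(sites, translations):
    """Collapse per-site lattice translations into one relative translation.

    `translations` is a pair of integer lattice vectors, one per site
    (a hypothetical input layout). Shifting both sites by the same lattice
    vector does not change the bond, so only the difference is kept and
    assigned to the second site.
    """
    t_first, t_second = translations
    relative = [b - a for a, b in zip(t_first, t_second)]
    bond = {"sites": list(sites)}
    if any(relative):
        bond["translation_site"] = 1  # index within the pair, as in the proposal
        bond["translation_vector"] = relative
    return bond
```

For example, a bond between site 63 translated by `[0, 0, 1]` and site 64 translated by `[0, 0, 2]` reduces to the same relative translation `[0, 0, 1]` on the second site.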

merkys avatar Jun 12 '23 07:06 merkys

> Workshop: We are happy to merge an explicit bond data structure, but we must consider the exact format so that the types of queries one wants to do can be performed (with the present filter language, preferably). Inheriting the current CIF framework for this should be seriously considered.

Unless we introduce some data redundancy here, the most powerful queries would be the ones based on correlated arrays (a.k.a. zips), as OPTIMADE does not support anything more intricate than that. To make `bonds` friendlier for zips, it should be defined as an array correlated with `species_at_sites`, possibly listing all neighbours of each site (a full symmetric matrix). But again, `species_at_sites` is just an array of labels, so one could not reliably query for, say, structures where an O atom has three or more bonds.
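A client-side Python sketch of the correlated layout may help here: it expands a pair list into one neighbour list per site, aligned with `species_at_sites`, and then counts bonds per site. The function names and the 0-based indexing are assumptions for illustration. The label comparison is a plain string match, which is exactly the limitation mentioned above, since a label is not guaranteed to be an element symbol.

```python
def neighbour_lists(n_sites, bonds):
    """Expand a list of bonded pairs into one neighbour list per site.

    The result has the same length and order as `species_at_sites`,
    so the two arrays can be zipped. Site indices are assumed 0-based.
    """
    neighbours = [[] for _ in range(n_sites)]
    for bond in bonds:
        i, j = bond["sites"]
        # Store each bond in both directions (symmetric representation).
        neighbours[i].append(j)
        neighbours[j].append(i)
    return neighbours


def sites_with_min_bonds(species_at_sites, bonds, label, min_bonds):
    """Return indices of sites whose label equals `label` and that have
    at least `min_bonds` bonds. The match is on the label string only."""
    neighbours = neighbour_lists(len(species_at_sites), bonds)
    return [
        index
        for index, (name, bonded) in enumerate(zip(species_at_sites, neighbours))
        if name == label and len(bonded) >= min_bonds
    ]
```

With labels `["O", "H", "H", "H"]` and bonds from site 0 to each of the other three, asking for `"O"` with at least three bonds selects site 0 only.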

merkys avatar Jun 12 '23 07:06 merkys

> It is always possible to back-translate one of the sites into the primary unit cell without losing the connectivity information. Having both non-zero translation vectors is a matter of convenience, I think, or does this retain some more information?

Well, if you back-translate sites into the primary unit cell you lose some information and end up with a set of disjoint bonded fragments. You could, of course, translate these fragments from the primary unit cell back into their proper place; however, it is not yet obvious to me that this is a straightforward task. What is the drawback of allowing translations to be specified for both sites?
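For reference, one possible way to reassemble a finite molecule from single-translation bonds is a breadth-first traversal that accumulates lattice offsets along the bonds. The Python sketch below assumes the `translation_site` / `translation_vector` layout proposed earlier in this thread and 0-based site indices; the function name `unwrap_fragment` is hypothetical. It does not settle whether this is robust for real data: for polymers or periodic networks a site can be reached with several different offsets, and the sketch simply keeps the first one found.

```python
from collections import deque


def unwrap_fragment(n_sites, bonds, start=0):
    """Assign a lattice offset to every site reachable from `start`.

    Returns a dict mapping site index to an integer offset (in unit cells)
    relative to the cell of `start`. Sites not bonded to the fragment are
    absent from the result.
    """
    # adjacency[site] -> list of (neighbour, offset of neighbour relative to site)
    adjacency = [[] for _ in range(n_sites)]
    for bond in bonds:
        i, j = bond["sites"]
        vec = list(bond.get("translation_vector", [0, 0, 0]))
        if bond.get("translation_site", 1) == 0:
            # Translating the first site by t is equivalent to
            # translating the second site by -t.
            vec = [-v for v in vec]
        adjacency[i].append((j, vec))
        adjacency[j].append((i, [-v for v in vec]))

    offsets = {start: (0, 0, 0)}
    queue = deque([start])
    while queue:
        site = queue.popleft()
        for neighbour, vec in adjacency[site]:
            if neighbour not in offsets:  # keep the first offset found
                offsets[neighbour] = tuple(
                    o + v for o, v in zip(offsets[site], vec)
                )
                queue.append(neighbour)
    return offsets
```

Applied to a four-site chain in which two consecutive bonds each cross the cell boundary along c, the traversal places the last site two cells away from the first, so the molecule is recovered as one connected piece spanning three cells.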

vaitkus avatar Jun 12 '23 08:06 vaitkus