
Distinguish experimental structures from theoretical

Open merkys opened this issue 2 years ago • 42 comments

As suggested by @BobHanson, there should be a standard means to distinguish between experimental and theoretical structures. This could be a property with boolean or enum values. I would suggest a "MUST" level of support (maybe even for queries), as I believe this bit of information should always be available.

merkys avatar May 31 '22 08:05 merkys

Could we get away with defining a new enum value in structure_features for this?

ml-evs avatar May 31 '22 08:05 ml-evs

Could we get away with defining a new enum value in structure_features for this?

Sounds good to me. Should it be theoretical for theoretical structures? Or rather experimental for experimental ones? Which is more natural? For me it is theoretical, but I come from an experimental background, hence my bias :smile:

merkys avatar May 31 '22 08:05 merkys
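If the flag did end up as a value in structure_features, a client could select such entries with the filter language's existing HAS operator. The following is only a sketch under stated assumptions: the value "theoretical" is the proposal from this thread and is not a standard structure_features value, and the base URL is a placeholder rather than a real provider.

```python
from urllib.parse import quote, urlencode

# Placeholder base URL (assumption); substitute a real OPTIMADE provider.
BASE_URL = "https://example.org/optimade/v1/structures"


def feature_query_url(feature: str, base_url: str = BASE_URL) -> str:
    """Build a structures query selecting entries flagged with `feature`."""
    # HAS tests membership in a list-valued property such as structure_features.
    optimade_filter = f'structure_features HAS "{feature}"'
    # quote_via=quote percent-encodes spaces as %20 instead of "+".
    query = urlencode({"filter": optimade_filter}, quote_via=quote)
    return f"{base_url}?{query}"


# "theoretical" is the value proposed in this thread, not a standard one.
url = feature_query_url("theoretical")
print(url)
```

A database that does not know the value would simply return no matches, which is one reason the thread goes on to consider a dedicated property instead.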

I think we will need to have both theoretical and experimental. That way we can implicitly also have the case where it is undefined, which may be useful for old entries for which it was not recorded whether it is a theoretical or experimental structure, or simply for databases that have not been updated.

What you are proposing does stretch the meaning of the structure_features field, which is defined as: A list of strings that flag which special features are used by the structure.

Perhaps this is also a good moment to think about how we want to include more detailed information about how the structure was generated. Especially information that would be interesting for ~~generating~~ querying structures. X-ray scattering or neutron scattering, ab initio calculations. The software package that was used. etc.

JPBergsma avatar May 31 '22 11:05 JPBergsma
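The three-way reading described above (theoretical, experimental, or undefined when neither flag is present) can be sketched as a small helper. This assumes the two hypothetical flag values "theoretical" and "experimental"; treating an entry that carries both as an error is also an assumption, since the thread does not settle that case.

```python
from typing import Optional, Sequence


def provenance_from_features(structure_features: Sequence[str]) -> Optional[str]:
    """Return "theoretical", "experimental", or None when neither flag is set."""
    has_theoretical = "theoretical" in structure_features
    has_experimental = "experimental" in structure_features
    if has_theoretical and has_experimental:
        # Assumption: the two flags are mutually exclusive.
        raise ValueError("entry is flagged as both theoretical and experimental")
    if has_theoretical:
        return "theoretical"
    if has_experimental:
        return "experimental"
    # Undefined: e.g. an old entry or a database that has not been updated.
    return None


print(provenance_from_features(["disorder", "theoretical"]))
print(provenance_from_features(["disorder"]))
```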

It is definitely a property of a data element (one element of the array, as opposed to the overall set of records). I agree that it is not something to add on to some existing "structure features" string. It's more important than that. How about a new key called "nature" within data:

data[i].nature: {"experimental"|"theoretical"}

BobHanson avatar May 31 '22 13:05 BobHanson

Reading the responses from @JPBergsma and @BobHanson, I am now leaning towards a separate property. It could actually provide more information about the origin of a structure. In the COD, we have a CIF data item _cod_struct_determination_method with the following possible values: single crystal, powder diffraction and theoretical. Maybe something similar could be introduced into OPTIMADE.

merkys avatar May 31 '22 13:05 merkys

This sounds great to me. But can you have theoretical PD? Are there two concepts here?

data[i].nature: {"experimental"|"theoretical"}
data[i].method: {"single crystal diffraction"|"powder diffraction"}

BobHanson avatar May 31 '22 13:05 BobHanson

This sounds great to me. But can you have theoretical PD? Are there two concepts here?

data[i].nature: {"experimental"|"theoretical"}
data[i].method: {"single crystal diffraction"|"powder diffraction"}

Right. Then these should be separate properties.

merkys avatar May 31 '22 14:05 merkys

How should we name such a property? Some suggestions:

  • nature
  • origin
  • determination_method.

Personally, nature does not sound immediately clear to me, origin might also be quite ambiguous.

merkys avatar Jun 01 '22 15:06 merkys

Personally, nature does not sound immediately clear to me, origin might also be quite ambiguous.

Yes, I also would not know a good name for this distinction. From the suggestions above I found determination_method the clearest. But perhaps we can also name it simply experimental_or_theoretical .

JPBergsma avatar Jun 01 '22 16:06 JPBergsma

experimental_method?

— Reply to this email directly, view it on GitHub https://github.com/Materials-Consortia/OPTIMADE/issues/406#issuecomment-1143812327, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEHNCW5BWGG7JETSFPHBJBDVM6D7LANCNFSM5XMWYTBA . You are receiving this because you were mentioned.Message ID: @.***>

-- Robert M. Hanson Professor of Chemistry St. Olaf College Northfield, MN http://www.stolaf.edu/people/hansonr

If nature does not answer first what we want, it is better to take what answer we get.

-- Josiah Willard Gibbs, Lecture XXX, Monday, February 5, 1900

We stand on the homelands of the Wahpekute Band of the Dakota Nation. We honor with gratitude the people who have stewarded the land throughout the generations and their ongoing contributions to this region. We acknowledge the ongoing injustices that we have committed against the Dakota Nation, and we wish to interrupt this legacy, beginning with acts of healing and honest storytelling about this place.

BobHanson avatar Jun 01 '22 20:06 BobHanson

@JPBergsma: experimental_or_theoretical will need renaming should more enumerator values be introduced (e.g., mixed or something else).

@BobHanson: "experimental_method": "experimental" sounds slightly wonky to me.

merkys avatar Jun 02 '22 07:06 merkys

Ah, right. This was in reference to

experimental_method: {single crystal diffraction | powder diffraction | ...}
investigation_type: {experimental | theoretical}

brainstorming...

BobHanson avatar Jun 02 '22 11:06 BobHanson
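The two-property idea brainstormed above can be sketched with two independent enumerations. The names investigation_type and experimental_method and their values come from this thread; the class names and the to_attributes helper are hypothetical, and none of this is part of the OPTIMADE specification. Whether a theoretical entry may also carry a method is left open here, as it is in the discussion.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class InvestigationType(Enum):
    EXPERIMENTAL = "experimental"
    THEORETICAL = "theoretical"


class ExperimentalMethod(Enum):
    SINGLE_CRYSTAL_DIFFRACTION = "single crystal diffraction"
    POWDER_DIFFRACTION = "powder diffraction"


@dataclass(frozen=True)
class Provenance:
    """Two separate properties, so one can be known without the other."""

    investigation_type: InvestigationType
    experimental_method: Optional[ExperimentalMethod] = None

    def to_attributes(self) -> dict:
        """Serialize to plain values, using None for an unknown method."""
        method = self.experimental_method
        return {
            "investigation_type": self.investigation_type.value,
            "experimental_method": method.value if method is not None else None,
        }


entry = Provenance(InvestigationType.EXPERIMENTAL, ExperimentalMethod.POWDER_DIFFRACTION)
print(entry.to_attributes())
```

Parsing a served value back with InvestigationType("theoretical") raises ValueError for anything outside the enumeration, which is the behaviour a closed vocabulary would need.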

Following the discussion with @sauliusg I can also point out many edge cases where the experimental or theoretical nature is not immediately clear. An example is de-novo crystal structure refinement, see [1], [2], and many more.

blokhin avatar Jun 03 '22 12:06 blokhin

cf. computational experiments vs. experimental modeling

blokhin avatar Jun 03 '22 12:06 blokhin

Not voting for "computational experiment". I understand the desire to consider computational approaches "experiments", but I think this is not well understood.

BobHanson avatar Jun 03 '22 13:06 BobHanson

experimental_or_theoretical will need renaming should more enumerator values be introduced (e.g., mixed or something else).

Yes, you are right, that is not convenient.

How about method_class or method_category ? That allows another field named method that holds a more specific term for the procedure used to generate the data.

JPBergsma avatar Jun 05 '22 14:06 JPBergsma

Indeed, there is a whole spectrum of methods ranging from purely experimental (can we actually get coordinates without any theoretical assumptions?) to purely theoretical. We probably would need a separate ontology just to identify where a structure sits in that spectrum.

merkys avatar Jun 06 '22 12:06 merkys

But for our purposes I suggest not reinventing the wheel or overcomplicating. Go with the ICSD conception here. Keep it simple. Maybe allow for some ambiguous third category, but don't insist that every conceivable possibility is covered.

BobHanson avatar Jun 06 '22 12:06 BobHanson

@BobHanson Is there a link for the said ICSD conception?

merkys avatar Jun 06 '22 12:06 merkys

I understand the desire to add something ASAP to help distinguish experimental and theoretical structural data. However, I'd suggest being careful not to over-design this interface, since it is debatable whether this info even belongs in this endpoint. Going forward, we won't be able to stuff all possible experimental and theoretical details related to a structure into the structure endpoint and, at least for theoretical structures, I believe our consensus is that things like "which method", etc., belong in the calculations endpoint with relationships to "input" and "output" structures.

Hence, I suggest this to just be a simple boolean field: experimental that is defined to be True if, and only if, the structural data, including the atomic coordinates, represented by the structure have been obtained more or less directly out of an experiment and thus the crystal structure reasonably can be understood to have been observed in nature.

The alternative, False, just means that the structure has been obtained some other way. E.g., hypothetical structures through substitutions (perhaps including DFT relaxations, etc., but not necessarily), structure prediction algorithms, just random initialization, etc. No guarantees that these structures "make sense".

(Or is there a very strong desire to also distinguish theoretical structures that the database strongly believes are at, or very close, to the convex hull of stability? This, I believe, is the ICSD criterion for inclusion.)

rartino avatar Jun 07 '22 08:06 rartino

Unfortunately, it starts to get complicated here. Imagine we took an experimental structure and relaxed it fully with DFT, ending up with a different cell, symmetry, atomic positions, etc. Is the structure still experimental?

blokhin avatar Jun 07 '22 11:06 blokhin

@blokhin

Unfortunately, it starts to get complicated here. Imagine we took an experimental structure and relaxed it fully with DFT, ending up with a different cell, symmetry, atomic positions, etc. Is the structure still experimental?

It was my intent to mostly avoid this complexity by a single stringent definition separating everything into "directly from experiment" vs. other. My definition above was meant to say that your example is not an experimental structure.

rartino avatar Jun 07 '22 11:06 rartino
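The stringent definition above can be sketched as a predicate: the flag is True only when the coordinates come directly from an experiment and no computational step has altered them afterwards. The step labels and the function signature are hypothetical and serve only to illustrate the rule; the thread does not define how a database would record such steps.

```python
from typing import Sequence

# Hypothetical labels for ways a structure can be obtained or altered by
# computation; any of these makes the entry non-experimental under the
# strict definition.
COMPUTATIONAL_STEPS = frozenset(
    {"dft_relaxation", "substitution", "structure_prediction", "random_initialization"}
)


def is_experimental(from_experiment: bool, later_steps: Sequence[str] = ()) -> bool:
    """True only for data taken more or less directly from an experiment."""
    altered = any(step in COMPUTATIONAL_STEPS for step in later_steps)
    return from_experiment and not altered


# The edge case raised above: an experimental structure fully relaxed with DFT.
print(is_experimental(True, ["dft_relaxation"]))
print(is_experimental(True))
```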

Following the discussion with @sauliusg I can also point out many edge cases where the experimental or theoretical nature is not immediately clear. An example is de-novo crystal structure refinement, see [1], [2], and many more.

I think my suggestion was for actual vs hypothetical which maybe makes this slightly clearer (though shifts the vagueness elsewhere, e.g. whether a DFT database that simply took experimental structures and calculated band gaps without relaxing should report itself as hypothetical or actual).

The two relevant axes for filtering seem to me to be whether something has actually been made, and whether the structure is simply the result of minimising or sampling a Hamiltonian.

ml-evs avatar Jun 07 '22 16:06 ml-evs

Sorry -- that ICSD paper reference: https://journals.iucr.org/j/issues/2019/05/00/in5024/index.html and supporting information

Noting that there is a discussion of this in matsci.org https://matsci.org/t/how-is-the-theoretical-tag-determined/3527

So perhaps the boolean "theoretical" is appropriate (matching ICSD). But this post does point out the same issue: it is not always possible to distinguish. I think one would just have to trust repositories to do their best job here. AFLOW could (perhaps?) distinguish their ICSD entries (which are presumably NOT theoretical) from their calculations. @.*** (Cormac)

I do feel strongly that there MUST be some sort of flag regarding this. Serving up purely calculated structures is not the same as delivering x-ray crystallographic results. This is a widespread, growing issue throughout the data world. My recommendation: keep it simple.

Bob

BobHanson avatar Jun 08 '22 04:06 BobHanson

Having read the discussion, I tend to agree with those of you favoring a single boolean flag. The question now is where to draw the line. However, neither the ICSD paper nor the related discussion on matsci.org provides clear criteria (thanks @BobHanson for the links, though). @vaitkus, has the IUCr perhaps published any criteria?

I am a bit skeptical regarding the structures relationships with calculations though. In TCOD we have theoretically calculated structures from journal publications, but usually machine-readable metadata related to actual calculations is scarce (but reported in human-readable publications). Thus if calculations entries become mandatory for theoretical structures, we would not be able to return much meaningful data in them.

merkys avatar Jun 09 '22 06:06 merkys

@merkys, as far as I know, the IUCr does not have any such criteria.

However, the ICSD paper lists three types of subclasses of theoretical structures:

  • Predicted (non-existing) crystal structure.
  • Optimized (existing) crystal structure.
  • Combination of theoretical and experimental structure.

Based on this, I would say that according to them anything that is not purely experimental is classified as theoretical.

vaitkus avatar Jun 27 '22 11:06 vaitkus
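The reading of the ICSD subclasses given above, where anything that is not purely experimental counts as theoretical, can be sketched as a mapping from a finer-grained origin to the single boolean. The subclass descriptions are taken from the list above; the enum and function names are hypothetical.

```python
from enum import Enum


class StructureOrigin(Enum):
    EXPERIMENTAL = "experimental structure"
    PREDICTED = "predicted (non-existing) crystal structure"
    OPTIMIZED = "optimized (existing) crystal structure"
    COMBINED = "combination of theoretical and experimental structure"


def is_theoretical(origin: StructureOrigin) -> bool:
    """Everything except a purely experimental structure is theoretical."""
    return origin is not StructureOrigin.EXPERIMENTAL


for origin in StructureOrigin:
    print(origin.name, is_theoretical(origin))
```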

Based on this, I would say that according to them anything that is not purely experimental is classified as theoretical.

I think there might be difficulties in drawing the line between refinement with statistical potentials, forcefields and DFT.

merkys avatar Jun 27 '22 12:06 merkys

@merkys

I am a bit skeptical regarding the structures relationships with calculations though. In TCOD we have theoretically calculated structures from journal publications, but usually machine-readable metadata related to actual calculations is scarce (but reported in human-readable publications). Thus if calculations entries become mandatory for theoretical structures, we would not be able to return much meaningful data in them.

I don't think anyone proposed to make them mandatory for theoretical entries? Just that if you have data or metadata related to the calculation itself for, say, a calculation that started from one structure and resulted in a couple of output structures, that data would better belong under the calculations endpoint than being stored alongside the structures. For example, if you want to provide details on cutoffs, k-point grids, DFT functionals, etc.

rartino avatar Jun 28 '22 07:06 rartino

I don't think anyone proposed to make them mandatory for theoretical entries?

No. I mistakenly assumed this was the suggested solution for telling experimental structures from theoretical.

Just that if you have data or metadata related to the calculation itself for, say, a calculation that started from one structure and resulted in a couple of output structures, that data would better belong under the calculations endpoint than being stored alongside the structures. For example, if you want to provide details on cutoffs, k-point grids, DFT functionals, etc.

Agree.

merkys avatar Jun 28 '22 09:06 merkys
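The optional link agreed on above can be sketched as a structures entry carrying a JSON:API-style relationship to a calculations entry, with a helper that tolerates entries without one (as would be the case for structures taken from publications with little machine-readable metadata). The identifiers are made up, and nothing here prescribes what a calculations entry must contain.

```python
# Hypothetical structures entry; the ids are placeholders.
structure_entry = {
    "type": "structures",
    "id": "example-structure-1",
    "attributes": {"chemical_formula_reduced": "ClNa"},
    "relationships": {
        "calculations": {
            "data": [{"type": "calculations", "id": "example-calc-1"}]
        }
    },
}


def related_calculation_ids(entry: dict) -> list:
    """Return ids of linked calculations, or an empty list if none are given."""
    relationship = entry.get("relationships", {}).get("calculations", {})
    return [item["id"] for item in relationship.get("data", [])]


print(related_calculation_ids(structure_entry))
print(related_calculation_ids({"type": "structures", "id": "example-structure-2"}))
```

Details such as cutoffs, k-point grids, or functionals would then live on the linked calculations entry rather than on the structure.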

Q1: Are theoretical and experimental the correct two options?

I suggest yes:

There is a paper from ICSD: Recent developments in the Inorganic Crystal Structure Database: theoretical crystal structure data and related features http://scripts.iucr.org/cgi-bin/paper?in5024, where, for example, we see:

In order to be included in the ICSD, a theoretical structure has to be fully characterized, the atomic coordinates determined and the composition fully specified, similarly to *experimental structures*.

Table 1 Comparison of databases containing experimental and/or theoretical crystal structures (14 uses of "experimental structure") (26 uses of "theoretical structure")

So, I argue, these are the terms to use.

As for

xxx_yyy = { experimental | theoretical }

I suggest NOT using "structure_type" as that actually means something different.

Maybe "determination_type"

"experimentally determined structure" Google 20,000 hits.

admittedly,

"theoretically determined crystal structure" has only 3 hits. So maybe that is a bit of a problem.

Next idea?

BobHanson avatar Oct 11 '22 09:10 BobHanson