aiida-common-workflows icon indicating copy to clipboard operation
aiida-common-workflows copied to clipboard

Enriched StructureData or similar

Open espenfl opened this issue 3 years ago • 23 comments

For many codes it makes sense to have some kind of data structure that is a bit more general than StructureData. Say to house selective dynamics flags etc. We should possibly discuss this and try to unify the approach from the start. Would it make sense to just make a new data type that is based on StructureData to add a few more common attributes?

Of course, one could argue that e.g. selective dynamic flags are not really "the structure". And it would be sufficient to have some kind of other code specific attributes linked to the structure being computed etc. Regardless of approach, we should try to agree on how we handle these extra attributes.

How do you do it in your code and how do you think it should be done on a general basis?

espenfl avatar Aug 03 '20 10:08 espenfl

So I guess we have a few options:

    1. Define a new class that i based on StructureData, say CodeStructureData:
    • pros: an extension of StructureData so natural, keeps compatibility with the current and future StructureData.
    • cons: yet another node, have to make sure StructureData is used in input/output
    1. Define new inputs on the calculation:
    • pros: easy, we do not touch StructureData
    • cons: if we decide to do this with all kinds of special parameters and attributes outside parameters (say regular input to the code) it can become a bit messy
    1. Define it in settings which is supplied to the calculation:
    • pros: as above
    • cons: maybe settings should not be used for problem specific settings?

Right now, maybe the second option would be the best approach. At least for VASP, the number of parameters that are used outside the regular input file is rather small, so I am not too worried about the input space getting too complicated. But this can be very different for other codes. Of course, for workchains, this is not a problem as we would typically set such flags on the input space of the relax or MD workchain.

espenfl avatar Aug 03 '20 11:08 espenfl

Why not increase the capabilities of the StructureData class? A StructureData node that contains all information relating to single atomic sites would be more intuitive and convenient in workchains where structures are inherited from previous calculations.

Fixed positions and (starting or final) magnetization in ASE are stored in the .traj file of ASE so the conversion from ase to StructureData would still create only one node, instead of a StructureData and some Dicts..

c-cppx avatar Aug 04 '20 14:08 c-cppx

It's an interesting point. First off, I remember that the StructureData type was originally kept very minimal because its scope is not to be a comprehensive structure manager. For that, the use of ASE or Pymatgen structures is suggested. However it's fair to discuss which structure properties is useful to have stored in the database, can they be imported directly from ASE and Pymatgen structures? Are there code dependent properties as well? We could start listing the features we would like to add:

  • FixedAtom flag for each site
  • Magnetic properties
  • ....

Once we have an idea, I think it will be easier to decide which way to take for the development.

bosonie avatar Aug 05 '20 08:08 bosonie

@c-cppx This we could do, but that would be something else and uses for instance StructureData as a base. Most likely we would not like to increase the details of StructureData. It should also work for codes that do not know what selective dynamics is. Furthermore, the term Structure pertains also to a specific structure and not how you arrived at it, so I believe the selective flags should be something separate, possibly stored in an overarching structure class. Before implementing, we should try to decide this across the different plugins to not break backwards compatibility.

@bosonie I support keeping StructureData lightweight. So we would either need a wrapper class or to pass them in using a different approach. Programmatically, maybe the cleanest would be to use a wrapper in the codes that support this functionality. Then this wrapper could be shared among the subset. But this would create a bit of overhead when we want to input StructureData and return it.

With respect to the trajectories, it does not look like TrajectoryData currently support something like this. Since this is also based on StructureData it would probably make sense to make TrajectoryData contain the wrapper as well if the additional properties are needed for the trajectories as well. Not sure what happens to the selective flags if we convert from ASE/pymatgen, but the code would reveal it.

For every case, basically, we give every position (or group of it) some property. This property can be some quantity. In fact, the StructureData is again composed of kinds, cell, positions etc. What we for instance do in the VASP parser is that we also use a class for positions, where the selective tags are attributes. Ideally, we should maybe be able to do something similar. So, possibly, if say Position class was used in StructureData we could override that in the wrapper. And then we would need something that only returns an attribute free Position when that is needed. Of course, there are many ways to do this, I am currently not sure what is the best approach. More/further comments, ideas are greatly appreciated.

@giovannipizzi have you discussed how to handle position related properties/attributes?

espenfl avatar Aug 05 '20 09:08 espenfl

There are a few things I think we should keep in mind. The StructureData class is intended and designed to store the information necessary to fully define a solid state crystal structure (it can be used for molecules, but it is not ideal for that). As such, I don't think we should start adding information that are not part of its structure definition. This, therefore, would include selective dynamic flags on the atomic positions, since these are not properties intrinsic to the structure, but rather restrictions that are imposed on top of it for the purposes of some calculation. Especially in AiiDA, this distinction of intrinsic properties versus external properties for data types is an important one when it comes to the provenance graph.

For example, imagine we were to include the selective dynamics as attributes of the StructureData class. If you have an instance of let's say crystalline silicon with no movement restriction, but then want to perform a calculation with one atom fully restrained, you would have to create a new StructureData instance, even going through a calcfunction if you even want to properly keep the provenance. However, the crystal structure really is the same. It is still crystalline silicon. I therefore feel that, within AiiDA, storing these external properties on the data type itself is the wrong solution. This stems from the fact that the data types are not merely useful classes that wrap all sorts of functionality, but they are first and foremost data containers that persist data between its transformations. This is very different from other libraries such as pymatgen and ase where the structure classes are also a data container, but are focused mostly on providing utilities to transform the data. I think we should be wary of this stark difference between AiiDA data classes and those of other libraries.

So I would also be in favor of defining some way that works on top of the StructureData class to define extrinsic properties interesting for calculations. Here we can of course make sure that the various plugins follow the same conventions and can maybe even use the same utilities to work with them.

sphuber avatar Aug 05 '20 10:08 sphuber

For example, imagine we were to include the selective dynamics as attributes of the StructureData class. If you have an instance of let's say crystalline silicon with no movement restriction, but then want to perform a calculation with one atom fully restrained, you would have to create a new StructureData instance, even going through a calcfunction if you even want to properly keep the provenance. However, the crystal structure really is the same. It is still crystalline silicon. I therefore feel that, within AiiDA, storing these external properties on the data type itself is the wrong solution. This stems from the fact that the data types are not merely useful classes that wrap all sorts of functionality, but they are first and foremost data containers that persist data between its transformations. This is very different from other libraries such as pymatgen and ase where the structure classes are also a data container, but are focused mostly on providing utilities to transform the data. I think we should be wary of this stark difference between AiiDA data classes and those of other libraries.

Yes I agree on this and the fact that we should try to leave StructureData as is and lightweight as I mentioned before. Seems we are leaning towards a wrapped extension on top of StructureData.

espenfl avatar Aug 05 '20 10:08 espenfl

Do anyone have any suggestions on how they would have liked to see this? Also, should we keep this at plugin level (and try to coordinate as best as we can), or should we introduce a plugin for this data class? I support the latter and then we can see if it makes sense to absorb it in aiida-core after a while. Comments?

espenfl avatar Aug 21 '20 08:08 espenfl

or should we introduce a plugin for this data class

As I explained in my previous post, I am not sure this should become a data plugin. Since the selective dynamics flags would be part of its attributes, they would be immutable. This means that each time you want to just change one of the flags, you would have to create an entire new node. This is a bit of a downside of AiiDA's data model.

However, I can see how one centralized class to implement the logic that can be reused by multiple plugins would be useful though. Whatever it will be, I wouldn't put it in aiida-core since anyway at some point domain specific code, including StructureData will be moved to a plugin.

With those considerations, I think we have two options:

  • Either just come up with a convention of how to specify selective dynamics in code inputs and try to apply this in all relevant plugins
  • Create a shared lower lying plugin (e.g. aiida-condensed-matter or something) that contains some utility classes (not data plugins) to help with defining these kinds of inputs and manipulating them. All plugins can then have this as a dependency and use the same code.

To be honest, at this point, I am not sure if the relative simplicity of selective dynamics does not make the latter over kill. Maybe what would help is to have a tentative example implementation to see what functionality it could provide and how much deduplication it could prevent.

sphuber avatar Aug 21 '20 10:08 sphuber

As I explained in my previous post, I am not sure this should become a data plugin. Since the selective dynamics flags would be part of its attributes, they would be immutable. This means that each time you want to just change one of the flags, you would have to create an entire new node. This is a bit of a downside of AiiDA's data model.

Yes. What I was thinking, which could be useful for other things as well is to be able to flavour a node with something that is a bit more dynamic. E.g. we put selective flavour on the StructureData in some standardized way. Thus, when the selective stuff updates, we only update the flavour (which can span several StructureData). In practice this would be another node I guess. Do similar concepts already exist?

However, I can see how one centralized class to implement the logic that can be reused by multiple plugins would be useful though. Whatever it will be, I wouldn't put it in aiida-core since anyway at some point domain specific code, including StructureData will be moved to a plugin.

* Create a shared lower lying plugin (e.g. `aiida-condensed-matter` or something) that contains some utility classes (not data plugins) to help with defining these kinds of inputs and manipulating them. All plugins can then have this as a dependency and use the same code.

Yes. This also goes for some of the other data structures that are already in aiida-core in my view, but that is something for the future. We certainly should avoid adding more domain specific stuff to it. Maybe it would make sense when we address this to also try to introduce an overarching plugin that houses some domain specific data structures. That way we can play with the concept a bit.

I will give it a try and then we can go from there.

espenfl avatar Aug 21 '20 10:08 espenfl

I think another relevant point is that we for instance would like to store this information for later. Say if we read a previous VASP calculation where selective dynamics have been performed. Then, we use that structure to do all kinds of things. Two years down the line we would like to check the selective flags. Then the input files might be gone (e.g. if retrieve_temporary_files is used) and you have no idea. So, yes, formally, the selective flags are not part of the structure and it would be nice not to make it part of StructureData. One option is to set these selective flags on the input of the immigrator calculation? But if we just want to manipulate it we would have to store it somewhere which is not on the main calculation class (we could store it on the calcfunction or something like that, but seems a bit of a hack?). With a new structure, say DynamicStructureData or whatever we want to call it, we could add such attributes and flavor StructureData. Would be somehow like TrajectoryData, but in a different context.

espenfl avatar Sep 07 '20 09:09 espenfl

I think that getting the information from the selective dynamics is quite important as sufraces, multilayers, etc, one really wants to constraint the relaxation in certain directions, and if one does not store this information somewhere it can be impossible (or very hard to know) how this structure was obtained.

I think that a derived structure could be a good idea, where more condensed matter information can be added via methods. E.g. constrains like it is done in ase, as well as information such as short-range order (if you cell comes from a SQS generation), magnetic configuration, etc. This information could then be very beneficial, either by providing information on how a calculation should be done (in the case of the constraints or magnetic configuration) or for querying purposes (such as the SRO).

I have been looking at how the TrajectoryData has been implemented, and something similar could be done with custom methods to handle these quantities as suggested by @espenfl .

The question is then more of design, should this be the approach used or use a convention like @sphuber suggests. This alternative is way simpler, but the problem with standards is that well everyone has one.

I'm willing to help with any option, just would like to ask for your opinion first.

JPchico avatar Sep 07 '20 19:09 JPchico

Thanks for you comments @JPchico. Yes, we need this for sure. It is just not in place yet as we at that point, when writing the parser and interface for the structure in VASP did not know how to approach this and wanted to postpone its decision. Now, we need to decide.

I would suggest that we first come up with a prototype of how this would look like. We might as well just put it into aiida-vasp and then others can follow. It is also about defining the standard. Also, this was the reason why I wanted to air this on the common-workflows so that we at least have some gauge on the general interest and views on this.

@sphuber Is there are reason not to follow along the line of TrajectoryData(StructureData) at the prototyping stage?

I can foresee that we most likely do want to be rather sure how to implement this in a proper release of aiida-vasp in order not to break backwards compatibility once people start to build up for production. As such, prototyping would most likely stay in the develop, or a separate branch until we figure this out. Hope this is okey for you Jonathan and any others that find this interesting.

espenfl avatar Sep 07 '20 20:09 espenfl

I think it is very important as @shuber has pointed out to separate intrinsic immutable properties from extrinsic properties which describe rather 'how something' was generated or what should be done with it. Therefore, I would not change StructureData. @espen: For now, one could use the extras of nodes for this. Which means if you have run selective dynamics you could store that info in the extras of the output structure that it was generated that way (also it is in the provenance graph) but one often wants to 'duplicate' and condense information.

Overall, in the long term there should be something like aiida-condensed-matter, containing further utils and data classes.

In the fleur case we have a data class for fleurinput and there we specific the restricted dynamics. (So it is solved in our case on a plugin level with a data plugin which also solves other issues)

Symmetry breaking, different electronic setups are a similar issues. In general the problem is how to best specify additional 'properties' of Kinds in a StructureData node.

broeder-j avatar Oct 05 '20 09:10 broeder-j

@broeder-j that is a clever solution, that is something that indeed can be used to solve the issue right now.

I think though that a aiida-condensed-matter would be very good, since there is quite a bit of extra information that one would like to have stored with the structure, that goes beyond what is currently supported. Of course one would need to see what kind of information and methods would be good to have as to ensure its adoption in as many plugins as possible

JPchico avatar Oct 05 '20 09:10 JPchico

@broeder-j Yes, I agree with this and this and is in line with the discussion up until now. In aiida-vasp we have made a choice not to introduce such a generalization of the class in order to be as true to AiiDA as possible. Your suggestion of using extras is nice and as @JPchico said, it is something that makes it possible to solve the issue now. However, for the release etc. we should be careful introducing such functionality due to backwards compatibility. What we could do is to make a wrapper function to return such properties and then we just change the backend when we have made a generalization.

I think it would be nice if someone made a prototype GeneralStructureData(StructureData) or similar and then we can perturb it a bit in order to arrive at something that is useful across several plugins. Unfortunately I am a bit pressed on time and we have a few other issues that needs to be resolved as well. If someone else would like to pursue such a prototype that would be greatly appreciated. I think it would make total sense to house it in this repo for now.

espenfl avatar Oct 05 '20 10:10 espenfl

@espenfl My plan is to put sometime into the GeneralStructureData or something similar this week. I was planning to do it before, but I had other tasks that needed to be prioritized. But I think that I should be able to finish those today and then I can dedicate some time to prototyping

JPchico avatar Oct 05 '20 10:10 JPchico

@JPchico Awesome, looking forward to it. Thanks.

espenfl avatar Oct 05 '20 10:10 espenfl

So I finally had sometime to dedicate to this. I was thinking that an approach like the one used in the TrajectoryData seems to be a way in which one can set site dependent properties in a relatively easy manner. Though I have to admit that it was a bit unclear if this was the preferred option (@sphuber, @broeder-j ?).

My guess is that one could use the set_attribute method as it is done for the symbols for the site dependent properties which are not included in the StructureData.

Something like this

class EnrichedStructure(StructureData):
    def __init__(self, structure=None, **kwargs):
         super().__init__(**kwargs)
         if structure is not None:
             self.set_structure(structure)

     def set_structure(self, structure, **kwargs):
 
         selective_dynamics = kwargs.pop('selective_dynamics', [['T','T','T'] for site in structure.sites])
 
         self.set_attribute('selective_dynamics', selective_dynamics)
         for site in structure.sites:
             kind = structure.get_kind(site.kind_name)
             self.append_atom(position=site.position, symbols=kind.symbols, weights=kind.weights)

JPchico avatar Oct 21 '20 07:10 JPchico

Thanks a lot @JPchico.

I think we did not settle on a preferred option yet. So here is a recap on where my thoughts are at this point. They might not be aligned with the rest, so please correct. What we all agree on is that the basic structure dataset should only be a structure and nothing else. Then there is consensus regarding the fact that for instance selective dynamics is usually a way of arriving at the structure. The question then becomes, where this should be stored. Intuitively it would possibly make sense to store this on the calculation or something similar to that. However, things become a bit more blurred when we for instance want to load a structure from another calculation where we do not have the provenance. E.g. say directly from an existing structure file. Then it would be rather easy to use an example like above to keep track of this. Otherwise the parser would have to have dangling attributes that needs to be attached to something. It would for instance not make sense to attach these the calculation you want to launch, as it is really a property of a previous calculation (but of course, it can also be a part of the new calculation). It would make sense to attach it to the structure and this would also make it easier for the parser. What is left then is possibly to agree on how interconnected these two should be. A solution like the one above would be rather easy to implement, but would also need a recreation of a node. If we had a container that we only linked to the structure, it would not require the creation of a new node if, say we already created a structure node from a calculation etc.

An alternative, which is maybe the most correct is to not allow to load position files directly (if you want to handle anything more complex than a simple structure), but require that they are immigrated (which can be regarded as a calculation sort of) and as such we can attach such properties on the calculation. Then this immigration represent the import source and would thus contain all kinds of statuses that we might want to know later in the provenance graph. Formally I am maybe leaning towards this approach. This also make sense as we either have performed a previous calculation that ejected the position file based on the restrictions for the dynamics so we somehow need to represent "that calculation" and one way of doing this is to immigrate it and create a calculation node. Or we load a clean structure file and want to perform selective dynamics, which we could then set on the calculation we will execute. Finally, the combination is also possible. The problem with this approach is that not all codes have an immigrator yet (we do for VASP, so that is not a problem).

espenfl avatar Oct 21 '20 10:10 espenfl

For your reference, if the immigrator seems useful, please have a look at https://github.com/aiidateam/AEP/pull/12. Also, please consult e.g. the immigrator we have in VASP: https://github.com/aiida-vasp/aiida-vasp/blob/develop/aiida_vasp/calcs/immigrant.py (and references). We basically just override and it would be better to get immigration functionality into AiiDA core.

espenfl avatar Oct 21 '20 10:10 espenfl

Since I just stumbled over this due to similar requirements (linking initial magnetic moments and MKP grid params with a structure): what is the current state?

And wrt extending the StructureData class: I'd rather prefer composition than extension (following the software engineering advice to use inheritance only for specialization). Since we have a graph it could be done by having an ExtraStructureData type which has to be linked to a StructureData and then specialize this ExtraStructureData for the different use cases (codes most likely). When running a workchain (or a calculation) there can be a lookup for such linked extra data, which then gets used and linked explicitly. The only extension for the StructureData would then be a lookup helper to easily get to those extra nodes (if available). Would that sound feasible?

dev-zero avatar Mar 19 '21 10:03 dev-zero

If you mean by "linking an ExtraStructureData" to actually have that be a separate Data node and linked that to the StructureData, then that is currently not possible since links from Data to Data are not allowed. There have been talks a long time ago to extend the set of links that are possible, but this will take a significant amount of work in aiida-core.

sphuber avatar Mar 19 '21 10:03 sphuber

@sphuber yes, that's what I meant. A bit unfortunate since the inheritance model can become quiet messy, on the other hand having more separate data appearing out of thin air (rather than as a result of a process) is suboptimal for provenance I guess.

dev-zero avatar Mar 19 '21 10:03 dev-zero