atomate2 icon indicating copy to clipboard operation
atomate2 copied to clipboard

Standardize approach for postprocessing data: use builders?

Open mattmcdermott opened this issue 2 years ago • 4 comments

A point of confusion when writing a new Flow seems to be how to process the output data, particularly when it involves outputs from multiple jobs. Should we put postprocessing code in a job? Or should we use builders? Currently, there is only one builder in atomate2 (this is the ElasticBuilder):

https://github.com/materialsproject/atomate2/blob/98bbc0e0715d083866fd632babfcb89594dcbfd5/src/atomate2/vasp/builders/elastic.py#L20

IMO we should almost always recommend using builders, which is already the standard for MP (and more robust). I took this strategy in the magnetic ordering workflow (WIP #432), where I implemented a builder under common that can be easily adapted to results from any DFT code. This may be a good design pattern (where possible).

Perhaps we should open this up for discussion?

@computron @utf

mattmcdermott avatar Aug 09 '23 06:08 mattmcdermott

Overview

I think we should support both. Often you want to be able to use the outputs of flow in a subsequent calculation. E.g., the AMSET flow calculates the elastic constant as part of it. To avoid having to run the builders midway during a workflow, the elastic document should be calculated during the flow itself and stored in the database. Afterwards, if you want to recalculate the elastic constants using different settings, or say you calculated some more displacements, then you can run the builder to generate the elastic collection separately.

In my opinion, the cleanest way to do this is to share as much code as possible between the builder and the calculation of properties during the workflow. The design pattern I adopted was to put class methods on the property task documents that construct the document from some input data. The process then works like this:

  1. The workflow knows exactly which documents are needed to construct the property document as it has the full workflow graph. It can then just call the document class method to construct the property.
  2. The builder's job is to decide which calculations are needed to construct the property. This should be all of the logic of the builder, just deciding which calculations to use, extracting the relevant fields from them, and then passing them to the document class method.

Example

As an example, the ElasticTaskDocument has a method from_stresses.

https://github.com/materialsproject/atomate2/blob/36e791a99613b58f7103b0307cf0b5391a98a2b1/src/atomate2/common/schemas/elastic.py#L141-L153

In the elastic workflow, this is called using the outputs of jobs in the workflow. This construction happens in a job:

https://github.com/materialsproject/atomate2/blob/36e791a99613b58f7103b0307cf0b5391a98a2b1/src/atomate2/vasp/jobs/elastic.py#L280-L291

In the elastic builder, the builder collects the relevant documents and then calls the exact same method to construct the document:

https://github.com/materialsproject/atomate2/blob/36e791a99613b58f7103b0307cf0b5391a98a2b1/src/atomate2/vasp/builders/elastic.py#L266-L274

I think this way we get the best of both worlds. Once you have written the task document and the class method, it is very little effort to call this method in the workflow. Creating the builder is more effort but none of it is duplicated, as you can simply reuse the same code for constructing the document. My suggestion would be to implement this approach in the magnetic ordering workflow. I also recommended this be implemented in the ferroelectric workflow too. E.g., see this comment: https://github.com/materialsproject/atomate2/pull/196/files#r1143386198

Documentation

If people agree with this approach, I suggest we add a page in the developer documentation about this design decision. Hopefully, once we have a few examples, this approach will be clearer to people.

Obviously I would appreciate input from other developers as to whether there are downsides or issues with this approach.

utf avatar Aug 14 '23 09:08 utf

Excellent write-up! @utf

I agree -- if the schema construction approach is used, this shouldn't be too much extra work to implement for a developer.

My original hesitation with writing a postprocessing job was how to avoid getting stuck in workflows where failure of at least one job is very common (for example, a very unfavorable transformed structure cannot converge during relaxation). As long as we also have the builder, this provides some extra flexibility :)

mattmcdermott avatar Aug 14 '23 18:08 mattmcdermott

We discussed this briefly in the atomate2 meeting on 8/25/23 but weren't able to finish the discussion. A few notes to follow up on here:

  1. @janosh points out that the example schema for elasticity is reproduced in almost the same way in the emmett repo. Is there a reason to not consolidate these?
  2. I brought up some concerns regarding data size and duplication (i.e. data consistency) if these builder documents are stored both within the Jobflow infrastructure and as part of an analysis collection. For 10 runs this is really not a problem to duplicate, for 10,000 it could be.
  3. I wanted some more details about what job_dirs was and whether we had a better way to track directories when runs occur across multiple machines. Note that the emmett version of this scheme does not have job_dirs.

computron avatar Aug 25 '23 18:08 computron

Just to clarify, I didn't compare in detail but the ElasticityDoc in emmet-core

https://github.com/materialsproject/emmet/blob/4f2c35976d05369f3947014dbfef6baf98e1500f/emmet-core/emmet/core/elasticity.py#L156-L353

looks similar to ElasticDocument in atomate2. So if there is potential for consolidation, we should pursue it in the same spirit as #486

https://github.com/materialsproject/atomate2/blob/36e791a99613b58f7103b0307cf0b5391a98a2b1/src/atomate2/common/schemas/elastic.py#L111-L252

janosh avatar Aug 25 '23 18:08 janosh