aiida-core
Add CLI command to dump inputs/outputs of `CalcJob`/`WorkChain`
Currently, we have verdi calcjob inputcat/outputcat for CalcJob, but there is no straightforward way to dump the inputs/outputs of a WorkChain to files. Use cases for this include:
- easier debugging of the underlying code (e.g. QE pw.x) using the exact same inputs generated by AiiDA, without working on the Python side
- the files can easily be sent to someone who is not familiar with AiiDA, to reproduce/analyze results
Here I added an example CLI command verdi workchain inputsave, which recursively iterates through the called descendants of a workchain and saves the input file of each called CalcJob into a folder. For example, the structure of the saved folder for a Wannier90BandsWorkChain is:
├── 1-scf
│ └── 1-iteration_01
│ └── aiida.in
├── 2-nscf
│ └── 1-iteration_01
│ └── aiida.in
├── 3-wannier90_pp
│ └── 1-iteration_01
│ └── aiida.win
├── 4-pw2wannier90
│ └── 1-iteration_01
│ └── aiida.in
└── 5-wannier90
└── 1-iteration_01
└── aiida.win
The current PR is working, but there are still plenty of tasks to be done to make it generic enough.
TODO
- decide which namespace to use for the command: should it be under verdi calcjob, a new verdi workchain, or something else that combines both calcjob and workchain (since the command now works for both CalcJob and WorkChain)?
- the folder structure: should the uuid be included in the folder names? Currently I am using link labels
Comments from Giovanni:
- add a README or Metadata.txt or similar in each folder, with info e.g. on time, uuid, ... (also to help AiiDA users find the nodes again)
- have a way to also dump any file from input links (e.g. have subfolders inputnode/pseudo_Ba, inputnode/pseudo_Ti, ...)
Further optional TODO
In addition, there are several other convenience CLI commands I often use; they live here: https://github.com/aiidateam/aiida-wannier90-workflows/blob/main/src/aiida_wannier90_workflows/cli/node.py
- inputcat/outputcat for a workchain, which by default shows the last calcjob and allows the user to specify a link_label to go to the corresponding calcjob (a rough sketch of this idea is given below)
- cleanworkdir for a workchain, which cleans up all the nested calcjobs of a workchain
- gotocomputer for a workchain / RemoteData / RemoteStashFolderData
It would be great if someone versed in aiida-core could use these as inspiration and write some generic CLI commands that cover these usages. Thanks!
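As an illustration of the first item, a minimal sketch of what "inputcat for a workchain" could look like, assuming the "last calcjob" is simply the most recent CalcJobNode among the called descendants and that its plugin defines the input_filename option (the pk is purely hypothetical):
```python
from aiida.orm import CalcJobNode, load_node

workchain = load_node(1234)  # hypothetical pk of a WorkChainNode
calcjobs = [node for node in workchain.called_descendants if isinstance(node, CalcJobNode)]
last = max(calcjobs, key=lambda node: node.ctime)  # most recently created CalcJobNode
# Print the main input file; assumes the plugin sets the `input_filename` option.
print(last.base.repository.get_object_content(last.get_option('input_filename')))
```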
As a note, I assigned this to Ali and Julian (they are not yet in the project so I cannot tag them), and @eimrek was involved in the discussions with @qiaojunfeng
Thanks @qiaojunfeng. Interesting idea, but I have some questions about the design. How does the output structure of this command make it easier to rerun the calculations manually? If I understand correctly, the command only dumps the main input file, but that is typically not sufficient as there can be multiple input files for a CalcJob. I feel that the implementation may be a bit tied to your QE use case, where there is a single principal input file. If we were to accept this in aiida-core, we need to make sure there are no assumptions based on a specific code; it should be generally applicable.
That being said, even for QE I am not sure if this is very useful. As is, the generated output structure cannot be directly used to launch the calculations, can it? For example, the pseudopotentials are not written. So a user will have to manually add those (for each subfolder). And what about restart files? For the NSCF, the restart files won't be found and the user will have to manually add those.
What I think the command should really do is copy all input files such that the calculation can be relaunched straight away. The implementation will be a bit more involved though. The total set of input files is only directly available from the remote working directory, but that is not stored permanently by AiiDA and can disappear, so we cannot rely on it. Some input files are copied to the repository of the CalcJobNode, which can easily be dumped using verdi node repo dump, but that also doesn't contain all files. Some of the written input files can be excluded by the CalcJob plugin through the provenance_exclude_list, and other files are written only to the remote working directory by the daemon, for example files specified through FolderData, RemoteData and SinglefileData nodes. So to reconstruct the complete set of input files at an arbitrary path, the command would really have to call CalcJob.presubmit followed by aiida.engine.daemon.execmanager.upload_calculation. This would therefore require some refactoring in aiida-core to make this nicely doable.
That all being said, in some cases we may not want to copy all input files. Taking QE as an example, if you want to dump the inputs for an NSCF calculation, the complete dump would include the restart files, but these can get very big and so in some cases the user may not want to dump these. So you would have to add options to exclude these.
I can see how this command is very useful for workchains of QE calculations, but in order to justify adding this to aiida-core, we have to think hard about how it can be useful for the generic use case as well. Otherwise it might just be better to keep it in the code-specific plugin package.
Hi @sphuber - indeed we have to make it generic.
As you see in one of my comments that @qiaojunfeng reported, I had thought about this. One idea was to report the raw files of any direct input and link them with the correct input link name, so e.g. for QE you would have a subfolder (per calcjob) raw_inputs/, and inside it, for instance, pseudo_Ba/Ba.UPF, pseudo_Ti/Ti.UPF, ...
Then we also considered that it might indeed be better to just run the presubmit, essentially running a dry run; this should already do essentially everything (also putting the pseudos in place, and leaving information on which files should be copied directly on the remote for restarts).
Note that the goal here is not that one should necessarily be able to just run again in the same folder and everything works. (Of course, if we achieve this, this is much better). The goal is that an experienced user (not of AiiDA, but of the underlying code - in our case, this was triggered by a developer of CP2K not interested in learning AiiDA) can inspect the raw inputs and outputs, to see what was run. It's OK in my opinion to give minimal "human-readable" instructions to them (possibly plugin specific) on what to do if they really want to rerun (things like: "for each folder, move the pseudos from folder XX to folder YY and rename it accordingly; if it's a restart, you first have to copy this folder from the parent to the restart folder, where "this folder" is written in e.g. a 'calcinfo.json' or similar file; etc.). As long as these are short and minimal, a code developer will understand what to do (possibly even without).
Again, if we can really make this "just runnable" (at least without restarts) that's OK, but I don't think it's crucial for the main goal of exposing the raw files to interested people (who don't want to dig into how AiiDA works).
The PR itself needs some cleanup indeed. The suggestion is that @khsrali and @GeigerJ2 work on this in the next couple of weeks, and then we discuss if this is general enough. But of course further comments are welcome also to help them fix this PR.
Hi @qiaojunfeng, thank you for this contribution! I think this is a very useful feature for people that are starting with AiiDA as it eases the transition from the directory tree to a more AiiDA-centric approach. I made some changes to the code locally. Is it OK for you if I commit them and continue working directly on top of this PR?
During development, I used a simple, exemplary PwBandsWorkChain (I'll also look at some aiida-vasp workchains in the future). Just to mention the three main points that came up:
- The call to get_option() in https://github.com/aiidateam/aiida-core/blob/1410179d0d6021aacb64d43da77bdc3341590de3/src/aiida/cmdline/commands/cmd_workchain.py#L53 failed for the create_kpoints_from_distance step, as the resulting CalcFunctionNode does not implement that method. Thus, I filtered the list of links via links = [link for link in links if not isinstance(link.node, orm.CalcFunctionNode)] to remove those. While this could also be resolved by checking the node type of the links in the _get_input_filename function, in the current implementation a subdirectory is still created for these CalcFunctionNodes, even though there are no SinglefileData objects to be written to disk (though, as we'll continue working on it, this point will probably not be relevant).
- For the steps for which the link_labels were not specifically set in the plugin, I'm using the process_labels to get a more descriptive directory tree, e.g.:
  wc-387
  ├── 1-relax
  │   ├── 1-PwBaseWorkChain
  │   │   └── 1-PwCalculation
  │   └── 2-PwBaseWorkChain
  │       └── 1-PwCalculation
  ├── 2-scf
  │   └── 1-PwCalculation
  └── 3-bands
      └── 1-PwCalculation
- I'm dumping the outputs using the copy_tree() method via .outputs.retrieved from the respective CalcJobNode. I was expecting it to be similarly straightforward for multiple input files, but haven't found a way so far (I'll try the suggestion given by @giovannipizzi).
Further, I think dumping the main input/output files for easy inspection vs. dumping everything to be able to fully restart the simulations are two different use cases, and it would already be valuable to have the former. In general, I also think the user should have the option to manually include/exclude additional files.
Apart from the QE-specific assumption of a single input file, the function seems conceptually generic, and I would expect it to work similarly for other workchains, or am I missing something?
And lastly, regarding the namespace, it seems natural to assume that something like verdi workchain could also exist, and probably more functionalities that fit into cmd_workchain will come up over time. Though, in this case, I'd only keep the recursive logic here and make the dumping of input/output files accessible for the CalcJobs themselves. Curious to hear your thoughts!
Thanks for your comments @GeigerJ2 . Just some quick pointers I noticed when reading them:
Thus, I filtered the list of links via: links = [link for link in links if not isinstance(link.node, orm.CalcFunctionNode)] to remove those.
This is indeed a solution, but there might be a better one. Instead of using a "black list" of sorts with known exceptions, you might want to just catch the AttributeError that is thrown when get_option is not defined. This will make the code more robust, as there might be other cases you didn't think of. I know for sure that this one will fail, for example, if there is a WorkFunctionNode.
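To illustrate, a minimal sketch of that approach (the helper name and the assumption that the relevant option is input_filename are mine, not from the PR):
```python
def get_input_filename(node):
    """Return the node's main input filename, or None if the node doesn't define that option."""
    try:
        return node.get_option('input_filename')
    except AttributeError:
        # Raised e.g. for CalcFunctionNode or WorkFunctionNode, which do not have options.
        return None
```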
For the steps for which the link_labels were not specifically set in the plugin, I'm using the process_labels to get a more descriptive directory tree, e.g.:
Maybe a combination of the two would be even better? If you have multiple subsequent calls of the same subprocess, having just the same name prefixed with an integer is also not so informative. If a call_link_label is defined, it means the work chain developer put it there intentionally.
I'm dumping the outputs using the copy_tree() method via .outputs.retrieved from the respective CalcJobNode. I was expecting it would be similarly straightforward for multiple input files, but haven't found a way so far (I'll try the suggestion given by @giovannipizzi).
Input files (at least some of them) of a CalcJobNode are stored in its own repository, so you can get them with the same API, e.g. CalcJobNode.base.repository.copy_tree(). As I mentioned in my previous post, however, this does not include all inputs. Files that were copied using the local_copy_list or remote_copy_list constructs will not be there. Similarly, even files that were copied to the sandbox folder by the plugin but that were listed in the provenance_exclude_list will not be present. For QE, this will typically include the pseudos, so you won't find them in the CalcJobNode's repository. And this brings me back to my earlier point...
Apart from the QE-specific assumption of a single input file, the function seems conceptually generic, and I would expect it to work similarly for other workchains, or am I missing something?
It might "work" as in there won't be any exceptions, but I still have my doubts whether it is really all that useful for the general case. Defining a default input file is not required and maybe not all that common. The current implementation will even raise an exception in that case. It says to specify a path, but that is not implemented and wouldn't even work if you have different CalcJobs in a tree that have different input paths. So I think in most cases you will get only a fraction of the input files, if any at all.
If this command is to be useful, we have to go about it slightly differently. I would suggest we go through the same logic as an actual CalcJob uses when writing its inputs based on the input nodes stored in the database. This functionality is essentially in the upload_calculation function of aiida.engine.daemon.execmanager. Instead of the current implementation, we should try to reuse that code. This will take care of including files that were copied using the local_copy_list, and we could refactor it to ignore the provenance_exclude_list. We would have to refactor the code a bit to allow the remote_copy_list to actually copy from the remote to the localhost. @mbercx has an open PR https://github.com/aiidateam/aiida-core/pull/6196 that already proposes making this logic a bit more permissive. We could think of allowing it to also copy to the localhost.
This solution would require quite some refactoring, but I think this might anyway not be such a bad idea. The code there can be a bit dense. But if we think this might be an interesting approach, I'd be happy to go through it during a call some time and give some pointers and guide you in performing the refactoring.
Hi all, great comments! I also thought of using the same logic that is used in the presubmit step. However, I realize now that this would require the people who want to get the files to install all the plug-ins of the work chains they want to traverse. Think 5 years from now: they might be using a new AiiDA, and the plug-ins might not be installable anymore with that version, might be in a new non-backward-compatible version, etc.
We might decide to give both options, but I think the current approach is general enough to start to be useful. You can still dump a YAML version of the info in the local copy list, remote copy list, etc. (we already dump a JSON of those anyway) and explain to people what it means. And in any case, also the presubmit approach can, for the remote copy list, only explain what needs to be done and not do it.
The design idea we had is to have, optionally, the possibility for a plug-in to provide its custom way to dump files. So if the user has the plug-in installed and the plug-in supports it, the dump will be done in a more plug-in-specific way. But the current suggested approach, to me, is a good default when such a plug-in-specific function does not exist.
In practice, I think the best course is to try to run this function against a few common use cases (QE, Wannier, VASP, Fleur, etc.) and see what we get and what we miss. Concretely, I would do it for the various archives of the recent common workflows paper on verification (that is the actual use case that triggered this PR), and then ask the respective code developers if the resulting folder is clear enough and contains all the info they might need.
However, I realize now that this would require the people who want to get the files to install all the plug-ins of the work chains they want to traverse. Think 5 years from now: they might be using a new AiiDA, and the plug-ins might not be installable anymore with that version, might be in a new non-backward-compatible version, etc.
Good point. It would require the plugin to be installed to do what I am proposing. I can see how that is limiting.
And in any case also the presubmit approach, for the remote copy list, can only explain what needs to be done and not do it.
What I was suggesting was not just calling the presubmit, but also aiida.engine.daemon.execmanager.upload_calculation, which reads the instructions generated by the plugin and copies local and remote files. This would just require some refactoring, but in principle it is perfectly possible. It does still require the actual plugin to be installed, though, as CalcJob.prepare_for_submission is called before that.
In practice, I think that the best is to try to run this function against a few common use cases (qe, wannier, vasp, fleur etc) and see what we get and what we miss. In practice, I would do it for the various archives of the recent common workflows paper on
Note that these are still just use cases that all come from a single domain, namely the original DFT users, and don't necessarily represent the general case. If this is really a use case that stems from these plugins, and especially from the common workflow project, is it maybe not also an idea to consider adding it to the aiida-common-workflows package?
Thank you for your helpful and valuable feedback, @sphuber!
you might want to just catch the AttributeError that is thrown when get_option is not defined.
This was my original solution, but in the initial implementation of @qiaojunfeng it led to the creation of a directory also for those CalcFunction steps of the workchain that did not use/create input/output files.
This will make the code more robust, as there might be other cases you didn't think of. I know for sure that this one will fail, for example, if there is a WorkFunctionNode.
Yes, good point!
Maybe a combination of the two would be even better?
Agreed!
Input files (at least some of them) of a CalcJobNode are stored in its own repository. So...
Just learned something new about the internals :)
I'd be happy to go through it during a call some time and give some pointers and guide you in performing the refactoring.
If you have the time, that would be great!
In practice, I would do it for the various archives of the recent common workflows paper
Thanks, @giovannipizzi for this pointer, that should give me plenty of workchains to try out when working on this.
Note that these are still just use cases that all come from a single domain, namely the original DFT users, and don't necessarily represent the general case.
The general use case should be WorkChains that call CalcJobs that have file I/O on HPC (or locally), no? While the codes given by @giovannipizzi are from the field of DFT, I don't think they are otherwise too specific in that regard? Once we have use cases from the weather and climate community in the SwissTwins project, we can also see how the functionality could be of use there.
Hello everybody,
I took the PR and created a branch from it on my local fork of core to be able to work freely on it, without messing anything else up. I hope this follows proper GitHub etiquette - if not, please let me know :)
I did some changes, and a first proof-of-concept for dumping the main WorkChain files works. Though, I wanted to run it by you before continuing with it. A more in-depth explanation is given in the commit message. Thank you!
Thanks a lot @GeigerJ2
I took the PR and created a branch from it on my local fork of core to be able to work freely on it, without messing anything else up. I hope this follows proper GitHub etiquette
That is perfectly fine because, as you say, the original remains untouched. There is no set rule here really and it depends on the situation. It is possible for you to, when you are done, make a PR to the original branch, which then automatically updates this PR. But if the original author is fine with it, they can also give you direct write access to their branch and update there directly. Or just close the PR and you open a new one. Fully up to you and @qiaojunfeng in this case.
I did some changes, and a first proof-of-concept for dumping the main WorkChain files works. Though, I wanted to run it by you before continuing with it. A more in-depth explanation is given in the commit message. Thank you!
As requested, I will first give some high-level feedback on the design and some common principles we tend to keep in aiida-core.
- For any CLI functionality that we provide, it is good practice to first make it available in the Python API, and then have the CLI command be a very thin wrapper around it. This makes it possible to use the functionality not just from verdi but also directly from Python. The functionality you are adding here could typically be placed somewhere in aiida.tools, which contains more functions of this type that perform some post-processing of provenance data.
- Do we really need explicit CLI/API endpoints for the calcjob inputdump and outputdump? These are essentially just a single method call, CalcJobNode.base.repository.copy_tree and CalcJobNode.outputs.retrieved.copy_tree. I don't think we need specific functions for this in the API, and the CLI already has verdi node repo dump that provides this for both cases.
- Naming: I see there is quite a bit of logic to determine the naming of the directory structure. Would it be useful to keep the naming as close as possible to what other tools would give? I am thinking of verdi node graph generate and verdi process status, which essentially also give an overview of the call stack. It would be useful, I think, if the naming matches between these tools, otherwise it may be difficult to match the various outputs. Would the following format work perhaps: {index}-{call_link_label}-{process_label}, where call_link_label is omitted if it is not defined or is just the default CALL?
- What is the reason you prefer constructing the folder structure up front, before copying the files? You could make a single recursive function that goes through the call stack and, for each CalcJobNode, just copies inputs and outputs, creating the output directory if it doesn't already exist. Something like the following (not tested):
import pathlib

from aiida.common.links import LinkType
from aiida.orm import CalcJobNode, WorkChainNode


def workchain_dump(workchain: WorkChainNode, filepath: pathlib.Path) -> None:
    called_links = workchain.base.links.get_outgoing(link_type=(LinkType.CALL_CALC, LinkType.CALL_WORK)).all()

    for index, link_triple in enumerate(sorted(called_links, key=lambda link_triple: link_triple.node.ctime)):
        node = link_triple.node

        if link_triple.link_label != 'CALL':
            label = f'{index:02d}-{link_triple.link_label}-{node.process_label}'
        else:
            label = f'{index:02d}-{node.process_label}'

        if isinstance(node, WorkChainNode):
            workchain_dump(node, filepath / label)
        elif isinstance(node, CalcJobNode):
            node.base.repository.copy_tree(filepath / label)
            node.outputs.retrieved.copy_tree(filepath / label)
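For reference, calling this sketch on a stored work chain could look as follows (reusing the imports above; the pk and target directory are purely illustrative):
```python
from aiida import load_profile
from aiida.orm import load_node

load_profile()
workchain = load_node(1234)  # hypothetical pk of a top-level WorkChainNode
workchain_dump(workchain, pathlib.Path(f'dump-{workchain.pk}'))
```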
Let me know what you think.
Hi @sphuber, thanks a lot for your valuable feedback, it is highly appreciated!
That is perfectly fine
Happy to hear that :)
But if the original author is fine with it, they can also give you direct write access to their branch and update there directly.
In principle, that should be the default anyway if the author hasn't unset "Allowing edits by maintainers" when making the PR, no? Though, better to ask ofc.
Regarding your feedback:
- Yes, fully agreed! Felt a bit weird to have everything in the cmd_ file, but I wasn't sure where the code would belong otherwise.
- Ah, I wasn't actually aware of verdi node repo dump... maybe this is also something we could highlight better (see #6290), as it seems like a very useful command to know, especially for beginners? Though, at least for a PwCalculation it only dumps the input and submission file, and the .aiida folder*, so one would still be missing the retrieved files. Apart from that, is there a particular reason why verdi node repo dump implements another _copy_tree, rather than using node.base.repository.copy_tree()? I don't have a strong opinion about the API endpoints, though dumping multiple input/output files might be a functionality new users expect, e.g. for debugging their submissions without having to use the AiiDA API (even though they should :D).
- For the naming, I'm fine with {index}-{call_link_label}-{process_label}, and I agree it should be consistent. I abbreviated the AiiDA classes, but maybe better to keep the full names to avoid any confusion. We just need to make sure that, if multiple lower-level WorkChains are called in a higher-level WorkChain, or multiple CalcJobs in one WorkChain, they are properly numbered, but it seems like this is already taken care of with the link_label :+1:
- I split it into multiple functions to get somewhat of a separation of concerns, assuming that in the future we might want to add other WorkChain commands, where the conversion of the WorkChain to a directory tree construct could be useful. For example, if one wants to dump only the input files, or only the output files, this logic would need to be set within the recursive function, no? Additionally, as mentioned before, we might also want to implement a version where, given the necessary plugin being installed, more (all) input files are written (e.g. the pseudos for QE, ideally making the calculation fully restartable); then the question would again be where to implement this logic.
(*I assume this is intentional, as the .aiida/calcinfo.json contains the references to the other input files via local_copy_list and remote_copy_list, and to the output files via retrieve_list, so provenance is recorded and the full file structure could be reconstructed?)
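As a side note, a quick way to inspect those references from a dumped folder could look like this (a sketch; the keys are assumed to mirror the CalcInfo attribute names and the path is illustrative):
```python
import json
from pathlib import Path

calcinfo = json.loads(Path('raw_inputs/.aiida/calcinfo.json').read_text())
# Keys assumed to mirror the CalcInfo attributes mentioned above.
print(calcinfo.get('local_copy_list'))
print(calcinfo.get('remote_copy_list'))
print(calcinfo.get('retrieve_list'))
```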
maybe this is also something we could highlight better (see https://github.com/aiidateam/aiida-core/issues/6290), as it seems like a very useful command to know, especially for beginners?
Sure. The problem is though that there are so many potentially useful things that it is difficult to organize. Of course you could start adding them to a single page, but this would quickly get very chaotic and disorganized. Then you start organizing it and then you end up with the same problem as in the beginning where you have to know where to look to find something. Not saying we shouldn't try to solve this, just that it might not be as trivial as it initially seems.
Though, at least for a PwCalculation it only dumps the input and submission file, and .aiida folder*, so one would still be missing the retrieved files.
Sure, you would have to call it twice: once for the node itself, and once for its retrieved output. Just as your code is doing, you have inputdump and outputdump.
Apart from that, is there a particular reason why verdi node repo dump implements another _copy_tree, rather than using node.base.repository.copy_tree()?
Probably just because that command got added before I implemented copy_tree on the repository (which was quite recent) 😅 It'd be great if you want to open a PR to simplify that 👍
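For what it's worth, a minimal sketch of what the simplified command body could boil down to (pk and target path are illustrative):
```python
from pathlib import Path

from aiida import load_profile
from aiida.orm import load_node

load_profile()
node = load_node(1234)  # hypothetical pk
node.base.repository.copy_tree(Path('repo-dump'))  # dump the node's repository contents
```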
I don't have a strong opinion about the API endpoints, though dumping multiple input/output files might be a functionality new users might expect, e.g. for debugging their submissions without having to use the AiiDA API (even though they should :D).
Not quite sure what you mean here. But I am not saying that we shouldn't have the CLI end points. I am just saying that the functionality should be mostly implemented as normal API, such that people can also use it from the shell or a notebook. The CLI commands can then be minimal wrappers. But maybe I misunderstood your comment here?
We just need to make sure that, if multiple lower-level WorkChains are called in a higher-level WorkChain, or multiple CalcJobs in one WorkChain, that they are properly numbered, but it seems like this is already taken care of with the link_label 👍
That is what the {index} is doing, right? I am not relying on the link_label for this.
I split it into multiple functions to get somewhat of a separation of concerns, assuming that in the future, we might want to add other WorkChain commands, where the conversion of the WorkChain to a directory tree construct could be useful. For example, if one wants to only dump either the input files, or the output files, this logic would need to be set within the recursive function, no? Additionally, as mentioned before, we might also want to implement a version that, given the necessary plugin being installed, more (all) input files are written (e.g. the pseudos for QE, ideally, making the calculation fully restartable), then the question would be again where to implement this logic.
This really feels like a premature abstraction, and I don't think it would even necessarily be a good one. I think the difference in code length and complexity is significant, and I don't think the separate version gives a real benefit, but it is significantly more difficult to understand and maintain. Unless there are concrete benefits now, I would not overcomplicate things.
Not saying we shouldn't try to solve this, just that it might not be as trivial as it initially seems.
Yeah, it would definitely require putting some thought into it. Though, there are a few examples that seem to pop up frequently and that one doesn't come across when just going through the tutorials, i.e. gotocomputer.
Probably just because that command got added before I implemented copy_tree on the repository (which was quite recent) 😅 It'd be great if you want to open a PR to simplify that 👍
Alright, I see. Sure, will do!
Not quite sure what you mean here. But I am not saying that we shouldn't have the CLI end points. I am just saying that the functionality should be mostly implemented as normal API, such that people can also use it from the shell or a notebook. The CLI commands can then be minimal wrappers. But maybe I misunderstood your comment here?
I mean that having verdi calcjob inputdump and verdi calcjob outputdump would be nice to have in the CLI, but yes, I agree that it's best to have the actual (simple) implementation as part of the normal API, and just have wrappers for it in cmd_calcjob. So I think I just misunderstood your comment.
This really feels like a premature abstraction, and I don't think it would even necessarily be a good one. I think the difference in code length and complexity is significant, and I don't think the separate version gives a real benefit, but it is significantly more difficult to understand and maintain. Unless there are concrete benefits now, I would not overcomplicate things.
Good point there! I'll revert it back to an implementation with just the one recursive function, based on your snippet, in the appropriate location, making sure it works, and
That is what the {index} is doing right? I am not relying on the link_label for this.
seeing if the indexing checks out for some examples. Thanks again for your assistance with this, it's very insightful! :)
(Copying the commit message here, which I thought would appear here in that way, sry for that... :upside_down_face:)
Hey all,
I committed the changes from my local fork of the PR here directly now, as I think that will make working on it more straightforward. Hope that's OK for everybody.
Based on @qiaojunfeng's original implementation, there's a recursive function that traverses the WorkChain hierarchy. Now, as actual file I/O is only done for the CalcJobNodes involved in the workchain, I factored that out into its own function _calcjob_dump. I also moved these implementations into tools/dumping/processes.py so that they are available via the normal Python API. In cmd_workchain and cmd_calcjob the commands verdi workchain dump and verdi calcjob dump are only wrappers around the respective functions.
_calcjob_dump, by default, dumps the node.base.repository and node.outputs.retrieved using copy_tree, as well as the input_nodes of the CalcJobNode if they are of type SinglefileData or FolderData. I started working on the --use-prepare-for-submission option, but as previously indicated by @sphuber (also see here), it's not straightforward to make it work as intended, so I put that on hold for now and added a warning.
For each WorkChainNode and CalcJobNode, a selected set of the node properties is dumped to YAML files; extras and attributes can also be included via the CLI. I initially had a function for this, but realized that I was just passing these arguments all the way through, so I encapsulated that in the ProcessNodeYamlDumper class. Now, an instance is created when the respective CLI command is run, and the arguments are set as instance variables, with the instance being passed through rather than the original arguments. The dump_yaml method is then called at the relevant positions with the process_node and output_path arguments. Regarding relevant positions: to get the node YAML file for every ProcessNode involved, it's called in cmd_workchain and cmd_calcjob for the parent node, and subsequently for the outgoing links. Maybe there's a better way to handle that?
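For readers following along, a minimal sketch of what such a helper could look like (this is not the exact implementation in the PR; the selected properties and flag names are illustrative):
```python
import yaml


class ProcessNodeYamlDumper:
    """Dump a selected set of ProcessNode properties (optionally attributes/extras) to YAML."""

    def __init__(self, include_attributes: bool = False, include_extras: bool = False):
        self.include_attributes = include_attributes
        self.include_extras = include_extras

    def dump_yaml(self, process_node, output_path) -> None:
        data = {
            'pk': process_node.pk,
            'uuid': process_node.uuid,
            'ctime': str(process_node.ctime),
            'process_label': process_node.process_label,
        }
        if self.include_attributes:
            # Note: attribute values may need further serialization to be YAML-safe.
            data['attributes'] = dict(process_node.base.attributes.all)
        if self.include_extras:
            data['extras'] = dict(process_node.base.extras.all)
        with open(output_path / 'aiida_node_metadata.yaml', 'w', encoding='utf-8') as handle:
            yaml.safe_dump(data, handle)
```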
The other commands initially mentioned by @qiaojunfeng also seem very interesting and could probably easily be added based on his original implementation, though we should agree on the overall API first.
A few more notes:
- The CLI options for verdi workchain dump and verdi calcjob dump are basically the same and the code is duplicated. We could avoid that by merging them into one command, e.g. under verdi node dump; however, as a user, I would find verdi workchain dump and verdi calcjob dump more intuitive. Also, verdi node repo dump uses a former implementation of copy_tree, so I'd go ahead and update that, but in a separate PR (related: do we want to also implement (or change it to) just verdi node dump?).
- Regarding the YAML dumping, the ProcessNodeYamlDumper class is quite specific and just a little helper to get the job done here. Should we generalize such functionality, e.g. to resolve issue #3521 and allow for dumping of computers/codes to YAML, or keep these things separate?
- Currently, the pseudo directories are called pseudos__<X>, and I thought about splitting on the double underscore to just have one pseudos directory with a subdirectory for each element. Personally, I'd find that nicer than having a bunch of pseudos__<X> directories, but I'm not sure if the double underscore is based on general AiiDA name mangling or specific to aiida-quantumespresso.
Lastly, examples of the (default) outputs obtained from verdi workchain dump:
dump-462
├── 01-relax-PwRelaxWorkChain
│ ├── 01-PwBaseWorkChain
│ │ ├── 01-PwCalculation
│ │ │ ├── aiida_node_metadata.yaml
│ │ │ ├── node_inputs
│ │ │ │ └── pseudos__Si
│ │ │ │ └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
│ │ │ ├── raw_inputs
│ │ │ │ ├── .aiida
│ │ │ │ │ ├── calcinfo.json
│ │ │ │ │ └── job_tmpl.json
│ │ │ │ ├── _aiidasubmit.sh
│ │ │ │ └── aiida.in
│ │ │ └── raw_outputs
│ │ │ ├── _scheduler-stderr.txt
│ │ │ ├── _scheduler-stdout.txt
│ │ │ ├── aiida.out
│ │ │ └── data-file-schema.xml
│ │ └── aiida_node_metadata.yaml
│ ├── 02-PwBaseWorkChain
│ │ ├── 01-PwCalculation
│ │ │ ├── aiida_node_metadata.yaml
│ │ │ ├── node_inputs
│ │ │ │ └── pseudos__Si
│ │ │ │ └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
│ │ │ ├── raw_inputs
│ │ │ │ ├── .aiida
│ │ │ │ │ ├── calcinfo.json
│ │ │ │ │ └── job_tmpl.json
│ │ │ │ ├── _aiidasubmit.sh
│ │ │ │ └── aiida.in
│ │ │ └── raw_outputs
│ │ │ ├── _scheduler-stderr.txt
│ │ │ ├── _scheduler-stdout.txt
│ │ │ ├── aiida.out
│ │ │ └── data-file-schema.xml
│ │ └── aiida_node_metadata.yaml
│ └── aiida_node_metadata.yaml
├── 02-scf-PwBaseWorkChain
│ ├── 01-PwCalculation
│ │ ├── aiida_node_metadata.yaml
│ │ ├── node_inputs
│ │ │ └── pseudos__Si
│ │ │ └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
│ │ ├── raw_inputs
│ │ │ ├── .aiida
│ │ │ │ ├── calcinfo.json
│ │ │ │ └── job_tmpl.json
│ │ │ ├── _aiidasubmit.sh
│ │ │ └── aiida.in
│ │ └── raw_outputs
│ │ ├── _scheduler-stderr.txt
│ │ ├── _scheduler-stdout.txt
│ │ ├── aiida.out
│ │ └── data-file-schema.xml
│ └── aiida_node_metadata.yaml
├── 03-bands-PwBaseWorkChain
│ ├── 01-PwCalculation
│ │ ├── aiida_node_metadata.yaml
│ │ ├── node_inputs
│ │ │ └── pseudos__Si
│ │ │ └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
│ │ ├── raw_inputs
│ │ │ ├── .aiida
│ │ │ │ ├── calcinfo.json
│ │ │ │ └── job_tmpl.json
│ │ │ ├── _aiidasubmit.sh
│ │ │ └── aiida.in
│ │ └── raw_outputs
│ │ ├── _scheduler-stderr.txt
│ │ ├── _scheduler-stdout.txt
│ │ ├── aiida.out
│ │ └── data-file-schema.xml
│ └── aiida_node_metadata.yaml
└── aiida_node_metadata.yaml
and verdi calcjob dump:
dump-530
├── aiida_node_metadata.yaml
├── node_inputs
│ └── pseudos__Si
│ └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
├── raw_inputs
│ ├── .aiida
│ │ ├── calcinfo.json
│ │ └── job_tmpl.json
│ ├── _aiidasubmit.sh
│ └── aiida.in
└── raw_outputs
├── _scheduler-stderr.txt
├── _scheduler-stdout.txt
├── aiida.out
└── data-file-schema.xml
for a PwBandsWorkChain and one of its involved CalcJobNodes.
Thanks a lot @GeigerJ2 , great progress 👍
I started working on the --use-prepare-for-submission option, but as previously indicated by @sphuber (also see here), it's not straightforward to make it work as intended, so I put that on hold for now and added a warning.
The current branch seems to implement it though. Does it not work in its current state?
For each WorkChainNode and CalcJobNode, a selected set of the node properties is dumped to YAML files; extras and attributes can also be included via the CLI. I initially had a function for this, but realized that I was just passing these arguments all the way through, so I encapsulated that in the ProcessNodeYamlDumper class. Now, an instance is created when the respective CLI command is run, and the arguments are set as instance variables, with the instance being passed through rather than the original arguments. The dump_yaml method is then called at the relevant positions with the process_node and output_path arguments. Regarding relevant positions: to get the node YAML file for every ProcessNode involved, it's called in cmd_workchain and cmd_calcjob for the parent node, and subsequently for the outgoing links. Maybe there's a better way to handle that?
I think I would include the dump_yaml call inside recursive_dump (only when the node is a WorkChainNode) and inside _calcjob_dump itself. This way, the user doesn't have to remember to call it manually for the parent node. On that note, why not call them workchain_dump and calcjob_dump? The fact that workchain_dump is recursive can be made clear in the docstring, and for calcjob_dump it doesn't make sense to be recursive, as a CalcJob cannot call other processes.
The cli options for verdi workchain dump and verdi calcjob dump are basically the same and the code is duplicated. We could avoid that by merging it into one, e.g. under verdi node dump, however, as a user, I would find verdi workchain dump and verdi calcjob dump more intuitive.
You can also reduce the code duplication differently. If you were to go for my previous suggestion, then the only code besides the recursive_dump function call is just setting the default for the output path and the validation. That can simply be included in the option declaration itself, which can provide the default and do the validation. Then you can define the options once using the OverridableOption class that we use (see aiida.cmdline.params.options.main), and then you can easily reuse the option in both commands.
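A rough sketch of what such a shared option could look like, following that pattern (the flag names, type, and help text here are illustrative, not taken from the PR):
```python
import click

from aiida.cmdline.params.options.overridable import OverridableOption

# Defined once, then reused as a decorator in both `verdi calcjob dump` and `verdi workchain dump`:
#     @DUMP_PATH()
#     def dump(process, path): ...
DUMP_PATH = OverridableOption(
    '-p',
    '--path',
    type=click.Path(),
    required=False,
    help='Base path for the dump; a default derived from the process pk can be provided here.',
)
```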
Also, verdi node repo dump uses a former implementation of copy_tree, so I'd go ahead and update that, but in a separate PR (related: do we want to also implement (or change it to) just verdi node dump?)
I would not change the name, because it currently only dumps the content of the repo, so in my mind verdi node repo dump is accurate.
Regarding the YAML dumping, the ProcessNodeYamlDumper class is quite specific and just a little helper to get the job done here. Should we generalize such functionality, e.g. to resolve issue #3521 and allow for dumping of computers/codes to YAML, or keep these things separate?
Yes, I would be strongly in favor of this as opposed to one-off solutions. This PR would actually essentially provide this functionality: https://github.com/aiidateam/aiida-core/pull/6255
It specifies a pydantic model for each ORM entity class (so including ProcessNode), which allows pydantic to automatically dump the content to JSON (having it do YAML should also be trivial).
Currently, the pseudo directories are called pseudos__<X>, and I thought about splitting on the double underscore to just have one pseudos directory with a subdirectory for each element. Personally, I'd find that nicer than having a bunch of pseudos__<X> directories, but I'm not sure if the double underscore is based on general AiiDA name mangling or specific to aiida-quantumespresso.
You could indeed split on the double underscore and nest again, but then you would get paths like pseudos/Si/Si.pbe.blabla.upf. I guess it comes down to preference whether you prefer another level of nesting or keep the underscores in the folder name.
The double underscores indeed come from some name mangling. Inputs in AiiDA can be nested, e.g. the pseudos for the PwCalculation are passed as:
inputs = {
    'pseudos': {
        'Si': UpfData()
    }
}
The link labels are flat though, so nested namespaces are represented by double underscores.
In a sense then, splitting on the double underscores and making that a subdirectory would map quite nicely on nested input namespaces. But for me, either solution would be fine.
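As a tiny illustration of that mapping, splitting the mangled link label on the double underscore yields the nested layout directly (the helper name is hypothetical):
```python
from pathlib import Path


def link_label_to_path(base: Path, link_label: str) -> Path:
    """Map a flat (mangled) link label such as 'pseudos__Si' to a nested path."""
    return base.joinpath(*link_label.split('__'))


print(link_label_to_path(Path('node_inputs'), 'pseudos__Si'))  # node_inputs/pseudos/Si
```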
Hi @sphuber, sorry for just getting back to you now. Thanks, happy to hear that :)
The current branch seems to implement it though. Does it not work in its current state?
Currently, it's giving this warning:
Warning: CalcJobNode<pk> already has a `remote_folder` output: skipping upload
and the pseudo directory is empty. Though, I have to admit that I haven't looked much further into it until now.
I have just implemented your proposed changes. Thank you for your inputs and pointers. The CLI options are now OverridableOptions and are passed to both dumping functions. Though, I was not sure how to set the default for the path at this stage, as it depends on the pk of the process, so instead I added that as a function to cmdline.utils.defaults.
I would not change the name, because it currently only dumps the content of the repo, so in my mind verdi node repo dump is accurate.
I meant also extending the dumping so that not only the repo of the node is dumped. In that case verdi node dump could dump the repo and e.g. the retrieved outputs. But that is effectively what verdi calcjob dump and verdi workchain dump are doing now, so no need to change or rename anything.
Yes, I would be strongly in favor of this as opposed to one of solutions. This PR would actually essentially provide this functionality: #6255
Fully agree! The PR looks like quite the extensive effort. Is there any estimated timeline for it? That is to say, would merging here with a temporary solution be reasonable, or better to hold off until PR #6255 is merged. I'm riding along on the pydantic hype train, so if I can be of any help there, please let me know!
The double underscores indeed come from some name mangling.
Good, so my intuition was correct :)
But for me, either solution would be fine.
Same here, don't have a strong opinion on that.
Fully agree! The PR looks like quite the extensive effort. Is there any estimated timeline for it? That is to say, would merging here with a temporary solution be reasonable, or better to hold off until PR #6255 is merged. I'm riding along on the pydantic hype train, so if I can be of any help there, please let me know!
There is no fixed timeline, but it is ready for review and has been for a while. @mbercx and @edan-bainglass have both expressed interest in taking a look at this PR in the past, as well as its predecessor #6190. I would be in favor of having this in soon. Since it is backwards compatible, I don't see major blockers really.
Dear all, great work, let me try to be the "bad guy" as much as possible.
I use the case of the data for the CP2K oxides calculations in materialscloud:2023.81.
Suppose we are back 10 years ago: someone like me, "A", was doing the calculations and for some reason putting them OA. Suppose also that someone like me, "B", was upset with some CP2K-related results and wanted to check what went wrong.
A would have had in his working space several directories such as HfO2, SiO2, Ti2O3, ...; after all calculations had been completed and were ready to be published, he would have done tar -cvzf cp2k_oxides.tgz oxides_dir and put the file online.
The tough point here was... do I really have everything well organized? This problem does not exist anymore with AiiDA.
B downloads the tar file, executes tar -xvf cp2k_oxides.tgz, expects to see something like HfO2 somewhere, enters the HfO2 directory and inspects what was wrong.
What I did now with AiiDA was:
- find an available AiiDA environment and bring the cp2k_oxides.aiida archive file there
- execute verdi archive import cp2k_oxides.aiida
The two lines above will be replaced (this is a BIG PLUS) by a link to Renku or an equivalent solution.
To inspect the content of the archive:
- I executed verdi process list -a | grep -y work
- identified visually the "main workchains" (in this case Cp2kCommonRelaxWorkChain)
Here comes the big disadvantage now compared to 10 years ago:
Path to solution:
- assuming A assigned appropriate descriptions to each workchain, either we create a verdi command (or provide a script.py to be executed via verdi) to return the pk and description of, e.g., all "main workchains", or the instance activated e.g. in Renku already shows such a list (running a query-builder script automatically). The command verdi workchain dump pk could then use part of the available description in creating the dump directories
- assuming A was lazy, with no descriptions in the workchains, we could still imagine an automated script in Renku (or a verdi run script.py) that, based on the first input StructureData encountered in each "main workchain", provides the user with a guessed description. The same label could then be used by verdi workchain dump
Thanks @cpignedoli for the writeup. I agree that it can be a bit daunting for a user to start to "browse" the contents of an imported archive. Without a high-level human-readable description that accompanies the archive, it can be difficult for a user to know how to analyse the data, especially if they are not familiar with the plugins used and so don't necessarily know or understand the workchain layouts. I fully agree that it would be great if we could improve this intelligibility automatically somehow.
That being said, I want to caution against implementing any solutions that are too prejudiced by our own typical use cases and that build in assumptions about the structure of archives. For example, in your suggestion you implicitly assume that all archives are a simple collection of a number of top-level workchains of the same type and that their main input is a StructureData. This is not really the case in the general sense, though. You could have an archive with many different types of top-level workchains, and their main input may not be structures.
What I imagine would really help here is the following:
Have a dead easy way to "mount" an AiiDA archive that allows inspecting it
The Renku functionality is a great example of solving this. We have recently made a lot of progress (and continue to do so) to also make this available locally on a normal computer. It is already possible to create a profile from an archive that doesn't require PostgreSQL. I also have functionality (see https://github.com/aiidateam/aiida-core/pull/6303) that makes RabbitMQ completely optional. With these two pieces, users can start inspecting an archive in two steps: pip install aiida-core && verdi profile setup core.sqlite_zip path/to/archive.zip.
It should be easier to browse the provenance graph
Currently, there are a few ways to browse the contents of an archive. Use the various verdi commands, such as verdi group list and verdi process list. Alternatively, the QueryBuilder API can be used to query anything, but this requires coding knowledge and some understanding of the layout of the provenance graph to find what one is looking for. Finally, there is the graphical browser provided by the Materials Cloud. However, its current functionality is quite limited, in that it only ever shows a single central node with just the first layer of inputs/outputs. It would be great if we could have a representation similar to that produced by verdi node graph generate, but have it be interactive: allowing one to zoom in/out, click on particular nodes to show more info, etc.
As mentioned, I think we are very close to satisfactorily addressing the first part. It will be trivial to install AiiDA and mount an archive. The biggest step I think, is to build a fantastic visual browser that can be run locally. I think that there will always remain a need for a top-level human-readable description with archives to make them understandable to others (just like with any data publications online). One thing we could consider is to allow an archive creator to include this metadata, like a description, to the archive itself.
Hi @sphuber, I fully agree that my oversimplification can be too misleading; what you suggest/anticipate goes perfectly in the right direction, and "easy browsing" for a general case would of course be a great success. I will test the pip install aiida-core && verdi profile setup core.sqlite_zip path/to/archive.zip; I was not aware of this, this is great.
Great comments, I agree with what has been said. I think indeed that there is some responsibility on whoever puts the archive online to provide minimal guidance to users. This could just be, e.g.: 1. a CSV file attached to the published entry (with uuids and any other relevant info to explain what is "top level"). For instance, in the case of the verification paper, it could have columns with code name, element name, and configuration, followed by the uuid. This is enough for a human to quickly find the relevant workchain and pass it to the dump command by just looking at the CSV. Or 2. a short script that runs a query, loops over the results, and prints such a CSV or an equivalent file/printout. This has the advantage of also providing a bit more context on how the archive is organised, e.g. whether things are in groups, whether extras are used to label nodes, etc. (and encourages people to put all this info in AiiDA and not just in an external text file). We have good examples of such scripts currently in the README of some MC archive entries (I think MC2D, automated wannierisation, the verification paper, etc.). The additional thing we can do is to have an automated way for these scripts to autopopulate the first Jupyter notebook cells that appear when loading the archive in, e.g., Renkulab. So people just need to click a button and execute the cells to start browsing.
If we agree on this and implement it, we can then update the MC archive entries of a few of our own repos with such scripts (in the correct format to be autoloaded), as good examples to follow. And then invest a bit in "training/communication": e.g. make a blog post explaining the philosophy behind them and linking to those examples, have looking into the data of others be one of the first steps of the next tutorial we organise, etc.
We can always come up with more general ways to do this in the future, but trying to find a general solution that works automatically for any archive and for people who don't know AiiDA is very challenging (if at all possible), while I think the approach I suggest is quite practical to implement; the scripts are easy to provide for data owners and easy to use for people who just want to look into the data (again, in the template for the Jupyter notebook you can put both the short querying script that prints out uuids and some minimal text to guide people on what to do, e.g., "get the uuid from the table above for the system you are interested in, and run this command verdi workchain dump UUID to get...").
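To make the second suggestion concrete, here is a minimal sketch of such a query script (it assumes the main workchains can be identified by their process label, here the Cp2kCommonRelaxWorkChain mentioned earlier, and that useful descriptions were set on them):
```python
from aiida import load_profile
from aiida.orm import QueryBuilder, WorkChainNode

load_profile()  # e.g. a core.sqlite_zip profile created from the archive

qb = QueryBuilder()
qb.append(
    WorkChainNode,
    filters={'attributes.process_label': 'Cp2kCommonRelaxWorkChain'},
    project=['uuid', 'description'],
)

print('uuid,description')
for uuid, description in qb.all():
    print(f'{uuid},{description}')
```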
Thanks @GeigerJ2, could you please update your branch, address the conflicts and make sure the tests run?
Hi @sphuber, thanks for getting back to this here! I have rebased the branch to incorporate the recently merged changes on main (thanks to @mbercx for the assistance on this). I also resolved the conflicts that came up, so now the changes of the current individual commit that adds the dumping functionality should be nicely isolated and no longer relative to the older version from when the PR was originally created.
The dumping tests are passing; however, mypy still complains about:
https://github.com/qiaojunfeng/aiida-core/blob/4daa2fc0ccbca4f68df06a425a8e5c5c33b2c259/src/aiida/tools/dumping/processes.py#L467
with:
error: Incompatible types in assignment (expression has type "Process", variable has type "CalcJob") [assignment]
Still gotta figure this one out...
error: Incompatible types in assignment (expression has type "Process", variable has type "CalcJob") [assignment]
Honestly, I think this is a case where an ignore statement is justified. We know why it is raising the error, but it is fine for us to do it. We don't have to waste more time on this.
That being said, you have bigger problems though :D Currently the tests are also not running. For Python 3.9 it is because you are using type annotation syntax that wasn't added until Python 3.10. See this build error: https://github.com/aiidateam/aiida-core/actions/runs/8633110347/job/23665395861?pr=6276
You essentially need to add from __future__ import annotations to the top of the file aiida/tools/dumping/processes.py (and possibly others) - anywhere you use modern syntax like the union operator str | None or built-in types as generics like list[str] and dict[str, str].
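For example (a small sketch; the function name and signature are just illustrative):
```python
from __future__ import annotations

# With the future import, the modern annotation syntax below also works on Python 3.9.
def dump(path: str | None = None, labels: list[str] | None = None) -> dict[str, str]:
    return {}
```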
But even if you fix that, there are 5 unit tests failing in tests/cmdline/commands/test_process.py
. And last but not least, the docs build is failing. There are probably just some warnings about missing references or formatting. You can look in the logs: https://readthedocs.org/projects/aiida-core/builds/24024003/
Yeah, I agree. At this point, it starts feeling like a bit of a waste of time. I'll add the ignore statement.
Ah, OK, well, that is good to know, at least. Wasn't really aware of that ^^
Yes, the tests in tests/cmdline/commands/test_process.py are failing because the @options.MOST_RECENT_NODE() option for verdi process status somehow got lost when resolving the merge conflicts during the rebase. I have already resolved that locally and will push it with the other changes tomorrow, so that everything hopefully and finally runs through 🚀
Alright, the readthedocs build still fails, but if somebody has the time and would like to dogfood this, please be my guest :)
The added commands are verdi process dump and verdi calcjob dump.
@sphuber @mbercx @khsrali @superstar54 @mikibonacci @eimrek
Hey guys, I finally started dogfooding this beauty, but quite quickly ran into an issue:
❯ rm -rf *
❯ ls
❯ verdi process dump 62624
Critical: Some files present in the dumping directory. Delete manually and try again.
❯ ls
dump-PwCalculation-62624/
The dump-PwCalculation-62624 directory is empty. I double-checked, and my aiida-core is at 3adf0c219.
Edit: Fixed in 4657ae132.
Good catch, @mbercx, thank you! Could just reproduce this. I think I've been so used to using the --overwrite option that this bug has crept through. Shouldn't take long to fix!
Thanks @GeigerJ2! Can confirm this fixed the problem, am trying to see if I can break anything else. ^^
One question: was the option to write the inputs through the prepare_for_submission option ever implemented, e.g. in case I want to rerun a pw.x run and have the pseudos in the right folder?
Happy to hear that :)
Yes, it was, see here: https://github.com/aiidateam/aiida-core/blob/f49a48e9773dc62b706010a18d22ebc018f06b09/src/aiida/tools/dumping/processes.py#L410-L447
However, as the implementation was very hacky and re-running a multi-step workflow isn't possible without actually re-submitting intermediate steps anyway, we shelved it for now to ensure that this feature makes it into the 2.6 release. We can still add this as an additional option later.