
Add CLI command to dump inputs/outputs of `CalcJob`/`WorkChain`

Open qiaojunfeng opened this issue 5 months ago • 35 comments

Currently, we have verdi calcjob inputcat/outputcat for CalcJob, but there is no straightforward way to dump the inputs/outputs of a WorkChain to files. Use cases for this include:

  • debugging the underlying code (e.g. QE pw.x) more easily, using the exact same inputs generated by AiiDA, without working on the Python side
  • easily sending the files to someone who is not familiar with AiiDA, to reproduce/analyze the results

Here I added an example CLI command verdi workchain inputsave, to recursively iterate through the called descendants of a workchain and save the input file of each called CalcJob into a folder. For example, the structure of the saved folder for a Wannier90BandsWorkChain is


├── 1-scf
│   └── 1-iteration_01
│       └── aiida.in
├── 2-nscf
│   └── 1-iteration_01
│       └── aiida.in
├── 3-wannier90_pp
│   └── 1-iteration_01
│       └── aiida.win
├── 4-pw2wannier90
│   └── 1-iteration_01
│       └── aiida.in
└── 5-wannier90
    └── 1-iteration_01
        └── aiida.win

The current PR is working, but there are still plenty of tasks to be done to make it generic enough.

TODO

  • decide which namespace to use for the command: should it be under verdi calcjob, a new verdi workchain, or something else that combines both calcjob and workchain (since the command now works for both CalcJob and WorkChain)?
  • decide on the folder structure: should the UUID be included in the folder names? Currently I am using link labels

Comments from Giovanni

  • add a Readme or Metadata.txt or similar in each folder, with info e.g. on time, UUID, ... (also to help AiiDA users find the nodes again)
  • have a way to also dump any file from input links (e.g. have subfolders inputnode/pseudo_Ba, inputnode/pseudo_Ti, ...)

Further optional TODO

In addition, there are several other convenience CLI commands I often use; they are available here: https://github.com/aiidateam/aiida-wannier90-workflows/blob/main/src/aiida_wannier90_workflows/cli/node.py

  • inputcat/outputcat for a workchain, which by default shows the last calcjob and allows the user to specify a link_label to go to the corresponding calcjob
  • cleanworkdir for a workchain, which cleans up all the nested calcjobs of a workchain
  • gotocomputer for workchain / RemoteData / RemoteStashFolderData

It would be great if someone versed in aiida-core could use these as inspiration and write some generic CLI commands that cover these use cases. Thanks!

qiaojunfeng avatar Feb 06 '24 18:02 qiaojunfeng

As a note, I assigned this to Ali and Julian (they are not yet in the project so I cannot tag them), and @eimrek was involved in the discussions with @qiaojunfeng

giovannipizzi avatar Feb 07 '24 12:02 giovannipizzi

Thanks @qiaojunfeng. Interesting idea, but I have some questions about the design. How does the output structure of this command make it easier to rerun the calculations manually? If I understand correctly, the command only dumps the main input file, but that is typically not sufficient as there can be multiple input files for a CalcJob. I feel that the implementation may be a bit tied to your use case of QE, where there is a single principal input file. If we were to accept this in aiida-core, we need to make sure there are no assumptions that are based on a specific code; it should be generally applicable.

That being said, even for QE I am not sure if this is very useful. As is, the generated output structure cannot be directly used to launch the calculations, can it? For example, the pseudopotentials are not written. So a user will have to manually add those (for each subfolder). And what about restart files? For the NSCF, the restart files won't be found and the user will have to manually add those.

What I think the command should really do is copy all input files such that the calculation can be relaunched straight away. The implementation will be a bit more involved though. The total set of input files is only directly available from the remote working directory, but that is not stored permanently by AiiDA and can disappear, so we cannot rely on that. Some input files are copied to the repo of the CalcJobNode, which can be easily dumped using verdi node repo dump, but that also doesn't contain all files. Some of the written input files can be excluded by the CalcJob plugin through the provenance_exclude_list, and other files are written only to the remote working directory by the daemon, for example files specified through FolderData, RemoteData and SinglefileData nodes. So to reconstruct the complete set of input files to an arbitrary path, the command should really call CalcJob.presubmit followed by aiida.engine.daemon.execmanager.upload_calculation. This would therefore require some refactoring in aiida-core to make this nicely doable.

That all being said, in some cases we may not want to copy all input files. Taking QE as an example, if you want to dump the inputs for an NSCF calculation, the complete dump would include the restart files, but these can get very big and so in some cases the user may not want to dump these. So you would have to add options to exclude these.

I can see how this command is very useful for workchains of QE calculations, but in order to justify adding this to aiida-core, we have to think hard how this can be useful for the generic use case as well. Otherwise it might just be better to keep it in the code specific plugin package.

sphuber avatar Feb 08 '24 09:02 sphuber

Hi @sphuber - indeed we have to make it generic. As you can see in one of my comments that @qiaojunfeng reported, I had thought about this. One idea was to report the raw files of any direct input and link them with the correct input link name. So e.g. for QE you would have a subfolder (per calcjob), for instance raw_inputs/, and inside it, for instance, pseudo_Ba/Ba.UPF, pseudo_Ti/Ti.UPF, ...

Then we also considered that it might instead be better to just run the presubmit, essentially doing a dry run; this should already do essentially everything (also putting the pseudos in place, and leaving information on which files should be copied directly on the remote for restarts).

Note that the goal here is not that one should necessarily be able to just run again in the same folder and everything works. (Of course, if we achieve this, this is much better). The goal is that an experienced user (not of AiiDA, but of the underlying code - in our case, this was triggered by a developer of CP2K not interested in learning AiiDA) can inspect the raw inputs and outputs, to see what was run. It's OK in my opinion to give minimal "human-readable" instructions to them (possibly plugin specific) on what to do if they really want to rerun (things like: "for each folder, move the pseudos from folder XX to folder YY and rename it accordingly; if it's a restart, you first have to copy this folder from the parent to the restart folder, where "this folder" is written in e.g. a 'calcinfo.json' or similar file; etc.). As long as these are short and minimal, a code developer will understand what to do (possibly even without).

Again, if we can really make this "just runnable" (at least without restarts) that's OK, but I don't think it's crucial for the main goal of exposing the raw files to interested people (who don't want to dig into how AiiDA works).

The PR itself needs some cleanup indeed. The suggestion is that @khsrali and @GeigerJ2 work on this in the next couple of weeks, and then we discuss if this is general enough. But of course further comments are welcome also to help them fix this PR.

giovannipizzi avatar Feb 08 '24 22:02 giovannipizzi

Hi @qiaojunfeng, thank you for this contribution! I think this is a very useful feature for people that are starting with AiiDA as it eases the transition from the directory tree to a more AiiDA-centric approach. I made some changes to the code locally. Is it OK for you if I commit them and continue working directly on top of this PR?

During development, I used a simple, exemplary PwBandsWorkChain (I'll also look at some aiida-vasp workchains in the future). Just to mention the three main points that came up:

  • The call to get_option(): https://github.com/aiidateam/aiida-core/blob/1410179d0d6021aacb64d43da77bdc3341590de3/src/aiida/cmdline/commands/cmd_workchain.py#L53 failed for the create_kpoints_from_distance step, as the resulting CalcFunctionNode does not implement that method. Thus, I filtered the list of links via: links = [link for link in links if not isinstance(link.node, orm.CalcFunctionNode)] to remove those. While this could also be resolved by checking the node type of the links in the _get_input_filename function, in the current implementation a subdirectory is still created for these CalcFunctionNodes, even though there are no SinglefileData objects to be written to disk (though, as we'll continue working on it, this point will probably not be relevant).
  • For the steps for which the link_labels were not specifically set in the plugin, I'm using the process_labels to get a more descriptive directory tree, e.g.:
      wc-387
      ├── 1-relax
      │   ├── 1-PwBaseWorkChain
      │   │   └── 1-PwCalculation
      │   └── 2-PwBaseWorkChain
      │       └── 1-PwCalculation
      ├── 2-scf
      │   └── 1-PwCalculation
      └── 3-bands
          └── 1-PwCalculation
    
  • I'm dumping the outputs using the copy_tree() method via .outputs.retrieved from the respective CalcJobNode. I was expecting it would be similarly straightforward for multiple input files, but haven't found a way so far (I'll try the suggestion given by @giovannipizzi).

Further, I think dumping the main input/output files for easy inspection vs. dumping everything to be able to fully restart the simulations are two different use cases, and it would already be valuable to have the former. In general, I also think the user should have the option to manually include/exclude additional files.

Apart from the QE-specific assumption of a single input file, the function seems conceptually generic, and I would expect it to work similarly for other workchains, or am I missing something?

And lastly, regarding the namespace, it seems natural to assume that something like verdi workchain could also exist, and probably more functionalities that fit into cmd_workchain will come up over time. Though, in this case, I'd only keep the recursive logic here and make the dumping of input/output files accessible for the CalcJobs themselves. Curious to hear your thoughts!

GeigerJ2 avatar Feb 13 '24 18:02 GeigerJ2

Thanks for your comments @GeigerJ2 . Just some quick pointers I noticed when reading them:

Thus, I filtered the list of links via: links = [link for link in links if not isinstance(link.node, orm.CalcFunctionNode)] to remove those.

This is indeed a solution, but there might be a better one. Instead of using a "black list" of sorts with known exceptions, you might want to just catch the AttributeError that is thrown when get_option is not defined. This will make the code more robust as there might be other cases you didn't think of. I know for sure that this one will fail for example if there is a WorkFunctionNode.
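
For instance, a minimal sketch of that approach, reusing the links list and the get_option call referenced above (the option name follows the original implementation quoted earlier):

# Skip nodes that do not define options (e.g. CalcFunctionNode, WorkFunctionNode)
# instead of maintaining an explicit blacklist of node types.
for link in links:
    try:
        input_filename = link.node.get_option('input_filename')
    except AttributeError:
        # The node class does not implement `get_option`; nothing to dump here.
        continue
    # ... proceed to dump the file named `input_filename` for this CalcJobNode ...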

For the steps for which the link_labels were not specifically set in the plugin, I'm using the process_labels to get a more descriptive directory tree, e.g.:

Maybe a combination of the two would be even better? If you have multiple subsequent calls of the same subprocess, having just the same name prefixed with an integer is also not so informative. If a call_link_label is defined, it means the work chain developer put it there intentionally.

I'm dumping the outputs using the copy_tree() method via .outputs.retrieved from the respective CalcJobNode. I was expecting it would be similarly straightforward for multiple input files, but haven't found a way so far (I'll try the suggestion given by @giovannipizzi).

Input files (at least some of them) of a CalcJobNode are stored in its own repository. So you can get them with the same API, e.g., CalcJobNode.base.repository.copy_tree(). As I mentioned in my previous post, however, this does not include all inputs. Files that were copied using the local_copy_list, or remote_copy_list constructs will not be there. Similarly, even files that were copied to the sandbox folder by the plugin, but that were listed in the provenance_exclude_list, will not be present. For QE, typically this will include pseudos, so you won't find them in the calcjobnode's repository. And this brings me back to my earlier point...

Apart from the QE-specific assumption of a single input file, the function seems conceptually generic, and I would expect it to work similarly for other workchains, or am I missing something?

It might "work" as in there won't be any exceptions, but I still have my doubts whether it is really all that useful for the general case. Defining a default input file is not required and maybe not all that common. The current implementation will even raise an exception in that case. It says to specify a path, but that is not implemented and wouldn't even work if you have different CalcJobs in a tree that have different input paths. So I think in most cases you will get only a fraction of input files, if any at all.

If this command is to be useful, we have to go about it slightly differently. I would suggest we have to go through the same logic as an actual CalcJob goes through in writing its input based on the input nodes that are stored in the database. This functionality is essentially in the upload_calculation function of aiida.engine.daemon.execmanager. Instead of the current implementation, we should try to reuse that code. This will take care of including files that were copied using local_copy_list, and we could refactor it to ignore the provenance_exclude_list. We would have to refactor the code a bit to allow the remote_copy_list to actually copy from the remote to the localhost. @mbercx has an open PR https://github.com/aiidateam/aiida-core/pull/6196 that already proposes to make this logic a bit more permissive. We could think of allowing it also to copy to the localhost.

This solution would require quite some refactoring, but I think this might anyway not be such a bad idea. The code there can be a bit dense. But if we think this might be an interesting approach, I'd be happy to go through it during a call some time and give some pointers and guide you in performing the refactoring.

sphuber avatar Feb 13 '24 19:02 sphuber

Hi all, great comments! I also thought of using the same logic that is used in the presubmit step. However, I realize now that this will require that the people who want to get the files will need to install all the plugins of the work chains they want to traverse. Think in 5 years' time: they might be using a new AiiDA and the plugins will not be installable anymore with that version, or might be in a new, non-backward-compatible version, etc.

We might decide to give both options, but I think the current approach is general enough to start being useful. You can still dump a YAML version of the info in the local copy list, remote copy list, etc. (we anyway already dump a JSON of those), and explain to people what it means. And in any case also the presubmit approach, for the remote copy list, can only explain what needs to be done and not do it.

The design idea we had is to have, optionally, the possibility for a plugin to provide its own custom way to dump files. So if the user has the plugin installed and the plugin supports it, the dump will be done in a more plugin-specific way. But the currently suggested approach, to me, is a good default when such a plugin-specific function does not exist.

In practice, I think that the best is to try to run this function against a few common use cases (qe, wannier, vasp, fleur etc) and see what we get and what we miss. In practice, I would do it for the various archives of the recent common workflows paper on verification (that is the actual use case that triggered this PR), and then ask the respective code developers if the resulting folder is clear enough and contains all the info they might need.

giovannipizzi avatar Feb 13 '24 22:02 giovannipizzi

However, I realize now that this will require that the people who want to get the files will need to install all the plugins of the work chains they want to traverse. Think in 5 years' time: they might be using a new AiiDA and the plugins will not be installable anymore with that version, or might be in a new, non-backward-compatible version, etc.

Good point. It would require the plugin to be installed to do what I am proposing. I can see how that is limiting.

And in any case also the presubmit approach, for the remote copy list, can only explain what needs to be done and not do it.

What I was suggesting was not just calling the presubmit, but also the aiida.engine.daemon.execmanager.upload_calculation, which reads those instructions generated by the plugin and copies local and remote files. This would just require some refactoring but in principle it is perfectly possible. But this still requires the actual plugin to be installed, as CalcJob.prepare_for_submission is called before that.

In practice, I think that the best is to try to run this function against a few common use cases (qe, wannier, vasp, fleur etc) and see what we get and what we miss. In practice, I would do it for the various archives of the recent common workflows paper on

Note that these are still just use cases that all come from a single domain, namely the original DFT users, and don't necessarily represent the general case. If this is really a use case that stems from these plugins, and especially from the common workflow project, is it maybe not also an idea to consider adding it to the aiida-common-workflows package?

sphuber avatar Feb 13 '24 22:02 sphuber

Thank you for your helpful and valuable feedback, @sphuber!

you might want to just catch the AttributeError that is thrown when get_option is not defined.

This was my original solution, but in the initial implementation of @qiaojunfeng, it led to the creation of a directory also for these CalcFunction steps of the workchain that did not use/create input/output files.

This will make the code more robust as there might be other cases you didn't think of. I know for sure that this one will fail for example if there is a WorkFunctionNode.

Yes, good point!

Maybe a combination of the two would be even better?

Agreed!

Input files (at least some of them) of a CalcJobNode are stored in its own repository. So...

Just learned something new about the internals :)

I'd be happy to go through it during a call some time and give some pointers and guide you in performing the refactoring.

If you have the time, that would be great!

In practice, I would do it for the various archives of the recent common workflows paper

Thanks, @giovannipizzi for this pointer, that should give me plenty of workchains to try out when working on this.

Note that these are still just use cases that all come from a single domain, namely the original DFT users, and don't necessarily represent the general case.

The general use case should be WorkChains that call CalcJobs that have file I/O on HPC (or locally), no? While the codes given by @giovannipizzi are from the field of DFT, I don't think they are otherwise too specific in that regard? Once we have use cases from the weather and climate community in the SwissTwins project, we can also see how the functionality could be of use there.

GeigerJ2 avatar Feb 15 '24 11:02 GeigerJ2

Hello everybody, I took the PR and created a branch from it on my local fork of core to be able to work freely on it, without messing anything else up. I hope this follows proper GitHub etiquette - if not, please let me know :) I did some changes and a first proof-of-concept for dumping the main WorkChain files works. Though, I wanted to run it by you before continuing with it. A more in-depth explanation is given in the commit message. Thank you!

GeigerJ2 avatar Feb 19 '24 17:02 GeigerJ2

Thanks a lot @GeigerJ2

I took the PR and created a branch from it on my local fork of core to be able to work freely on it, without messing anything else up. I hope this follows proper GitHub etiquette

That is perfectly fine because, as you say, the original remains untouched. There is no set rule here really and it depends on the situation. It is possible for you to, when you are done, make a PR to the original branch, which then automatically updates this PR. But if the original author is fine with it, they can also give you direct write access to their branch and update there directly. Or just close the PR and you open a new one. Fully up to you and @qiaojunfeng in this case.

I did some changes and a first proof-of-concept for dumping the main WorkChain files works. Though, I wanted to run it by you before continuing with it. A more in-depth explanation is given in the commit message. Thank you!

As requested, I will first give some high-level feedback on the design and some common principles we tend to keep in aiida-core.

  1. For any CLI functionality that we provide, it is good practice to first make it available just in the Python API, and then have the CLI command be a very thin wrapper around it. This makes it possible to use the functionality not just from verdi but also directly from Python. The functionality you are adding here could typically be placed somewhere in aiida.tools, which contains more functions of this type that perform some post-processing of provenance data.
  2. Do we really need explicit CLI/API endpoints for the calcjob inputdump and outputdump? These are essentially just single method calls: CalcJobNode.base.repository.copy_tree and CalcJobNode.outputs.retrieved.copy_tree. I don't think we need specific functions for this in the API, and the CLI already has verdi node repo dump, which provides this for both cases.
  3. Naming: I see there is quite a bit of logic to determine the naming of the directory structure. Would it be useful to keep the naming as close as possible to what other tools would give? I am thinking of verdi node graph generate and verdi process status, which essentially also give an overview of the call stack. It would be useful, I think, if the naming matches between these tools, otherwise it may be difficult to match the various outputs. Would the following format perhaps work: {index}-{call_link_label}-{process_label}, where call_link_label is omitted if it is not defined or is just the default CALL?
  4. What is the reason you prefer constructing the folder structure up front, before copying the files? You could make a single recursive function that goes through the call stack and, for each CalcJobNode, just copies inputs and outputs, creating the output directory if it doesn't already exist. Something like the following (not tested):
import pathlib
from aiida.common.links import LinkType
from aiida.orm import CalcJobNode, WorkChainNode


def workchain_dump(workchain: WorkChainNode, filepath: pathlib.Path) -> None:
    # All processes called by this work chain, in chronological order.
    called_links = workchain.base.links.get_outgoing(link_type=(LinkType.CALL_CALC, LinkType.CALL_WORK)).all()

    for index, link_triple in enumerate(sorted(called_links, key=lambda link_triple: link_triple.node.ctime)):
        node = link_triple.node

        # Include the call link label in the directory name only if it was set explicitly.
        if link_triple.link_label != 'CALL':
            label = f'{index:02d}-{link_triple.link_label}-{node.process_label}'
        else:
            label = f'{index:02d}-{node.process_label}'

        if isinstance(node, WorkChainNode):
            # Recurse into sub work chains.
            workchain_dump(node, filepath / label)
        elif isinstance(node, CalcJobNode):
            # Dump the input files stored in the node repository and the retrieved outputs.
            node.base.repository.copy_tree(filepath / label)
            node.outputs.retrieved.copy_tree(filepath / label)

Let me know what you think.

sphuber avatar Feb 20 '24 09:02 sphuber

Hi @sphuber, thanks a lot for your valuable feedback, it is highly appreciated!

That is perfectly fine

Happy to hear that :)

But if the original author is fine with it, they can also give you direct write access to their branch and update there directly.

In principle, that should be the default anyway if the author hasn't unset "Allowing edits by maintainers" when making the PR, no? Though, better to ask ofc.

Regarding your feedback:

  1. Yes, fully agreed! Felt a bit weird to have everything in the cmd_ file, but I wasn't sure where the code would belong otherwise.

  2. Ah, I wasn't actually aware of verdi node repo dump... maybe this is also something we could highlight better (see #6290), as it seems like a very useful command to know, especially for beginners? Though, at least for a PwCalculation it only dumps the input and submission file, and .aiida folder*, so one would still be missing the retrieved files. Apart from that, is there a particular reason why verdi node repo dump implements another _copy_tree, rather than using node.base.repository.copy_tree()?

    I don't have a strong opinion about the API endpoints, though dumping multiple input/output files might be a functionality new users might expect, e.g. for debugging their submissions without having to use the AiiDA API (even though they should :D).

  3. For the naming, I'm fine with {index}-{call_link_label}-{process_label}, and I agree it should be consistent. I abbreviated the AiiDA classes, but maybe better to keep the full names to avoid any confusion. We just need to make sure that, if multiple lower-level WorkChains are called in a higher-level WorkChain, or multiple CalcJobs in one WorkChain, they are properly numbered, but it seems like this is already taken care of with the link_label :+1:

  4. I split it into multiple functions to get somewhat of a separation of concerns, assuming that in the future, we might want to add other WorkChain commands, where the conversion of the WorkChain to a directory tree construct could be useful. For example, if one wants to only dump either the input files, or the output files, this logic would need to be set within the recursive function, no? Additionally, as mentioned before, we might also want to implement a version that, given the necessary plugin being installed, more (all) input files are written (e.g. the pseudos for QE, ideally, making the calculation fully restartable), then the question would be again where to implement this logic.

(*I assume this is intentional, as the .aiida/calcinfo.json contains the references to the other input files via local_copy_list and remote_copy_list, and to the output files via retrieve_list, so provenance is recorded and the full file structure could be reconstructed?)

GeigerJ2 avatar Feb 20 '24 15:02 GeigerJ2

maybe this is also something we could highlight better (see https://github.com/aiidateam/aiida-core/issues/6290), as it seems like a very useful command to know, especially for beginners?

Sure. The problem is though that there are so many potentially useful things that it is difficult to organize. Of course you could start adding them to a single page, but this would quickly get very chaotic and disorganized. Then you start organizing it and then you end up with the same problem as in the beginning where you have to know where to look to find something. Not saying we shouldn't try to solve this, just that it might not be as trivial as it initially seems.

Though, at least for a PwCalculation it only dumps the input and submission file, and .aiida folder*, so one would still be missing the retrieved files.

Sure, you would have to call it twice, once for the node itself, and once for its retrieved output. Just as your code is doing, you have inputdump and outputdump.

Apart from that, is there a particular reason why verdi node repo dump implements another _copy_tree, rather than using node.base.repository.copy_tree()?

Probably just because that command got added before I implemented copy_tree on the repository (which was quite recent) 😅 It'd be great if you want to open a PR to simplify that 👍

I don't have a strong opinion about the API endpoints, though dumping multiple input/output files might be a functionality new users might expect, e.g. for debugging their submissions without having to use the AiiDA API (even though they should :D).

Not quite sure what you mean here. But I am not saying that we shouldn't have the CLI end points. I am just saying that the functionality should be mostly implemented as normal API, such that people can also use it from the shell or a notebook. The CLI commands can then be minimal wrappers. But maybe I misunderstood your comment here?

We just need to make sure that, if multiple lower-level WorkChains are called in a higher-level WorkChain, or multiple CalcJobs in one WorkChain, they are properly numbered, but it seems like this is already taken care of with the link_label 👍

That is what the {index} is doing right? I am not relying on the link_label for this.

I split it into multiple functions to get somewhat of a separation of concerns, assuming that in the future, we might want to add other WorkChain commands, where the conversion of the WorkChain to a directory tree construct could be useful. For example, if one wants to only dump either the input files, or the output files, this logic would need to be set within the recursive function, no? Additionally, as mentioned before, we might also want to implement a version that, given the necessary plugin being installed, more (all) input files are written (e.g. the pseudos for QE, ideally, making the calculation fully restartable), then the question would be again where to implement this logic.

This really feels like a premature abstraction, and I don't think it would even necessarily be a good one. I think the difference in code length and complexity is significant and I don't think the separate version gives a real benefit, but it is significantly more difficult to understand and maintain. Unless there are concrete benefits now, I would not overcomplicate things.

sphuber avatar Feb 20 '24 16:02 sphuber

Not saying we shouldn't try to solve this, just that it might not be as trivial as it initially seems.

Yeah, it would definitely require putting some thought into it. Though, there are a few examples that seem to pop up frequently, and that one doesn't come across when just going through the tutorials, e.g. gotocomputer.

Probably just because that command got added before I implemented copy_tree on the repository (which was quite recent) 😅 It'd be great if you want to open a PR to simplify that 👍

Alright, I see. Sure, will do!

Not quite sure what you mean here. But I am not saying that we shouldn't have the CLI end points. I am just saying that the functionality should be mostly implemented as normal API, such that people can also use it from the shell or a notebook. The CLI commands can then be minimal wrappers. But maybe I misunderstood your comment here?

I mean that having verdi calcjob inputdump and verdi calcjob outputdump would be nice to have in the CLI, but yes, I agree that it's best to have the actual (simple) implementation as part of the normal API, and just have wrappers for it in cmd_calcjob. So I think I just misunderstood your comment.

This really feels like a premature abstraction, and I don't think it would even necessarily be a good one. I think the difference in code length and complexity is significant and I don't think the separate version gives a real benefit, but it is significantly more difficult to understand and maintain. Unless there are concrete benefits now, I would not overcomplicate things.

Good point there! I'll revert it back to an implementation with just the one recursive function, based on your snippet, in the appropriate location, making sure it works, and

That is what the {index} is doing right? I am not relying on the link_label for this.

seeing if the indexing checks out for some examples. Thanks again for your assistance with this, it's very insightful! :)

GeigerJ2 avatar Feb 20 '24 21:02 GeigerJ2

(Copying the commit message here, which I thought would appear here in that way, sorry for that... :upside_down_face:)

Hey all,

I committed the changes from my local fork of the PR now here directly, as I think that will make working on it more straightforward. Hope that's OK for everybody.

Based on @qiaojunfeng's original implementation, there's a recursive function that traverses the WorkChain hierarchy. Now, as actual file I/O is only done for the CalcJobNodes involved in the workchain, I factored that out into its own function _calcjob_dump. I also moved these implementations into tools/dumping/processes.py so that they're available via the normal Python API. In cmd_workchain and cmd_calcjob, the commands verdi workchain dump and verdi calcjob dump are only thin wrappers around the respective functions.

_calcjob_dump, by default, dumps the node.base.repository and node.outputs.retrieved using copy_tree, as well as the input nodes of the CalcJobNode if they are of type SinglefileData or FolderData. I started working on the --use-prepare-for-submission option, but as previously indicated by @sphuber (also see here), it's not straightforward to make it work as intended, so I put that on hold for now and added a warning.
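
For illustration, a minimal sketch of that default behaviour (the directory names follow the trees shown below; the function body is simplified and does not reproduce the PR implementation exactly):

from pathlib import Path

from aiida import orm


def _calcjob_dump(calcjob_node: orm.CalcJobNode, output_path: Path) -> None:
    # Raw input files written to the node repository and the retrieved output files.
    calcjob_node.base.repository.copy_tree(output_path / 'raw_inputs')
    calcjob_node.outputs.retrieved.copy_tree(output_path / 'raw_outputs')

    # File-like input nodes of the calculation (e.g. pseudopotentials), dumped per link label.
    for entry in calcjob_node.base.links.get_incoming().all():
        if isinstance(entry.node, (orm.SinglefileData, orm.FolderData)):
            entry.node.base.repository.copy_tree(output_path / 'node_inputs' / entry.link_label)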

For each WorkChainNode and CalcJobNode, a selected set of the node's @property values are dumped to yaml files. extras and attributes can also be included via the CLI. I initially had a function for this, but realized that I'm just passing these arguments all the way through, so I encapsulated that in the ProcessNodeYamlDumper class. Now, an instance is created when the respective CLI command is run, and the arguments are set as instance variables, with the instance being passed through rather than the original arguments. The dump_yaml method is then called at the relevant positions with the process_node and output_path arguments. Regarding relevant positions: to get the node yaml file for every ProcessNode involved, it's called in cmd_workchain and cmd_calcjob for the parent node, and subsequently for the outgoing links. Maybe there's a better way to handle that?
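
For reference, a minimal sketch of the pattern described above; the class and method names follow the comment, but the selected properties and the YAML serialization details are only illustrative, not the PR implementation:

from pathlib import Path

import yaml

from aiida import orm


class ProcessNodeYamlDumper:
    """Dump selected metadata of a ProcessNode to a YAML file."""

    def __init__(self, include_attributes: bool = False, include_extras: bool = False):
        self.include_attributes = include_attributes
        self.include_extras = include_extras

    def dump_yaml(self, process_node: orm.ProcessNode, output_path: Path) -> None:
        metadata = {
            'pk': process_node.pk,
            'uuid': process_node.uuid,
            'label': process_node.label,
            'description': process_node.description,
            'ctime': str(process_node.ctime),
        }
        if self.include_attributes:
            metadata['attributes'] = process_node.base.attributes.all
        if self.include_extras:
            metadata['extras'] = process_node.base.extras.all
        (output_path / 'aiida_node_metadata.yaml').write_text(yaml.dump(metadata))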

The other commands initially mentioned by @qiaojunfeng also seem very interesting and could probably easily be added based on his original implementation, though we should agree on the overall API first.

A few more notes:

  • The cli options for verdi workchain dump and verdi calcjob dump are basically the same and the code is duplicated. We could avoid that by merging it into one, e.g. under verdi node dump, however, as a user, I would find verdi workchain dump and verdi calcjob dump more intuitive. Also, verdi node repo dump uses a former implementation of copy_tree, so I'd go ahead and update that, but in a separate PR (related, do we want to also implement (or change it to) just verdi node dump?)
  • Regarding the yaml dumping, the ProcessNodeYamlDumper class is quite specific and just a little helper to get the job done here. Should we generalize such functionality, e.g. to resolve issue #3521 to allow for dumping of computers/codes to yaml, or keep these things separate?
  • Currently, the pseudo directories are called pseudos__<X>, and I thought about splitting on the double underscore to just have one pseudos directory with a subdirectory for each element. Personally, I'd find that nicer than having a bunch of pseudos__<X>, but I'm not sure if the double underscore is based on general AiiDA name mangling, or specific to aiida-quantumespresso.

Lastly, examples of the (default) outputs obtained from verdi workchain dump:

dump-462
├── 01-relax-PwRelaxWorkChain
│  ├── 01-PwBaseWorkChain
│  │  ├── 01-PwCalculation
│  │  │  ├── aiida_node_metadata.yaml
│  │  │  ├── node_inputs
│  │  │  │  └── pseudos__Si
│  │  │  │     └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
│  │  │  ├── raw_inputs
│  │  │  │  ├── .aiida
│  │  │  │  │  ├── calcinfo.json
│  │  │  │  │  └── job_tmpl.json
│  │  │  │  ├── _aiidasubmit.sh
│  │  │  │  └── aiida.in
│  │  │  └── raw_outputs
│  │  │     ├── _scheduler-stderr.txt
│  │  │     ├── _scheduler-stdout.txt
│  │  │     ├── aiida.out
│  │  │     └── data-file-schema.xml
│  │  └── aiida_node_metadata.yaml
│  ├── 02-PwBaseWorkChain
│  │  ├── 01-PwCalculation
│  │  │  ├── aiida_node_metadata.yaml
│  │  │  ├── node_inputs
│  │  │  │  └── pseudos__Si
│  │  │  │     └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
│  │  │  ├── raw_inputs
│  │  │  │  ├── .aiida
│  │  │  │  │  ├── calcinfo.json
│  │  │  │  │  └── job_tmpl.json
│  │  │  │  ├── _aiidasubmit.sh
│  │  │  │  └── aiida.in
│  │  │  └── raw_outputs
│  │  │     ├── _scheduler-stderr.txt
│  │  │     ├── _scheduler-stdout.txt
│  │  │     ├── aiida.out
│  │  │     └── data-file-schema.xml
│  │  └── aiida_node_metadata.yaml
│  └── aiida_node_metadata.yaml
├── 02-scf-PwBaseWorkChain
│  ├── 01-PwCalculation
│  │  ├── aiida_node_metadata.yaml
│  │  ├── node_inputs
│  │  │  └── pseudos__Si
│  │  │     └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
│  │  ├── raw_inputs
│  │  │  ├── .aiida
│  │  │  │  ├── calcinfo.json
│  │  │  │  └── job_tmpl.json
│  │  │  ├── _aiidasubmit.sh
│  │  │  └── aiida.in
│  │  └── raw_outputs
│  │     ├── _scheduler-stderr.txt
│  │     ├── _scheduler-stdout.txt
│  │     ├── aiida.out
│  │     └── data-file-schema.xml
│  └── aiida_node_metadata.yaml
├── 03-bands-PwBaseWorkChain
│  ├── 01-PwCalculation
│  │  ├── aiida_node_metadata.yaml
│  │  ├── node_inputs
│  │  │  └── pseudos__Si
│  │  │     └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
│  │  ├── raw_inputs
│  │  │  ├── .aiida
│  │  │  │  ├── calcinfo.json
│  │  │  │  └── job_tmpl.json
│  │  │  ├── _aiidasubmit.sh
│  │  │  └── aiida.in
│  │  └── raw_outputs
│  │     ├── _scheduler-stderr.txt
│  │     ├── _scheduler-stdout.txt
│  │     ├── aiida.out
│  │     └── data-file-schema.xml
│  └── aiida_node_metadata.yaml
└── aiida_node_metadata.yaml

and verdi calcjob dump:

dump-530
├── aiida_node_metadata.yaml
├── node_inputs
│  └── pseudos__Si
│     └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
├── raw_inputs
│  ├── .aiida
│  │  ├── calcinfo.json
│  │  └── job_tmpl.json
│  ├── _aiidasubmit.sh
│  └── aiida.in
└── raw_outputs
   ├── _scheduler-stderr.txt
   ├── _scheduler-stdout.txt
   ├── aiida.out
   └── data-file-schema.xml

for a PwBandsWorkChain and one of its involved CalcJobNodes.

GeigerJ2 avatar Feb 28 '24 13:02 GeigerJ2

Thanks a lot @GeigerJ2 , great progress 👍

I started working on the --use-prepare-for-submission option, but as previously indicated by @sphuber (also see here), it's not straightforward to make it work as intended, so I put that on hold for now and added a warning.

The current branch seems to implement it though. Does it not work in its current state?

For each WorkChainNode and CalcJobNode, a selected set of the node's @property values are dumped to yaml files. extras and attributes can also be included via the CLI. I initially had a function for this, but realized that I'm just passing these arguments all the way through, so I encapsulated that in the ProcessNodeYamlDumper class. Now, an instance is created when the respective CLI command is run, and the arguments are set as instance variables, with the instance being passed through rather than the original arguments. The dump_yaml method is then called at the relevant positions with the process_node and output_path arguments. Regarding relevant positions: to get the node yaml file for every ProcessNode involved, it's called in cmd_workchain and cmd_calcjob for the parent node, and subsequently for the outgoing links. Maybe there's a better way to handle that?

I think I would include the dump_yaml call inside recursive_dump (only when the node is a WorkChainNode) and inside _calcjob_dump itself. This way, the user doesn't have to remember to call it manually for the parent node. On that note, why not call them workchain_dump and calcjob_dump? The fact that workchain_dump will be recursive can be made clear in the docstring, and for calcjob_dump it doesn't make sense to be recursive as a CalcJob cannot call other processes.

The cli options for verdi workchain dump and verdi calcjob dump are basically the same and the code is duplicated. We could avoid that by merging it into one, e.g. under verdi node dump, however, as a user, I would find verdi workchain dump and verdi calcjob dump more intuitive.

You can also reduce the code duplication differently. If you were to go for my previous suggestion, then the only code besides the recursive_dump function call is just setting the default for the output path and validation. That can simply be included in the option declaration itself, which can provide the default and do the validation. Then you can define the options once using the OverridableOption class that we use (see aiida.cmdline.params.options.main) and then you can easily reuse the option in both commands.
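
For illustration, a rough sketch of that pattern, assuming hypothetical option and command names (not the ones finally merged):

import click

from aiida.cmdline.params.options.overridable import OverridableOption

# Declare the shared option once.
OVERWRITE = OverridableOption(
    '--overwrite', is_flag=True, default=False, help='Overwrite an existing dump directory.'
)


@click.command('calcjob-dump')
@OVERWRITE()
def calcjob_dump(overwrite):
    """Thin wrapper command reusing the shared option as-is."""


@click.command('workchain-dump')
@OVERWRITE(help='Overwrite existing dumps of all called calculations.')
def workchain_dump(overwrite):
    """Second wrapper command; keyword arguments passed to the option override its defaults."""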

Also, verdi node repo dump uses a former implementation of copy_tree, so I'd go ahead and update that, but in a separate PR (related, do we want to also implement (or change it to) just verdi node dump?)

I would not change the name, because it currently only dumps the content of the repo, so in my mind verdi node repo dump is accurate.

Regarding the yaml dumping, the ProcessNodeYamlDumper class is quite specific and just a little helper to get the job done here. Should we generalize such functionality, e.g. to resolve issue #3521 to allow for dumping of computers/codes to yaml, or keep these things separate?

Yes, I would be strongly in favor of this as opposed to one-off solutions. This PR would actually essentially provide this functionality: https://github.com/aiidateam/aiida-core/pull/6255. It specifies a pydantic model for each ORM entity class (so including ProcessNode), and that allows pydantic to automatically dump the content to JSON (having it do YAML should also be trivial).

Currently, the pseudo directories are called pseudos__<X>, and I thought about splitting on the double underscore to just have one pseudos directory with a subdirectory for each element. Personally, I'd find that nicer than having a bunch of pseudos__<X>, but I'm not sure if the double underscore is based on general AiiDA name mangling, or specific to aiida-quantumespresso.

You could indeed split on the double underscore and nest again, but then you would get paths like pseudos/Si/Si.pbe.blabla.upf. I guess it would come down to preference whether you prefer another level of nesting or keeping the underscores in the folder name.

The double underscores indeed come from some name mangling. Inputs in AiiDA can be nested, e.g. pseudos for the PwCalculation are passed as

inputs = {
    'pseudos': {
        'Si': UpfData()
    }
}

The link labels are flat though, so nested namespaces are represented by double underscores.

In a sense then, splitting on the double underscores and making that a subdirectory would map quite nicely onto nested input namespaces. But for me, either solution would be fine.
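
A tiny sketch of that splitting, for illustration (the node_inputs prefix is just the directory name used earlier in this thread):

from pathlib import Path

# Flat link label for the nested input namespace `pseudos.Si`.
label = 'pseudos__Si'

# Split on the name-mangled double underscore to recover the nesting as subdirectories.
nested_path = Path('node_inputs').joinpath(*label.split('__'))
print(nested_path)  # node_inputs/pseudos/Si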

sphuber avatar Feb 28 '24 17:02 sphuber

Hi @sphuber, sorry for just getting back to you now. Thanks, happy to hear that :)

The current branch seems to implement it though. Does it not work in its current state?

Currently, it's giving this warning:

Warning: CalcJobNode<pk> already has a `remote_folder` output: skipping upload

and the pseudo directory is empty. Though, I have to admit that I haven't looked much further into it until now.

I have just implemented your proposed changes. Thank you for your inputs and pointers. The cli options are now OverridableOptions, and passed to both dumping functions. Though, I was not sure how to set the default for the path at this stage, as it depends on the pk of the process, so instead I added that as a function to cmdline.utils.defaults.

I would not change the name, because it currently only dumps the content of the repo, so in my mind verdi node repo dump is accurate.

I meant also extending the dumping so that not only the repo of the node is dumped. In that case verdi node dump could dump the repo and e.g. retrieved outputs. But that is effectively what verdi calcjob dump and verdi workchain dump are doing now, so no need to change or rename anything.

Yes, I would be strongly in favor of this as opposed to one of solutions. This PR would actually essentially provide this functionality: #6255

Fully agree! The PR looks like quite the extensive effort. Is there any estimated timeline for it? That is to say, would merging here with a temporary solution be reasonable, or better to hold off until PR #6255 is merged. I'm riding along on the pydantic hype train, so if I can be of any help there, please let me know!

The double underscores indeed come from some name mangling.

Good, so my intuition was correct :)

But for me, either solution would be fine.

Same here, don't have a strong opinion on that.

GeigerJ2 avatar Mar 04 '24 14:03 GeigerJ2

Fully agree! The PR looks like quite the extensive effort. Is there any estimated timeline for it? That is to say, would merging here with a temporary solution be reasonable, or better to hold off until PR #6255 is merged. I'm riding along on the pydantic hype train, so if I can be of any help there, please let me know!

There is no fixed timeline, but it is ready for review and has been for a while. @mbercx and @edan-bainglass have both expressed interest in taking a look at this PR in the past, as well as its predecessor #6190. I would be in favor of getting this in soon. Since it is backwards compatible, I don't see major blockers, really.

sphuber avatar Mar 04 '24 19:03 sphuber

Dear all, great work, let me try to be the "bad guy" as much as possible.

I use the case of the data for the CP2K oxide calculations in materialscloud:2023.81.

Suppose we are back 10 years ago: someone like me, "A", was doing the calculations and for some reason putting them online as open access (OA). Suppose also that someone like me, "B", was upset with some CP2K-related results and was willing to check what went wrong.

A would have had in his working space several directories such as HfO2, SiO2, Ti2O3, ... After all calculations had been completed and were ready to be published, he would have run tar -cvzf cp2k_oxides.tgz oxides_dir and put the file online. The tough point here was... do I really have everything well organized? This problem does not exist anymore with AiiDA.

B downloads the tar file, executes tar -xvf cp2k_oxides.tgz expecting to see something like HfO2 somewhere, enters the HfO2 directory and inspects what went wrong.

What I did now with AiiDA was:

  • find an available AiiDA environment and bring the cp2k_oxides.aiida archive file there
  • execute verdi archive import cp2k_oxides.aiida

The two steps above will be replaced (this is a BIG PLUS) by a link to Renku or an equivalent solution.

To inspect the content of the archive:

  • I executed verdi process list -a | grep -y work
  • identified visually the "main workchains" (in this case Cp2kCommonRelaxWorkChain)

Here comes the big disadvantage now compared to 10 years ago: [screenshot omitted]

Path to a solution (a possible query along these lines is sketched after this list):

  • assuming A assigned appropriate descriptions to each workchain, we either create a verdi command (or provide a script.py to be executed via verdi) that returns the pk and description of, e.g., all "main workchains", or the instance activated e.g. in Renku already shows such a list (by automatically running a QueryBuilder script). The command verdi workchain dump pk could then use part of the available description when creating the dump directories.
  • assuming A was lazy, with no descriptions in the workchains, we could still imagine an automated script in Renku (or a verdi run script.py) that, based on the first input StructureData encountered in each "main workchain", provides the user with a guessed description. The same label could then be used by verdi workchain dump.
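
For illustration, a minimal QueryBuilder sketch along these lines (the process_label filter value and the printed columns are just an example for this archive):

from aiida import orm

# List candidate "main workchains" with their pk, uuid and description.
qb = orm.QueryBuilder()
qb.append(
    orm.WorkChainNode,
    filters={'attributes.process_label': 'Cp2kCommonRelaxWorkChain'},
    project=['id', 'uuid', 'description'],
)
for pk, uuid, description in qb.all():
    print(pk, uuid, description)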

cpignedoli avatar Mar 13 '24 07:03 cpignedoli

Thanks @cpignedoli for the writeup. I agree that it can be a bit daunting for a user to start to "browse" the contents of an imported archive. Without a high-level human-readable description that accompanies the archive, it can be difficult for a user to know how to analyse the data, especially if they are not familiar with the plugins used and so don't necessarily know or understand the workchain layouts. I fully agree that it would be great if we could improve this intelligibility automatically somehow.

That being said, I want to caution against implementing any solutions that are going to be too prejudiced by our own typical use cases and that build in assumptions about the structure of archives. For example, in your suggestion, you implicitly assume that all archives are a simple collection of a number of top-level workchains of the same type and that their main input is a StructureData. This is not really the case in the general sense though. You could have an archive with many different types of top-level workchains, and their main input may not be structures.

What I imagine would really help here is the following:

Have a dead easy way to "mount" an AiiDA archive that allows one to inspect it

The Renku functionality is a great example of solving this. We have recently made a lot of progress (and continue to do so) to also make this available locally on a normal computer. It is already possible to create a profile from an archive that doesn't require PostgreSQL. I also have functionality (see https://github.com/aiidateam/aiida-core/pull/6303) that makes RabbitMQ completely optional. With these two pieces in place, users can start inspecting an archive in two steps: pip install aiida-core && verdi profile setup core.sqlite_zip path/to/archive.zip.

It should be easier to browse the provenance graph

Currently, there are a few ways to browse the contents of an archive. Use the various verdi commands, such as verdi group list and verdi process list. Alternatively, the QueryBuilder API can be used to query anything, but this requires coding knowledge and some understanding of the layout of the provenance graph to find what one is looking for. Finally, there is the graphical browser provided by the Materials Cloud. However, the current functionality is quite limited in that it only ever shows a single central node, with just the first layer of inputs/outputs. It would be great if we could have a representation similar to that produced by verdi node graph generate but have it be interactive: allowing one to zoom in/out, click on particular nodes to show more info, etc.

As mentioned, I think we are very close to satisfactorily addressing the first part. It will be trivial to install AiiDA and mount an archive. The biggest step, I think, is to build a fantastic visual browser that can be run locally. I think that there will always remain a need for a top-level human-readable description with archives to make them understandable to others (just like with any data publications online). One thing we could consider is to allow an archive creator to include this metadata, like a description, in the archive itself.

sphuber avatar Mar 13 '24 08:03 sphuber

Hi @sphuber, I fully agree that my oversimplification can be too misleading; what you suggest/anticipate goes perfectly in the right direction, and "easy browsing" for a general case would of course be a great success. I will test pip install aiida-core && verdi profile setup core.sqlite_zip path/to/archive.zip, I was not aware of this, this is great.

cpignedoli avatar Mar 13 '24 10:03 cpignedoli

Great comments, I agree with what has been said. I think indeed that there is some responsibility on whoever puts the archive online to provide minimal guidance to users. This could just be, e.g.:

  1. A csv file attached to the published entry (with uuids and any other relevant info to explain what is "top level"). For instance, in the case of the verification paper, it could be columns with code name, element name, and configuration, followed by the uuid. This is enough for a human to quickly find the relevant workchain and pass it to the dump command by just looking at the csv.
  2. A short script that runs a query, loops over the results, and prints such a csv or an equivalent file/printout. This has the advantage of also providing a bit more context on how the archive is organised, e.g. whether things are in groups, whether extras are used to label nodes, etc. (and it encourages people to put all this info in AiiDA and not just in an external text file). We have good examples of such scripts currently in the README of some MC archive entries (I think MC2D, automated wannierisation, the verification paper, etc.).

The additional thing we can do is to have an automated way for these scripts to autopopulate the first Jupyter notebook cells that appear when loading the archive in, e.g., Renkulab. So people just need to click on a button and execute cells to start browsing.

If we agree on this and implement it, we can then update the MC archive entries of a few of our own repos with such scripts (in the correct format to be autoloaded), as good examples to follow. And then invest a bit in "training/communication": e.g. make a blog post explaining the philosophy behind them and linking to those examples, have looking into the data of others as one of the first steps of the next tutorial we will organise, etc.

We can always come up in the future with more general ways to do this, but trying to find a general solution that works automatically for any archive and for people who don't know AiiDA is very challenging (if at all possible), while I think the approach I suggest is quite practical to implement: the scripts are easy to provide for data owners and easy to use for people who just want to look into the data (again, in the template for the Jupyter notebook you can put both the short querying script that prints out uuids and also some minimal text to guide people on what to do, e.g., "get the uuid from the table above for the system you are interested in, and run this command verdi workchain dump UUID to get...").

giovannipizzi avatar Mar 20 '24 16:03 giovannipizzi

Thanks @GeigerJ2 could you please update your branch, address the conflicts and make sure the tests run?

sphuber avatar Apr 10 '24 06:04 sphuber

Hi @sphuber, thanks for getting back to this here! I have rebased the branch to incorporate the recently merged changes on main (thanks to @mbercx for the assistance on this). I also resolved the conflicts that came up, so the changes of the single commit that adds the dumping functionality should now be nicely isolated and not relative to the older version from when the PR was originally created.

The dumping tests are passing, however, mypy still complains about: https://github.com/qiaojunfeng/aiida-core/blob/4daa2fc0ccbca4f68df06a425a8e5c5c33b2c259/src/aiida/tools/dumping/processes.py#L467 with:

error: Incompatible types in assignment (expression has type "Process", variable has type "CalcJob")  [assignment]

Still gotta figure this one out...

GeigerJ2 avatar Apr 10 '24 14:04 GeigerJ2

error: Incompatible types in assignment (expression has type "Process", variable has type "CalcJob") [assignment]

Honestly, I think this is a case where an ignore statement is justified. We know why it is raising the error, but it is fine for us to do it. We don't have to waste more time on this.

That being said, you have bigger problems though :D Currently the tests are also not running. For Python 3.9 it is because you are using type annotation syntax that wasn't added until Python 3.10. See this build error: https://github.com/aiidateam/aiida-core/actions/runs/8633110347/job/23665395861?pr=6276 You essentially need to add from __future__ import annotations to the top of the file aiida/tools/dumping/processes.py (and possibly others), anywhere you use modern syntax like the union operator str | None or plain built-in generics like list[str] and dict[str, str].
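
For illustration, a minimal sketch of such a compatibility fix (the module content and names below are hypothetical, not taken from the PR):

# With this future import, PEP 604 unions and built-in generics in annotations
# are accepted on Python 3.9 as well, since annotations are no longer evaluated at runtime.
from __future__ import annotations

from pathlib import Path


def process_dump(output_path: Path | None = None, exclude: list[str] | None = None) -> list[Path]:
    """Illustrative signature only; it does not correspond to the actual function in the PR."""
    return []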

But even if you fix that, there are 5 unit tests failing in tests/cmdline/commands/test_process.py. And last but not least, the docs build is failing. There are probably just some warnings about missing references or formatting. You can look in the logs: https://readthedocs.org/projects/aiida-core/builds/24024003/

sphuber avatar Apr 10 '24 15:04 sphuber

Yeah, I agree. At this point, it starts feeling like a bit of a waste of time. I'll add the ignore statement.

Ah, OK, well, that is good to know, at least. Wasn't really aware of that ^^

Yes, the tests in tests/cmdline/commands/test_process.py are failing because the @options.MOST_RECENT_NODE() option for verdi process status somehow got lost when resolving the merge conflicts during the rebase. I have already resolved that locally and will push it with the other changes tomorrow, so that everything hopefully and finally runs through 🚀

GeigerJ2 avatar Apr 10 '24 20:04 GeigerJ2

Alright, readthedocs build still fails, but if somebody has the time and would like to dogfood this, please be my guest :) The added commands are verdi process dump and verdi calcjob dump. @sphuber @mbercx @khsrali @superstar54 @mikibonacci @eimrek

GeigerJ2 avatar Apr 11 '24 16:04 GeigerJ2

Hey guys, I finally started dogfooding this beauty, but quite quickly ran into an issue:

❯ rm -rf *
❯ ls
❯ verdi process dump 62624
Critical: Some files present in the dumping directory. Delete manually and try again.
❯ ls
dump-PwCalculation-62624/

The dump-PwCalculation-62624 directory is empty. I double checked, and my aiida-core is at 3adf0c219.

mbercx avatar May 01 '24 13:05 mbercx

Edit: Fixed in 4657ae132.

Good catch, @mbercx, thank you! Could just reproduce this. I think I've been so used to using the --overwrite option that this bug has crept through. Shouldn't take long to fix!

GeigerJ2 avatar May 02 '24 10:05 GeigerJ2

Thanks @GeigerJ2! Can confirm this fixed the problem, am trying to see if I can break anything else. ^^

One question: was the option to write the inputs through the prepare_for_submission option ever implemented, e.g. in case I want to rerun a pw.x run and have the pseudos in the right folder?

mbercx avatar May 02 '24 10:05 mbercx

Happy to hear that :)

Yes, it was, see here: https://github.com/aiidateam/aiida-core/blob/f49a48e9773dc62b706010a18d22ebc018f06b09/src/aiida/tools/dumping/processes.py#L410-L447

However, as the implementation was very hacky and re-running a multi-step workflow isn't possible without actually re-submitting intermediate steps anyway, we shelved it for now to ensure that this feature makes it into the 2.6 release. We can still add this as an additional option later.

GeigerJ2 avatar May 02 '24 11:05 GeigerJ2