
Caching: ph process failed if previous pw.x process finished_ok and cached but remote folder cleaned.

Open unkcpz opened this issue 2 years ago • 46 comments

Describe the bug

This problem happens when chaining a PwBaseWorkChain and a PhBaseWorkChain in a workflow, using the QE plugin with caching enabled for aiida.calculations:quantumespresso.pw. The quantumespresso.ph CalcJob requires the remote folder of the quantumespresso.pw run as input, but when the pw process finishes through the cache and its remote folder has been cleaned, the quantumespresso.ph calculation keeps failing since it cannot find the saved files it needs for the calculation.

Steps to reproduce

Steps to reproduce the behavior:

This is my workflow code to calculate the phonon frequencies; it runs an SCF calculation and sets scf_remote_folder for the subsequent ph step.

# imports assumed for this snippet (not shown in the original excerpt)
from aiida import orm
from aiida.common.exceptions import NotExistentAttributeError
from aiida.engine import ToContext, WorkChain


class PhononFrequenciesWorkChain(WorkChain):
    """WorkChain to calculate the cohesive energy of the input structure"""
    @classmethod
    def define(cls, spec):
    ....

    ....
    def inspect_scf(self):
        """inspect the result of scf calculation if fail do not continue."""
        workchain = self.ctx.workchain_scf

        if not workchain.is_finished_ok:
            self.report(
                f'PwBaseWorkChain failed with exit status {workchain.exit_status}'
            )
            return self.exit_codes.ERROR_SUB_PROCESS_FAILED_SCF

        try:
            self.ctx.scf_remote_folder = workchain.outputs.remote_folder
        except NotExistentAttributeError:
            return self.exit_codes.ERROR_NO_REMOTE_FOLDER

    def run_ph(self):
        """
        Set the inputs and submit the ph calculation to obtain the quantities needed for the phonon evaluation.
        """
        inputs = {
            'metadata': {
                'call_link_label': 'PH'
            },
            'ph': {
                'code': self.inputs.ph_code,
                'qpoints': self.ctx.qpoints,
                'parameters': orm.Dict(dict=self.ctx.ph_parameters),
                'parent_folder': self.ctx.scf_remote_folder,
                'metadata': {
                    'options': self.ctx.options,
                },
                # CMDLINE setting for parallelization
            },
        }

        running = self.submit(PhBaseWorkflow, **inputs)
        self.report(f'Running ph calculation pk={running.pk}')
        return ToContext(workchain_ph=running)

Expected behavior

The caching mechanism should also check the remote folder: if it finds that the required remote folder has been cleaned, it should not use the cached process but rerun the previous calculation.

Your environment

  • Operating system [e.g. Linux]:
  • Python version [e.g. 3.7.1]:
  • aiida-core version [e.g. 1.2.1]:

Other relevant software versions, e.g. Postgres & RabbitMQ

Additional context

unkcpz avatar Oct 18 '21 10:10 unkcpz

The caching mechanism should also check the remote folder: if it finds that the required remote folder has been cleaned

This check may not be so trivial to actually implement. Even if there are still some files, they may no longer be complete or may have been altered. So you could do a simple check to see if the folder is empty, but you would still miss cases where only some files have been deleted. Still, this may already catch many cases and prevent the problem you describe, so it is certainly something we can consider. The only downside I see is that this check requires opening a transport, and this brings the usual problem of making sure it is properly pooled (which is not trivial); it will of course also incur a significant overhead, slowing everything down. We should analyze whether this is worth baking in, or whether we put the onus of all of this on the user. If they know that a calculation is no longer a valid cache because its remote has been cleaned, they either mark it as non-cacheable or temporarily disable caching while relaunching the workchain.

it should not use the cached process but rerun the previous calculation.

This is also problematic. I think it is impossible to write generic code in aiida-core that ensures that any workchain running a CalcJob, whose inputs are outputs of another CalcJob, reruns that preceding CalcJob if its remote_folder is "dirty". This would require dynamically inserting new steps into the outline logic. If we were to add support for the caching mechanism to include a notion of a remote_folder being dirty, then I think we should put that on the level of the calculation being cached. So in this example, when the PwCalculation is being submitted, the caching mechanism should already check whether the remote_folder is still intact, and if not, conclude that the node is no longer a valid cache source.

sphuber avatar Oct 18 '21 14:10 sphuber

Thanks @sphuber !

The problem comes from the fact that the caching mechanism does not know about updates to the remote folder. So there are two main ways to work around this. First, check the remote_folder before using the cache. Second, for this case, maybe I can change the hash of the process when I call clean(). I can put this in my workflow, and if it makes sense, maybe we can implement it in the _clean() method and always do it after cleaning the remote folder.

unkcpz avatar Oct 19 '21 10:10 unkcpz

First, check the remote_folder before using the cache.

Sure, but as I said, what are you going to check for? Just checking whether it is empty is not sufficient

Second, for this case, maybe I can change the hash of the process when I call clean(). I can put this in my workflow, and if it makes sense, maybe we can implement it in the _clean() method and always do it after cleaning the remote folder.

Also here, this will "fix" the case where we clean the remote from AiiDA, but typically, since these remote folders live on a scratch space, they will be automatically cleaned outside of AiiDA as well, and so then we cannot rely on the remote_folder being tagged somehow.

What I am saying is that in specific use cases there are workarounds to fix it, but I don't see an easy way forward for a generic fix.

sphuber avatar Oct 19 '21 12:10 sphuber

First, check the remote_folder before using the cache.

Sure, but as I said, what are you going to check for? Just checking whether it is empty is not sufficient

Yes, I totally agree with you. That's why I propose the following workaround.

Also here, this will "fix" the case where we clean the remote from AiiDA, but typically, since these remote folders live on a scratch space, they will be automatically cleaned outside of AiiDA as well, and so then we cannot rely on the remote_folder being tagged somehow.

Sure, there are other ways, outside of AiiDA, in which the remote folder can be cleaned. But remote_folder._clean() will always change the remote folder, so it is safe to say that it changes the process. My point is that this can be a generic fix for the _clean() method: every time we call clean() we change the input remote_folder hash so the process is not cached. The same code can be included in other operations that definitely modify the remote folder.

unkcpz avatar Oct 19 '21 14:10 unkcpz

Every time we call clean() we change the input remote_folder hash so the process is not cached. The same code can be included in other operations that definitely modify the remote folder.

I think what you mean is that if the remote_folder output node of a CalcJobNode is cleaned, then it should change (probably simply remove, as that is currently the unofficial way of invalidating a process node as a valid cache source) the hash of the CalcJobNode, and not of the RemoteData node. When considering whether a calculation should be cached from, only its inputs are taken into account, not its outputs. Or am I misunderstanding you?

sphuber avatar Oct 19 '21 15:10 sphuber

I think what you mean is that if the remote_folder output node of a CalcJobNode is cleaned, then it should change (probably simply remove, as that is currently the unofficial way of invalidating a process node as a valid cache source) the hash of the CalcJobNode, and not of the RemoteData node.

Yes, that is what I mean. We should have a way to mark a CalcJobNode as 'not usable as a caching source' if its RemoteData node is cleaned.

Every time we call clean() we change the input remote_folder hash so the process is not cached

I was wrong about this, remote_folder is a RemoteData output node. Thanks for clarifying this.

unkcpz avatar Oct 19 '21 15:10 unkcpz

As I mentioned earlier in the meeting, I think the best approach is that, if you are concerned by this, you should have some periodic check that goes into the relevant folders and invalidates the cache for the parent CalcJob. I think this amounts to removing the _aiida_hash extra from the node. The thing we are missing on the AiiDA side, and which we can add, is a Node API to do this (rather than fiddling with the _aiida_* extras, whose behaviour might change in the future). So this would be good to add.

Regarding the cleaning, indeed maybe when one cleans a RemoteData node one could also call the cache invalidation of the parent; I don't know if it's better to do this at the AiiDA level (maybe an optional parameter invalidate_parent_cache=True in the clean() method, that will do nothing if there is no CalcJob caller for the RemoteData), and/or to suggest doing this for the various plugins that implement cleaning logic.
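
For illustration, a minimal sketch of that idea, assuming the RemoteData.creator property, the private _clean() method, and the _aiida_hash extra removal mentioned above as the (currently unofficial) invalidation mechanism; the helper name clean_and_invalidate is made up:

from aiida import orm

def clean_and_invalidate(remote: orm.RemoteData):
    """Clean a remote folder and drop the hash of the CalcJob that created it,
    so that the calculation is no longer picked up as a cache source."""
    creator = remote.creator  # the CalcJobNode that produced this RemoteData, or None
    remote._clean()  # private API, as used by several workchains' clean-up steps
    if creator is not None:
        # unofficial invalidation; an official API is proposed in #5189
        creator.delete_extra('_aiida_hash')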

giovannipizzi avatar Oct 20 '21 10:10 giovannipizzi

See #5189 for adding an official API to disable a node for caching.

sphuber avatar Oct 20 '21 11:10 sphuber

Another problem arises when I rerun the workchain with caching off and do not clean the remote folder. After turning caching back on, what I expect is that the new node will be used as the cache source, but the caching mechanism still uses the old invalid node rather than the latest one.

So another 'good to have' feature for this case would be that, when there is more than one cached node (all with the same inputs) that could be used, the latest one is always picked. I think this would be a more generic feature.

unkcpz avatar Oct 20 '21 11:10 unkcpz

I see. Question to discuss: is using the latest node always the 'right' thing to do? (Maybe yes, but happy to get feedback, e.g. from @greschd). Maybe this should be configurable? (I mean in 'verdi config', per profile, where maybe the default is 'pick the latest' but you have options to e.g. pick the oldest, if we think there's a usecase for that).

Otherwise, one other thing to do is to define a function to search for all nodes that have the same hash and remove the hash key for those that match certain rules (e.g.: remove the hash from all duplicates except the newest?) Or this could create problems (maybe we should do this only for CalcJob nodes and not for Data nodes?).
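
As a rough sketch of that duplicate clean-up, assuming the cache lookup keys on the _aiida_hash extra and that we keep only the newest CalcJobNode; the helper name is made up:

from aiida.orm import CalcJobNode, QueryBuilder

def keep_only_newest_cache_source(target_hash: str):
    """Of all CalcJobNodes sharing the given hash, keep only the newest one as a
    valid cache source and drop the hash extra from the older duplicates."""
    query = QueryBuilder()
    query.append(CalcJobNode, filters={'extras._aiida_hash': target_hash}, tag='calc')
    query.order_by({'calc': {'ctime': 'desc'}})
    for node in query.all(flat=True)[1:]:  # skip the most recent node
        node.delete_extra('_aiida_hash')  # older duplicates no longer hit the cache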

giovannipizzi avatar Oct 21 '21 08:10 giovannipizzi

I think the "logical" place to control this would be the is_valid_cache property, see https://aiida.readthedocs.io/projects/aiida-core/en/latest/internals/engine.html#controlling-caching

The main question is: should all calculations which have a "disappeared" RemoteData be considered invalid w.r.t. caching, or does the plugin need some control over that?

The main reason this currently can't be controlled by the plugin is because all calculations have the same node type[1] -- but it might make sense to add a hook with which CalcJob classes can modify this behavior. Either specific to this problem (analogous to the exit codes "invalidates cache" functionality), or by calling out to a method implemented on the plugin CalcJob.
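
For reference, the exit-code hook mentioned above exists today: a CalcJob can mark exit codes with invalidates_cache=True so that nodes failing with them are not used as cache sources. A minimal sketch (class name, code number, label and message are made up for illustration):

from aiida.engine import CalcJob

class SomeCalculation(CalcJob):

    @classmethod
    def define(cls, spec):
        super().define(spec)
        # a node that fails with this exit code will not be considered a valid
        # cache source for future runs
        spec.exit_code(
            310,
            'ERROR_MISSING_REMOTE_FILES',
            invalidates_cache=True,
            message='Files required for the restart were missing from the parent folder.',
        )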

greschd avatar Oct 22 '21 16:10 greschd

I remember running into this problem and spending a lot of time thinking about it without being able to find any sort of silver-bullet solution. I guess the main issue can be summarized as: relying on a RemoteData node (one of the few data node types in AiiDA whose integrity is not guaranteed) for chaining calcjobs within a workflow makes the concept of "cache-validity" relative to conditions outside of the cached calcjob. A few unfinished ideas that came to mind and may be relevant:

(0) Besides implementing possible working solutions, it would already be very helpful if there were warnings and error messages that would allow the user to realize what the problem is. I remember part of what drove me crazy was that I spent a lot of time and manual manipulation until I realized AiiDA wasn't finding my files because of this (it wasn't even on my mind, since I had enabled caching a long time before running into that problem). Maybe something that, whenever you get an error in the copy step, checks if the input has some RemoteData that comes from a cached node, or something like this, and mentions this in the error log.

(1) I think we need some method(s) to check the integrity of RemoteData nodes. Something that could work is to make these sealable and, at the time of sealing (which would be when the calcjob gets sealed), store within the node a list of the files in the folder plus hashes of the content of each file (a rough sketch of this idea follows after point 4).

(2) It is only at the moment of submitting the new processes that one knows which calcjobs are going to need RemoteData nodes. Thus it would be nice, instead of having to say "this old node is invalid", to be able to say "this new run I want to do can't be cached" (either at all, or only if the integrity check of the RemoteData fails). Does something like this currently exist? (Without using global cache settings, I mean.)

(3) If we have a workflow that goes CJ1 -> RemoteData -> CJ2 and both CalcJobs CJ1 and CJ2 already exist, then the "whole process" should be considered cached. But within the workflow, if we have the previous feature 2 and we selected CJ1 as a must-redo, then we are doing at least that CalcJob unnecessarily. Moreover, I'm not sure what would happen to CJ2: would it also run again, or would it be taken from the cached node (i.e. does a different ParentFolder input imply a different hash?). For this, perhaps it would also be good if the workflow designer could use a get_cached_node that would allow them to do something like:

def get_cached_calls():

    CJ1_builder = ...

    CJ1_oldnode = get_cached_node(CJ1_builder)
    if CJ1_oldnode is None:
        return None

    CJ2_builder = ...  # (set inputs from CJ1_oldnode)

    CJ2_oldnode = get_cached_node(CJ2_builder)
    if CJ2_oldnode is None:
        return None

    return (CJ1_oldnode, CJ2_oldnode)

if get_cached_calls() is not None:
    # run workchain without passing the feature-2 option to force re-runs; everything will be cached
    ...
else:
    # run workchain passing the feature-2 option to force re-runs on all
    ...

(4) The previous point 3 is basically running the workflow caching everything until the point where you can't cache one of the calls, in which case you go back and start re-running everything. In principle this could be automated by AiiDA without the user needing to write the method themselves since it only "imitates" the rest of the workchain. Moreover, using the sub-process call structure and some sort of recursive logic, it could even be possible to identify and separate branches of subprocesses that are ok to be re-used (since no RemoteData is needed from them) and branches that need to be re-run.

All of these points I say very lightly, but I realize each of them is actually very complex (especially 4). But it is a complex problem in itself, and an important one: otherwise, I think the usefulness of caching will be reduced dramatically as workflows get bigger if there is no way to somehow be more "context-aware" when deciding if something should be re-used or re-run.
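
A rough sketch of the bookkeeping in point (1), assuming RemoteData.listdir() (top-level file names only; content hashes would require transferring the files) and node extras for storage; both helper names are made up:

from aiida import orm

def record_remote_manifest(remote: orm.RemoteData):
    """Store the current top-level file listing of the remote folder as an extra."""
    manifest = sorted(remote.listdir())  # opens a transport to the remote computer
    remote.set_extra('file_manifest', manifest)

def remote_looks_intact(remote: orm.RemoteData) -> bool:
    """Return True if all files recorded earlier are still present remotely."""
    expected = set(remote.get_extra('file_manifest', []))
    return expected.issubset(set(remote.listdir()))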

ramirezfranciscof avatar Oct 25 '21 10:10 ramirezfranciscof

Thanks @ramirezfranciscof @greschd @giovannipizzi for all the inspiring ideas. All of them deserve a closer look.

I just opened a minor PR https://github.com/aiidateam/aiida-core/pull/5197 to add the feature of caching from the latest CalcJobNode; I think that is somewhat a different issue (a workaround for this one) and better discussed separately. I would be very glad if you could have a look at it.

unkcpz avatar Oct 25 '21 18:10 unkcpz

Great points @ramirezfranciscof!

(1) yes, absolutely we need a way to check for RemoteData integrity

(2) there is a context manager to temporarily switch caching off, but that is a very limited solution

(3)+(4) I think that's really the crux of the problem: if we want "perfect" caching behavior with RemoteData, it's a problem that cannot be decided at the time when a calculation is (or isn't) cached. It's a function of the downstream processes whether the RemoteData will be used again. Some way of "backtracking" seems the only way to properly address this.

otherwise, I think the usefulness of caching will be reduced dramatically as workflows get bigger if there is no way to somehow be more "context-aware" when deciding if something should be re-used or re-run.

I think one way to somewhat mitigate this with the current code is to design workflows such that there aren't long "chains of RemoteData". Self-contained parts which pass data along as RemoteData can still be cached as a whole, as long as all that the downstream processes consume is the persisted data. To me, this makes intuitive sense: if the data is going to be used by a far-away process, it should be persisted. If we imagine solution (4) being implemented, it would still be a good design decision: it limits how many calculations would need to be re-done if a RemoteData that is being used has vanished.

does a different ParentFolder input imply a different hash

yes (haven't checked now, but from memory that's what it should do)

greschd avatar Oct 25 '21 18:10 greschd

Just wanted to say that any sort of code that tries to look ahead in the call stack to see which sub-processes will be cacheable is essentially not possible. Leaving aside the complexity of having to parse arbitrary Python code within the workchain outline logic to see exactly what is being submitted with what inputs, it will probably be impossible to even know exactly which inputs will be used (therefore making it impossible to compute the hash), as the inputs are often created during the workflow, either as a result of a previous process or created inline. I don't want to be pessimistic, but I really doubt this will ever be feasible. The less ambitious plan of running the workflow with caching and backtracking when needed is a lot more feasible, but still extremely challenging.

Instead, I would focus more on the simpler solutions suggested:

  • I like the idea of storing a representation of the contents of a RemoteData after the creating CalcJob has finished. With the new repository implementation, we could simply use the repository_metadata column to store the virtual hierarchy. This can then be used to see if the hierarchy is intact and on top of that check that the hashes of the files are still the same. My main worry is whether the overhead of this checking would be too big for it to be running by default and so maybe it should be something to be turned on explicitly for those who need it.
  • The context managers to control caching are nice to provide granular control but the main limitation is that it is not compatible with submitting to the daemons since it only applies to the current interpreter. This makes the functionality essentially useless in most cases. Maybe we can add an input to the metadata of processes that can also specify these caching configuration overrides. This would make the overrides available to all daemons and the user can specify for each subprocess which types of processes should be cached or not.
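
For reference, a sketch of those context managers (from aiida.manage.caching); as noted, they only affect processes launched from the current interpreter:

from aiida.engine import run
from aiida.manage.caching import disable_caching, enable_caching

# only affects processes launched from this interpreter; anything already
# handled by the daemon is not influenced by these overrides
with enable_caching(identifier='aiida.calculations:quantumespresso.pw'):
    run(PwCalculation, **inputs)  # assumes PwCalculation and inputs are defined elsewhere

with disable_caching(identifier='aiida.calculations:quantumespresso.pw'):
    run(PwCalculation, **inputs)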

sphuber avatar Oct 25 '21 21:10 sphuber

This can then be used to see if the hierarchy is intact and on top of that check that the hashes of the files are still the same. My main worry is whether the overhead of this checking would be too big for it to be running by default and so maybe it should be something to be turned on explicitly for those who need it.

In general, checking the hierarchy should be much cheaper than computing a hash over what are often very large files. So maybe hierarchy checking can be turned on by default -- although you still have a round-trip to the remote computer in there. Since this happens when deciding if an entire calculation should be launched, it seems ok to me.

Maybe we can add an input to the metadata of processes that can also specify these caching configuration overrides.

I like that idea, but wonder if we then might need to add different kinds of configuration to really make it useful, for example excluding a specific calculation, or turning on/off the aforementioned RemoteData check. That would also allow setting defaults for this behavior in workflows. A user that doesn't want to deal with these issues (and is willing to pay some compute time for it) could globally turn on the RemoteData check.

greschd avatar Oct 25 '21 21:10 greschd

I like the idea of storing a representation of the contents of a RemoteData after the creating CalcJob has finished. With the new repository implementation, we could simply use the repository_metadata column to store the virtual hierarchy.

I don't think we should override the default behaviour of RemoteData; I feel this is feature creep for a mainly "power-user" use case:

  1. It is perfectly feasible for there to be many 1000s of files generated in a remote folder. If we now have to store all this path information for every computation, it is going to add a lot of unnecessary bloat to the database for most use cases.
  2. Definitely do not use repository_metadata; that is intrinsically linked to the repository and treated as such by maintenance and export/import functions.

I could certainly envisage a CachedRemoteData class, which can be explicitly used for this purpose by workchains that know they need this feature, although I haven't given any thought yet to how it would be generated and to the interplay between calculation and workchain. I could also envisage some way to specify "up-front" which files you actually want to cache, rather than blanket-caching every file/folder.

chrisjsewell avatar Oct 26 '21 04:10 chrisjsewell

It is perfectly feasible for there to be many 1000s of files generated in a remote folder. If we now have to store all this path information for every computation, it is going to add a lot of unnecessary bloat to the database for most use cases.

This is the concern I also expressed: for the problem of caching I can see this as currently the best way forward, but I am not sure we should make it the default behavior because of the probably significant overhead, both in terms of processing power and storage.

Definitely do not use repository_metadata; that is intrinsically linked to the repository and treated as such by maintenance and export/import functions.

Maybe indeed not the best idea. What I was mostly thinking was to reuse the structure of the repository_metadata, as in the dictionary form of virtual hierarchy with (optionally) hashes of file content. We could decide to store it as an attribute.

sphuber avatar Oct 26 '21 06:10 sphuber

yep, just wanted to highlight the storage overhead (on top of the processing one) 👍

chrisjsewell avatar Oct 26 '21 06:10 chrisjsewell

To address any performance or storage issue we could just have a default behavior that stores a hash of the file hierarchy instead of the hierarchy itself. We may later want to introduce more detailed options (such as creating a hash with the file structure + file content, or having different hashes for each file), but I think these are not critical for what we need here at this point. Bah, except maybe we would want the option to enable saving the whole hierarchy just for doing sanity checks when debugging something, since I can already see silly problems being almost impossible to detect if you only have a hash to compare...

Just wanted to say that any sort of code that tries to look ahead in the call stack to see which sub-processes will be cacheable is essentially not possible. Leaving aside the complexity of having to parse arbitrary Python code within the workchain outline logic to see exactly what is being submitted with what inputs, it will probably be impossible to even know exactly which inputs will be used (therefore making it impossible to compute the hash), as the inputs are often created during the workflow, either as a result of a previous process or created inline. I don't want to be pessimistic, but I really doubt this will ever be feasible. The less ambitious plan of running the workflow with caching and backtracking when needed is a lot more feasible, but still extremely challenging.

Yeah, I agree with this; my item (3) was just an example of how someone could manually try to fix this in their own workchains (plus good practices in "workchain modularization", like @greschd mentioned) if we had some base tools (a way to check RemoteData integrity, a way to know if a process can be stashed, a way to force re-running in specific submits). But it is probably not a model for how to design a more automated solution (for which I concur that we could leverage some of the already existing infrastructure).

ramirezfranciscof avatar Oct 26 '21 12:10 ramirezfranciscof

Thanks for the very interesting discussion.

Personally, because of the concerns expressed above, I would try to avoid making this very complex.

And I would avoid storing which files are in the remote data unless we're sure this can really help (BTW I agree we shouldn't store them in the repository_metadata, at worst with the same structure as attributes [even if we need to define updatable attributes for data nodes then, as these might change over time] or just as extras [might be the best approach]).

But I would still avoid implementing this at this stage:

  • it can make the DB much larger
  • I would not like that, when I submit a calculation, AiiDA automatically opens a connection and checks these files (at least at the moment, as I'm not sure the connections would be batched, so if I'm submitting thousands of jobs via the daemon, e.g. as subprocesses of workflows, AiiDA would start opening a lot of SSH connections without much control from the user). In addition, this can take some time and, while for DFT calculations that's probably OK, for calculations that take seconds the time cost would be larger than actually re-running, especially if there are many files (listing those can easily be very slow, and checking checksums even more so).
  • there are many valid conditions in which some files might have been removed but are not needed for restarting, and those that are needed are still there so one in principle can still use caching (e.g. one just copies over/stashes some files and brings them back some time later, but not all of them, only those actually needed).

In the end, the decision should be plugin- and user-specific, and I wouldn't make it at runtime. My suggestion is to discuss first what the workflow for a user would be, and implement easy things that cover 90% of the use cases, rather than making a perfect solution that is too complex to implement.

For instance, to start thinking already: would removing the hash when folders are cleaned already help in most cases?

Or creating helper functions and methods that e.g.:

  1. given some inputs, don't submit, but just return all possible calculations that are equivalent, and maybe some info on which remote data nodes would need to be checked
  2. this can be paired with some functions provided by the plugins, that can validate if a remotedata is still valid (e.g. check if certain files are there)
  3. the pairing can make sure that it checks all remotedata folders in a single connection, and cleans the caching when it discovers they are not valid anymore.
  4. also the first point of this itemised list could work in batch mode (e.g. you can return a list of remote data to check, but not giving the inputs of a calculation, but rather giving a Group, and the function will check the whole provenance of its nodes, and invalidate all RemoteData that are not good anymore).

This last one would actually be convenient to run e.g. on a weekly (or even daily) basis. I imagine this use case: I'm a developer, I'm using caching, and I just want to be sure I don't reuse nodes that are not valid. If cleaned remote dirs invalidate the corresponding calcs, that's already good; in addition, I can run the script at point 4 above before I submit new calcs - this will maybe take some minutes (and it can skip those that have already been marked as "checked and invalid"), but after that I can just run and caching will do the right thing?
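
A sketch of what such a batch check could look like, assuming the plugin supplies the remote_is_valid check and reusing the hash-extra removal from above (the function name is made up):

from aiida import orm

def invalidate_stale_caches(group_label: str, remote_is_valid) -> None:
    """For every CalcJobNode in the group whose remote_folder no longer passes the
    plugin-provided check, drop its hash so it is no longer a cache source."""
    group = orm.load_group(group_label)
    for node in group.nodes:
        if not isinstance(node, orm.CalcJobNode):
            continue
        try:
            remote = node.outputs.remote_folder
        except AttributeError:  # no remote_folder output recorded
            continue
        if not remote_is_valid(remote):
            node.delete_extra('_aiida_hash')  # unofficial invalidation, cf. #5189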

Finally, regarding the suggestion not to use "too much" remote data - I'm not sure. We should probably explain the consequences to developers, but I feel there are cases where this cannot (and maybe should not) be avoided, to avoid having too much data in the AiiDA repo just for checkpoint files. E.g. for long-running (month-long) ab-initio MD runs, it should be easy for a user to store a checkpoint every week, say, e.g. via stashing, but I wouldn't like the plugin to store all wave functions at every step by default. I am not sure of the perfect balance, or whether this can be made automatic (maybe it should be the role of the workflow on top to decide what to store and/or stash, and not of the calculation, in this case?).

giovannipizzi avatar Oct 26 '21 14:10 giovannipizzi

I just encountered another problem with caching that is a little different from the scenario I initially reported. Now, my workchain, which runs a PwCalculation and then a PhCalculation, has finished ok. I assumed another run of the same workchain would use the cached nodes throughout and quickly finish without any problem. But the catch is that, even though CJ1 -> RemoteData -> CJ2 finished, I have another calculation CJ1' which is identical to CJ1 but was run standalone, without the subsequent calculation CJ2. The second run of the workchain tries to use CJ1' as the caching source, which makes the workchain fail since I have already cleaned the remote folder of CJ1'.

I think the "backtracking" may be the only way to solve this problem. But maybe we can have another level of 'process' just for caching purpose for the calculation like PhCalculation which always required a remote_folder as the input parameter. We hash this node also with the information of its previous step process identification like the inputs or just the hash of the previous PwCalculation?

@sphuber This is the issue I mentioned in the AiiDA meeting. I looked into it and found that it is not the problem of presubmit running even when the node was cached. Sorry for the confusion.

In this case, it is confirmed that, under the current implementation, using the latest node for caching is not a good idea.

unkcpz avatar Dec 15 '21 11:12 unkcpz

Thanks for double-checking @unkcpz . Good to know that it isn't the case, because that would have been a bad bug.

Before looking into backtracking, I think the more direct and simple approach is still to actively mark a node that should no longer be used for caching, using the is_valid_cache property, which I added in 7cea5bebd15737cb7ae0f4f38dc4f8ca61935ae8. I know that this is not a perfect solution, because typically one will run into at least one failed calculation before figuring out that a cache source should be marked as invalid, but I think the backtracking will be particularly complex to implement.

sphuber avatar Dec 15 '21 17:12 sphuber

Thanks @sphuber. Do you think it is possible to implement a way to customize the hash of a CalcJobNode? In this case, it would be for the PhCalculation, since its parent_folder is always read from the remote_folder of a PwCalculation. Could I hash the PhCalculation node without using the remote_folder node, and instead define the hash my own way, using the hash of the PwCalculation node?

unkcpz avatar Dec 15 '21 17:12 unkcpz

I am not sure I follow what you are trying to say @unkcpz . Even if it was possible to change how the hash of the PhCalculation is computed, that won't solve your problem. The problem you have is that in a workchain that runs a PwCalculation and then a PhCalculation, when the PwCalculation is executed, it is taken from the cache, but the source is not a PwCalculation that was run in that work chain before, but some other one (whose remote folder happens to have been cleaned). The decision of AiiDA to take a particular caching source for this PwCalculation is based on the hash of the PwCalculation and is completely independent of the upcoming PhCalculation. So changing the hash of the PhCalculation won't change anything. If anything you need to change the hash of the PwCalculation

sphuber avatar Dec 16 '21 12:12 sphuber

Let me try to clarify my idea; it is a bit tricky, I think, but I am pretty sure the logic will work.

Suppose the workchain contains two calculations. In the first run there is a PwCalculation and then a PhCalculation, which uses the remote_folder of the PwCalculation as its parent_folder input. So we have: PwCJ_0:hash_0 -> remote_folder_0 -> PhCJ_0:hash_0. In principle, if we turn caching on and run the workchain again, the correct cached nodes are used. Then we have: PwCJ_1:hash_0 [from cache] -> remote_folder_0 -> PhCJ_1:hash_0 [from cache].

But in my case there is another standalone PwCalculation which has the same hash value but a different remote_folder output, that is: PwCJ_1':hash_0 [from scratch] -> remote_folder_1. The problem happens when I run the workchain a second time expecting the PhCalculation to use the cache: since the remote_folder, one of the inputs of the PhCalculation, has changed, the cache is not used for this PhCalculation. So, in order to let the PhCalculation find the desired node, my idea is that when creating the hash for the PhCalculation we do not use the parent_folder, but instead use the hash of the previous PwCalculation step as one of the objects that generate the hash of the PhCalculation. With this modification, the second run of the PhCalculation will find the cached node to use.

If this is not clearly described, maybe we can set up a quick meeting to discuss it? @sphuber

unkcpz avatar Dec 16 '21 14:12 unkcpz

use the hash of the previous PwCalculation step

In this case, aren't the two hashes of PwCJ_1 and PwCJ_1' the same? The "safe" option discussed previously, of ignoring for caching purposes any PwCalculation whose RemoteFolder output no longer exists, should still work here, but this obviously means that we will re-do the work even if the PhCalculation could also be cached.

greschd avatar Dec 16 '21 15:12 greschd

Thanks for the comment! @greschd

In this case, aren't the two hashes of PwCJ_1 and PwCJ_1' the same?

Yes, those are the same; this is why I think we can use the hash of the PwCalculation to build the hash of the PhCalculation, rather than the remote_folder, which will differ.

but this obviously means that we will re-do the work even if the PhCalculation could also be cached.

I do not quite understand what you mean here.

unkcpz avatar Dec 16 '21 15:12 unkcpz

I see the problem now. It seems indeed that the hash of the remote folder of the PwCalculation is different in the two runs, and so the PhCalculation does not get a cache hit. The reason is the _get_objects_to_hash method:

    def _get_objects_to_hash(self) -> List[Any]:
        """Return a list of objects which should be included in the hash."""
        assert self._repository is not None, 'repository not initialised'
        top_level_module = self.__module__.split('.', 1)[0]
        try:
            version = importlib.import_module(top_level_module).__version__  # type: ignore[attr-defined]
        except (ImportError, AttributeError) as exc:
            raise exceptions.HashingError("The node's package version could not be determined") from exc
        objects = [
            version,
            {
                key: val
                for key, val in self.attributes_items()
                if key not in self._hash_ignored_attributes and key not in self._updatable_attributes  # pylint: disable=unsupported-membership-test
            },
            self._repository.hash(),
            self.computer.uuid if self.computer is not None else None
        ]
        return objects

For a RemoteData node, this returns something like:

['2.0.0a1',
 {'remote_path': '/home/sph/.scratch/f0/83/18fe-de81-449f-9e50-f98287b5f695'},
 '076c4a75b0e7f3de89d0f21d3c4fad9c09fb83054f8e6c467b875095d93fd154',
 '6a3e7df5-79a9-45f3-ad4b-a25e08af92b2']

The last entry is the UUID of the computer and should be the same if the calculation is run on the same machine (that makes sense). The one just before it is the hash of the repository contents, but that is computed over the local repository content, which is empty by definition for a RemoteData. Then there is the remote_path attribute, and this is where the problem lies: it is going to be different for remote_folder_0 and remote_folder_1. Maybe we should simply remove this from the hash for RemoteData nodes?

I am trying to see if there are cases where we should actually include it because it should mean that the hash of the associated calcjob node is different if it was run in a different folder.

sphuber avatar Dec 16 '21 15:12 sphuber

If we decide that it should not be included, the fix is easy. We simply add

    @classproperty
    def _hash_ignored_attributes(cls) -> Tuple[str, ...]:
        return super()._hash_ignored_attributes + ('remote_path',)

to the RemoteData class.
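
If we decide to ignore remote_path, two RemoteData nodes differing only in path (same computer, empty local repository) would then hash identically, which one could verify along these lines (a sketch, reusing the nodes from the scenario above):

# remote_folder_0 and remote_folder_1 are the two RemoteData nodes from the
# scenario above, on the same computer but pointing at different directories
assert remote_folder_0.get_hash() == remote_folder_1.get_hash()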

sphuber avatar Dec 16 '21 15:12 sphuber