
Loosen the hash computation by not including the core version

Open unkcpz opened this issue 1 year ago • 26 comments

fixes #6214

I removed the redundant version information but kept the major version (TBD); maybe we can remove the core version entirely.

unkcpz avatar Dec 08 '23 13:12 unkcpz

This change makes a lot of sense, would be good to get into 2.5.0 if @sphuber agrees.

danielhollas avatar Dec 11 '23 11:12 danielhollas

I think the idea is definitely interesting and we probably want to apply it, however, I think we need a bit more time to think about any issues related to backwards compatibility. We might want to add a database migration to update all existing nodes. After internal discussion, we slated this most likely for the release following v2.5.

sphuber avatar Dec 11 '23 12:12 sphuber

Hi @sphuber, I have rebased this PR. You can give it a look and we can then find time to discuss. I think https://github.com/aiidateam/aiida-core/pull/6000 includes the required migration. Should we do it separately or in this PR?

unkcpz avatar Jan 11 '24 12:01 unkcpz

I am starting to need this change, since I updated aiida-core in my production environment and nothing is cached anymore. Any idea on how to proceed? @sphuber

unkcpz avatar Feb 08 '24 16:02 unkcpz

I am starting to need this change, since I updated aiida-core in my production environment and nothing is cached anymore. Any idea on how to proceed? @sphuber

To me it is not clear whether this is the right solution and whether it doesn't introduce problems. For starters, if the version is a problem, why keep the major version? If the real thing that matters in caching is the hash of a node, and the hash is simply defined by its content, then if the content doesn't change, the hash also shouldn't change, even if the aiida-core version that created it changes.

Furthermore, even if we reduce the core version to just the major version, the plugin version is still there. That might also change often, and you would still have the same problem. This would push it down to plugins to update their Data plugins to manually exclude that attribute from _get_objects_to_hash. If we think the version is not needed in the hash and really is just making caching useless in many cases, shouldn't we just get rid of it entirely?

And if the answer is yes, I don't think the current solution is the right one. Why change the actual versions that are stored (or even remove them)? That way we are losing valuable information, when we could simply modify NodeCaching._get_objects_to_hash to exclude this key. This would have the same effect, without any information loss.
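The difference between the two approaches can be illustrated with a self-contained sketch; the `compute_hash` helper and the key names below are illustrative stand-ins, not the actual aiida-core implementation. Excluding the version key at hash time keeps hashes stable across releases while the stored version information stays untouched.

```python
import hashlib
import json

def compute_hash(objects: dict) -> str:
    """Hash a deterministic JSON serialization of the given objects."""
    payload = json.dumps(objects, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

# Objects that would feed a node's hash, including the version key.
node_v1 = {'attributes': {'x': 1}, 'version': {'core': '2.4.0'}}
node_v2 = {'attributes': {'x': 1}, 'version': {'core': '2.5.0'}}

# With the version included, identical content hashes differently.
assert compute_hash(node_v1) != compute_hash(node_v2)

# Excluding the version key at hash time restores content-based equality,
# without deleting the version information from the stored node.
strip = lambda objs: {k: v for k, v in objs.items() if k != 'version'}
assert compute_hash(strip(node_v1)) == compute_hash(strip(node_v2))
```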

Finally, there is the question of whether we need a migration if we go through with this. We already changed hashing in v2.4 and added a migration for it. That was the first migration since v2.0, and we might have to add another one in the next release and drop the hashes once more. Not saying we shouldn't do it because of that, but it is something to keep in mind, as it will make it more difficult for users to switch between this new version and older ones.

So I think some discussion is still required before we can go on with this. Thoughts?

sphuber avatar Feb 08 '24 18:02 sphuber

After discussing with @sphuber, we agreed to completely remove the aiida-core version from the hash computation.

In the PR, I didn't remove the plugin version from caching, and I think for the moment we had better keep it there. As mentioned by @GeigerJ2, a false-positive cache hit can happen if the plugin changes its implementation and the calcjob gives different outputs; inside a workchain, the cached calcjob node is taken and the workchain fails. I remember encountering exactly this issue during plugin development, so even when the plugin version exists it cannot help to solve this issue during development. But we need to prevent this from happening to users of a plugin by including the plugin version, so that once a new version of the plugin is released, a calcjob won't be taken from an old source. We can keep loosening this restriction in the future by including only the minor or even major version of the plugin. For now, I assume fully removing the aiida-core version from the hashing is already a bold move.

unkcpz avatar Feb 22 '24 16:02 unkcpz

Before I forget, another thing we might want to change while we are changing the hashing algorithm: it would be useful if _get_objects_to_hash returned a dictionary instead of a list. Sometimes, to debug hashing, it is useful to look at the value returned by this method, but it being a list makes it difficult to identify what each value represents. If we turn it into a dictionary, that would make it easier to identify each part.
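The debugging benefit can be sketched in a few lines; the entry names and the `compute_hash` helper are illustrative, not the actual aiida-core return values. Both forms hash the same way, but only the dict labels each component.

```python
import hashlib
import json

def compute_hash(objects) -> str:
    """Hash a deterministic JSON serialization of the given objects."""
    return hashlib.sha256(json.dumps(objects, sort_keys=True).encode()).hexdigest()

# List form: when two hashes disagree, it is hard to tell which entry differs.
objects_as_list = ['core.int', {'value': 5}, 'repo-hash']

# Dict form: every component is labelled, so a mismatch is easy to pinpoint.
objects_as_dict = {
    'node_type': 'core.int',
    'attributes': {'value': 5},
    'repository_hash': 'repo-hash',
}

# Both forms can feed the same hash function, but when debugging a cache
# miss only the dict tells you what each hashed value represents:
for key, value in objects_as_dict.items():
    print(f'{key}: {value}')
```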

sphuber avatar Feb 23 '24 08:02 sphuber

I am still not 100% clear about the hypothetical case of a plugin update requiring the cache to be invalidated (by the hash being different). Wouldn't the same reasoning also apply to aiida-core? What if the CalcJob base class changes some small (but significant) part of the implementation? Wouldn't that be the same scenario? So why keep the plugin version but drop the core version in that case? And if the changes could happen due to a bug fix (which could be released with a patch version of aiida-core), then just checking the major or even minor version also won't protect against this hypothetical case. So it seems we are back to square one. Yes, keeping the core and plugin versions in the hash might be the safest, but how common are the cases it is trying to protect against? If they are rare enough, should we really invalidate hashes with each (patch) version update, strongly reducing the effectiveness of the caching mechanism?

I have the same question as @sphuber. Ignoring the aiida-core version but invalidating the cache upon a plugin update seems very strange, especially since plugins might undergo more rapid development. It does make sense in the sense that we have tight control over what we do in aiida-core, and if we did make some breaking change, we could always drop the hashes via a DB migration. This would seem more difficult to handle for plugins (not to mention that plugin developers might not be aware of caching subtleties).

In the PR, I didn't remove the plugin version from caching, and I think for the moment we had better keep it there. As mentioned by @GeigerJ2, a false-positive cache hit can happen if the plugin changes its implementation and the calcjob gives different outputs; inside a workchain, the cached calcjob node is taken and the workchain fails. I remember encountering exactly this issue during plugin development, so even when the plugin version exists it cannot help to solve this issue during development.

@unkcpz @GeigerJ2 It would be very useful if you could write down a couple of such scenarios in detail. What are the exact conditions for the breakage? What would actually happen when there is a breaking change? Would the whole workflow crash or silently do the wrong thing? Based on that, I wonder if we could have some mitigation strategies in aiida-core to prevent these issues.

danielhollas avatar Feb 23 '24 14:02 danielhollas

@unkcpz @GeigerJ2 It would be very useful if you could write down a couple of such scenarios in detail. What are the exact conditions for the breakage?

The example I encountered: I ran a workchain of CALCJOB_A -> Dict_FORMAT_A -> CALCJOB_B. CALCJOB_A outputs a dict, and I use values from that dict as inputs for CALCJOB_B (there can be some calcfunction in the middle for provenance, but that doesn't matter for the moment).

  • I run the workchain; CALCJOB_A gets cached, but for some reason CALCJOB_B does not finish successfully and therefore will not be a valid cache source.
  • Now assume I change the output format of CALCJOB_A so that it is different, containing more items, and the newly included items are used by CALCJOB_B.
  • I rerun the workchain, expecting CALCJOB_A to run and generate the new format, but it actually takes the old CALCJOB_A from the cache and gives the old format.
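The three steps above can be simulated in plain Python; the dict formats and `calcjob_b` are illustrative stand-ins for the plugin's calcjobs, not aiida-core API.

```python
# Simulate the scenario: CALCJOB_A's cached output uses the old format,
# but the reworked CALCJOB_B expects a key added in the new plugin version.
old_output_a = {'energy': -1.0}                   # format before the plugin update
new_output_a = {'energy': -1.0, 'forces': [0.0]}  # format after the update

def calcjob_b(inputs: dict) -> float:
    # The updated CALCJOB_B relies on the newly added 'forces' key.
    return sum(inputs['forces'])

# With freshly generated output, everything works...
assert calcjob_b(new_output_a) == 0.0

# ...but if the old CALCJOB_A node is taken from the cache, the
# workchain crashes with a KeyError, as described above.
try:
    calcjob_b(old_output_a)
except KeyError as exc:
    print(f'workchain fails with KeyError: {exc}')
```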

unkcpz avatar Feb 23 '24 14:02 unkcpz

I see, that would indeed fail. There is a deeper problem here. The real cause of the discrepancy in this scenario is the Parser. Although this can be (and in practice almost always is) part of the same plugin package as the CalcJob, it doesn't have to be. But I admit that in practice this is unlikely, as usually the calcjob and parser live in the same package.

Still, in this situation, one could "invalidate" the original CALCJOB_A node so it won't be used for the cache anymore. How common are these scenarios? Would it not be more reasonable to request users invalidate these old nodes if the plugins have changed significantly, rather than making caching practically ineffective by default across the board?

sphuber avatar Feb 23 '24 15:02 sphuber

Still, in this situation, one could "invalidate" the original CALCJOB_A node so it won't be used for the cache anymore. How common are these scenarios? Would it not be more reasonable to request users invalidate these old nodes if the plugins have changed significantly, rather than making caching practically ineffective by default across the board?

This sounds reasonable as long as it is thoroughly documented. How does one currently invalidate the cache for a single node? Is it possible to do through the CLI? If not, we should make it as easy as possible. One disadvantage I see is that the user might not be aware that caching was the reason for the failure. Is there anything that could be done to make that more apparent?

(this is just thinking out loud. I haven't used caching myself yet so perhaps these are non-issues)

danielhollas avatar Feb 23 '24 15:02 danielhollas

Not yet; I planned it in https://github.com/aiidateam/aiida-core/pull/6096 but haven't finished it.

unkcpz avatar Feb 23 '24 15:02 unkcpz

How does one currently invalidate a cache for a single node?

As @unkcpz said, there isn't a CLI endpoint yet, but he has a PoC. In the Python API, you literally just do load_node().is_valid_cache = False. Once the CLI endpoint is there, I think it should be relatively easy for users to control.

One disadvantage I see that the user might not be aware that the caching was the reason for the failure? Is there anything that could be done to make that more apparent?

Not sure, I doubt it though. If you think about @unkcpz's example, a KeyError or AttributeError would probably be raised when the workchain code tries to access an attribute in the Dict node that doesn't exist. Should the aiida-core code then try to catch this exception and start inspecting the code that raises it, to see if it was operating on an output node of a CalcJobNode that was cached? Seems impossible ^^

I did recently improve the verdi process list output to show by default whether a process was taken from the cache or not. So if the workchain fails and they check that, they would see that the last calculation was cached, which might give a hint at least. Not sure how much more we could do. It surely won't be immediately obvious to new users that are not familiar with caching.

sphuber avatar Feb 23 '24 15:02 sphuber

Maybe we can take this as the action plan: we remove both the core and plugin versions. Since caching is not enabled by default, users will not encounter any issue caused by false-positive caching. If a user tries to enable caching, we pop up a warning with a link, something like "SHOULD READ BEFORE USING CACHING".

But sure, adding CLI control for caching is also important, to come to the rescue if weird things happen with caching.
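For reference, caching is typically enabled per profile via `verdi config`, which is where such a warning could be printed; the option name below is an assumption based on aiida-core's configuration options, not something defined in this PR.

```shell
# Enable caching for all process classes on the current profile
# (option name assumed from aiida-core's configuration reference).
verdi config set caching.default_enabled True
```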

unkcpz avatar Feb 23 '24 16:02 unkcpz

If a user tries to enable caching, we pop up a warning with a link, something like "SHOULD READ BEFORE USING CACHING".

I like that idea, we can link to the "Limitations and guidelines" section

https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/provenance/caching.html#limitations-and-guidelines

Note that this section needs to be updated if we go forward with this PR.

The real cause for the discrepancy in this scenario is the Parser. Although this can (and for now almost always practically is) part of the same plugin package as the CalcJob, it doesn't have to be. But I admit that in practice this is unlikely as usually the calcjob and parser live in the same package.

I don't know the CalcJob internals well, but I wonder if there's anything currently not included in the hash that would decrease the chance of a bad cache hit (i.e. something more fine-grained than the plugin version?).

danielhollas avatar Feb 23 '24 16:02 danielhollas

I don't know the CalcJob internals well, but I wonder if there's anything currently not included in the hash that would decrease the chance of a bad cache hit (i.e. something more fine-grained than the plugin version?).

I don't think so. You can take a look (preferably at a calcjob submitted to a remote computer using a non-direct scheduler, which may contain more attributes) by checking node.base.attributes; the items excluding node._updatable_attributes are used for the hash computation.
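That selection rule can be illustrated with a self-contained snippet; the attribute names are made up for the example and the real lists live in aiida-core's node classes.

```python
# Simulate how hashing-relevant attributes are selected: everything in the
# node's attributes except the updatable ones (names are illustrative).
attributes = {
    'input_filename': 'aiida.in',
    'output_filename': 'aiida.out',
    'scheduler_state': 'done',    # updatable: changes during the node's life
    'process_state': 'finished',  # updatable
}
updatable = {'scheduler_state', 'process_state'}

# Only the non-updatable attributes feed the hash computation.
objects_to_hash = {k: v for k, v in attributes.items() if k not in updatable}
print(objects_to_hash)
```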

unkcpz avatar Feb 23 '24 17:02 unkcpz

Do you want to update this PR @unkcpz and remove the version information entirely from the hash calculation? We have given ample opportunity to discuss, and I think we are converging on the conclusion that including the versions makes caching ineffective and that we are willing to accept the slight increase in the risk of false positives.

I would like to include these changes in the upcoming release. I just opened #6321 which fixes a bug in caching. It would be good to include all these changes in one go with a database migration that drops all hashes.

sphuber avatar Mar 17 '24 14:03 sphuber

Would be great to have this, I've just started using caching and it is perfect for our use case.

How common are these scenarios? Would it not be more reasonable to request users invalidate these old nodes if the plugins have changed significantly, rather than making caching practically ineffective by default across the board?

I have to say that I am still uncomfortable about passing this burden on to users, and would like us to think harder about making these cases less probable, since caching bugs are notoriously hard to debug.

The scenario pointed out by @unkcpz could be traced to the change in the parser. I am wondering if, instead of just copying the output nodes, we could run the current parser on top of the finished calculation?

Another scenario would be if the plugin changed the implementation of the function that prepares the input files for the QM program (prepare_submission?). I could imagine, e.g., a plugin changing some default parameters of the ab initio calculations. Is this something we could capture somehow, e.g. by running the function and comparing the input files with the ones from the cached nodes?
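The comparison idea could be sketched as follows; `prepare_input` is a purely illustrative stand-in for a plugin's input-file generation, not an actual aiida-core API.

```python
def prepare_input(params: dict) -> str:
    """Stand-in for a plugin's input-file generation (illustrative only)."""
    return '\n'.join(f'{k} = {v}' for k, v in sorted(params.items()))

cached_input = prepare_input({'ecutwfc': 30})   # written by the old plugin version
current_input = prepare_input({'ecutwfc': 30})  # regenerated by the current version

# Identical inputs: reusing the cached node would be safe.
assert cached_input == current_input

# If the plugin had changed a default, the byte-level mismatch would
# reveal a stale cache hit before the cached node is reused.
changed_input = prepare_input({'ecutwfc': 30, 'smearing': 'gauss'})
assert changed_input != cached_input
```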

Another, somewhat orthogonal question: in aiida-core we can do a DB migration and drop the hashes if the implementation changes. Do plugin authors have any possibility of invalidating the cache in a similar manner, if the plugin version is no longer part of the hash?

danielhollas avatar Mar 17 '24 17:03 danielhollas

The scenario pointed out by @unkcpz could be traced to the change in the parser. I am wondering if, instead of just copying the output nodes, we could run the current parser on top of the finished calculation?

This is unfortunately not really an option, because the parser could rely on files that are retrieved with the retrieve_temporary_list, but these files are not stored in the retrieved node and so cannot be provided to the parser.

Another scenario would be if the plugin changed the implementation of the function that prepares the input files for the QM program (prepare_submission?). I could imagine e.g. plugin changing some default parameters of the ab initio calculations. Is this something we could capture somehow, e.g. by running the function and comparing the input files with the ones from the cached nodes?

This would become very slow, though. The whole premise of the caching mechanism is that to match nodes, a query is performed based on their hash, which is computed just once when the node is stored. With your suggestion, you would now have to find all nodes that match the hash, then actually run prepare_for_submission one by one, and then manually check and compare all input files. This would be a huge performance hit. And then we are still not 100% sure that we capture everything, because the input files generated by prepare_for_submission are not everything that determines the conditions of a calculation. So I would definitely not go in this direction.

Another, somewhat orthogonal question -- in aiida-core we can do a DB migration and drop the hashes if the implementation changes. Do plugin authors have any possibility of invalidating the cache in a similar manner, if the plugin version is no longer part of the cache?

No, they cannot; only aiida-core can add migrations. Users would essentially have to perform a query and invalidate those nodes, e.g. something like

from aiida.orm import QueryBuilder
from aiida_quantumespresso.calculations.pw import PwCalculation  # example calcjob class

for node, in QueryBuilder().append(PwCalculation).iterall():
    node.is_valid_cache = False

sphuber avatar Mar 17 '24 19:03 sphuber

Hi @sphuber, I am going to finish this one today. So in the end, we agree on removing all versions from caching, correct? If so, this PR would then be a tiny one. Also, as mentioned in https://github.com/aiidateam/aiida-core/pull/6215#issuecomment-1961608709, do we require the change to the prompt when activating caching?

unkcpz avatar Mar 18 '24 11:03 unkcpz

I am going to finish this one today. So in the end, we agree on removing all versions from caching, correct?

I did suggest going ahead with removing all versions from the hash, but then @danielhollas voiced concerns. I responded to his suggestions, explaining why I think they are not feasible. Not sure if those counter-comments were satisfactory.

Also mentioned in #6215 (comment), do we require the change on the prompt when activating the caching?

Not sure to be honest. It would require a special exception in Config.set_option which feels a bit dirty. I think this should simply be part of the caching documentation.

sphuber avatar Mar 18 '24 12:03 sphuber

I did suggest going ahead with removing all versions from the hash, but then @danielhollas voiced concerns. I responded to his suggestions, explaining why I think they are not feasible. Not sure if those counter-comments were satisfactory.

I am fine with the proposed change; the purpose of voicing concerns was to make us think a bit harder about how to make the situation better. :-) I was aware that the proposed ideas were probably not going to fly, but was hoping they would trigger some ideas from @sphuber, who is knowledgeable about the implementation details. :-)

tl;dr @unkcpz please go ahead with the implementation in this PR. Also, finishing the CLI caching control in #6096 would be great. Any further work is better done separately anyway.

The whole premise of the caching mechanism is that to match nodes, a query is performed based on their hash, which is computed just once when the node is stored. With your suggestion, you would now have to find all nodes that match the hash, and then actually one-by-one start running the prepare_for_submission and then manually checking all input files, comparing them. This would be a huge performance hit.

I think the biggest argument against this is the implementation complexity. I wonder if there is some middle ground that could catch most issues (e.g. just checking the default input file?).

I am not convinced that this would be a big performance hit. In most cases, prepare_for_submission would only be called on a single node that matched, and its cost should be negligible compared to the cost of the actual QM calculation?

(perhaps we should move this discussion to Discourse and focus on the implementation in this PR)

danielhollas avatar Mar 18 '24 13:03 danielhollas

tl;dr @unkcpz please go ahead with the implementation in this PR.

👍 Will do it.

In most cases, prepare_for_submission would only be called on a single node that matched

I sort of agree with @danielhollas and think this would bring a bit more robustness to the caching mechanism. In the end, there may not be too many matching cached nodes found anyway.

unkcpz avatar Mar 18 '24 15:03 unkcpz

Hi @sphuber, I think this is ready for the final review.

unkcpz avatar Mar 19 '24 12:03 unkcpz

Thanks @unkcpz. I gave the changes a quick look and they look OK to me. I would just first want to make sure that what I suggested in this comment would provide the necessary control for plugin developers to essentially reset the cache if the implementation of a Data or CalcJob plugin changes significantly. I will try to make a PR tonight/tomorrow, and then @danielhollas and yourself can maybe take a look at it. If that seems sufficient, we can merge both changes.

sphuber avatar Mar 19 '24 21:03 sphuber

I just opened #6328 as a draft PR that attempts an implementation, but as I describe in detail there, it seems not so easy after all...

sphuber avatar Mar 20 '24 08:03 sphuber

Thanks for taking care of this @sphuber, and sorry for not being active on AiiDA stuff this month.

unkcpz avatar Apr 17 '24 09:04 unkcpz