Caching: Add `Node.CACHE_VERSION` to node attributes
Recently, the hashing mechanism was changed to remove all version information, both of `aiida-core` and, if applicable, of the plugin, from the hash computation. The reason was that with that information included, each version change in core or in a plugin would essentially reset the cache. By removing the version information, updating the code no longer invalidates the entire cache.
The downside of this change, however, is that when the implementation of a plugin, most notably a `CalcJob` plugin, changes significantly, such that identical attributes/inputs would now produce different content, this is no longer reflected in the hash, which remains identical. This could lead to false positives when users update certain plugins.
To give plugin developers some control in case of changes where the hash effectively has to be reset, the `Node` class now defines the `CACHE_VERSION` class attribute. By default this is set to `None`, but a `Data` plugin can change it to some string value. To provide the same API for process plugins, such as `CalcJob`s, the `CACHE_VERSION` attribute is also added to the `Process` base class. The value is automatically added to the node's attributes under the key `_aiida_cache_version`, such that it is included in the computed hash. This way, a plugin developer can change this attribute in their plugin to force the hash to change.
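As a minimal sketch of the intended usage, assuming the `CACHE_VERSION` attribute described above (the `MyData` class is hypothetical):

```python
from aiida.orm import Data

class MyData(Data):
    # Hypothetical plugin: changing this value changes the
    # `_aiida_cache_version` attribute stored on new nodes, and
    # therefore their computed hash.
    CACHE_VERSION = '1'
```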
The version has to be stored in the attributes of the node and cannot just be taken from the plugin's class attribute directly when computing the hash, because in that case, when the hash is recalculated in the future, a potentially different value of `CACHE_VERSION` could be used (whichever version of the plugin happens to be installed at that time).
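A sketch of the pitfall, under the assumption of a simplified hashing helper (this is not the actual aiida-core implementation):

```python
from aiida.common.hashing import make_hash

def compute_node_hash(node, objects_to_hash: list) -> str:
    # Wrong: reading type(node).CACHE_VERSION here would use whatever
    # plugin version happens to be installed at rehash time.
    # Right: use the value frozen in the node's attributes at store time,
    # so recomputing the hash later reproduces the original value.
    frozen_version = node.base.attributes.get('_aiida_cache_version', None)
    return make_hash([objects_to_hash, frozen_version])
```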
@danielhollas @unkcpz I tried to implement something that would still allow plugin developers to control cache invalidation if their plugins change significantly, after we remove all version information from the hash calculation. However, it is not as easy as it seemed.
Essentially, we need to allow a plugin to define some class attribute that enters the hash computation. This has to be stored on the node somehow, though, and cannot just be taken from the class when the node gets stored, because otherwise, when the hash of the node is recalculated in the future, it would take the current cache version of the plugin class, which may have been updated. Since it is important information, I didn't want to store this "cache version" in the extras, as they are mutable and the value could get lost. So the attributes are the only remaining place. However, this has all sorts of knock-on consequences: until now, AiiDA never automatically stored "internal" values in the attributes column. This assumption is now broken, which breaks all sorts of other parts of the code. For example, `Dict.get_dict()` will now include this automatically set `_aiida_cache_version` attribute. I can update the code to explicitly ignore it, but this is really a workaround, and as we can see, other tests are still failing due to this change. It is not clear to me yet whether this is acceptable, but it is a pretty significant departure from the status quo, so I am really hesitant. Not sure if there is another solution that I am missing for us to permanently store this cache version.
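To illustrate the `Dict.get_dict()` knock-on effect described above (the values are hypothetical; the leaking key is the problem being described, not the desired behavior):

```python
from aiida.orm import Dict

node = Dict({'energy': -1.5})
node.store()
# If aiida-core silently writes `_aiida_cache_version` to the attributes,
# the returned content is no longer just what the user stored:
# node.get_dict() -> {'energy': -1.5, '_aiida_cache_version': '1'}
```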
We have had talks in the past about having "internal" attributes that are set only by `aiida-core`'s internals and never by plugins. This would require a significant change to the node model and the ORM. Not really keen on having to do that at this point, but maybe that is the only proper way of solving this issue. Thoughts?
Yeah, I realized late last night the hash recalculation will make things tricky :-( Just thinking, doesn't this also apply to the current design that stores aiida and plugin versions in the hash?
> Just thinking, doesn't this also apply to the current design that stores aiida and plugin versions in the hash?
It does, because currently the `Node` class automatically adds the module version:
https://github.com/aiidateam/aiida-core/blob/1059b5f2d365e0f2ea4dea4d3c9343ce77829cfe/src/aiida/orm/nodes/caching.py#L67
which would be removed by #6215.
Note that for calcjobs the version of both core and plugin is also stored in the attributes. But since that is fixed, it doesn't suffer from the same problem when rehashing. However, #6215 also proposes to remove this from the hash calculation.
> Note that for calcjobs the version of both core and plugin is also stored in the attributes. But since that is fixed, it doesn't suffer from the same problem when rehashing.
Sorry, I don't follow, can you expand on this part? If those versions are stored (under which key?), why is adding an extra field problematic?
> Sorry, I don't follow, can you expand on this part? If those versions are stored (under which key?), why is adding an extra field problematic?
The important part here is that the behavior differs between data and process nodes. The `Process` class will automatically store version information in the attributes of its `ProcessNode`, see:
https://github.com/aiidateam/aiida-core/blob/69389e0387369d8437e1219487b88430b7b2e679/src/aiida/engine/processes/process.py#L713
This is fine for process nodes, since those cannot be subclassed by plugins, so plugins cannot control what is stored in the attributes anyway. This is different for data plugins, though, where the attributes are essentially the one thing that plugins do control. As we have seen, the `Dict` node, for example, uses the attributes to store its content. If `aiida-core` now starts "polluting" this with internal attributes, plugins no longer have full control.
I see, that is subtle indeed. So if I understand correctly, the approach in this PR could work for Processes, because you could modify aiida-core to add it to ProcessNode attributes? Or does this have unintended side effects as well?
> I see, that is subtle indeed. So if I understand correctly, the approach in this PR could work for Processes, because you could modify aiida-core to add it to ProcessNode attributes? Or does this have unintended side effects as well?
Hmm, perhaps we could indeed only add it to `ProcessNode`s, because that is already more or less what we are doing. The problem is really only with `Data` nodes. I was focused on adding it there as well, because the `Data` implementation could also change. But I now remember that I already argued a few weeks ago that the implementation for `Data` plugins is arguably less important, and there the hash ultimately really is defined just by the data contents. So maybe adding this cache version just to process nodes is sufficient for our use case, and we avoid all the problems of polluting the `Data` attributes namespace.
The use case is clear to me now after a first review pass. Just one comment.
> I didn't want to store this "cache version" in the extras, as they are mutable and the value could get lost.
I think this can be a very straightforward implementation: the hash is already stored in the extras, so I don't see why it would be a problem to also store the cache version in the extras.
> But I now remember that I already argued a few weeks ago that the implementation for `Data` plugins is arguably less important, and there the hash ultimately really is defined just by the data contents.
Indeed, the direct purpose is to avoid changes in `Process` input/output interfaces bringing about false positives.
> Yeah, I realized late last night the hash recalculation will make things tricky :-(
@danielhollas If I understand correctly, there will be no hash recalculation: if the plugin updates the cache version, new `ProcessNode`s will simply have a different hash, and the old `ProcessNode`s become unreachable for new calculations. If the plugin version is rolled back, the old `ProcessNode`s become available again.
Yeah, that makes a lot of sense! I guess I don't even understand how caching works for a Data node, or what the purpose of that would even be? I thought of caching as always tied to Processes?
> By default this is set to `None`, but a `Data` plugin can change it to some string value.
Any specific reason for this to be a string? An `int` seems simpler and more straightforward to think about: evicting the cache would simply be bumping it by +1.
> I think this can be a very straightforward implementation: the hash is already stored in the extras, so I don't see why it would be a problem to also store the cache version in the extras.
Well, there's a difference in consequences when these extras are dropped. If the hash is dropped, the node simply won't be available for caching (right?), so you get a false negative for a cache hit. But if the cache version somehow got dropped, you could get false positive cache hits (but I guess only after the hash would also be recalculated?). These considerations do seem a bit academic. At the same time, adding it to the `Process` attributes does not seem any more difficult in terms of implementation?
> @danielhollas If I understand correctly, there will be no hash recalculation
Sure, but it's possible that, due to a DB migration induced by `aiida-core`, we would ask users to recalculate their hashes manually (as we did last time, although whether that was necessary is up for discussion, I guess).
> Yeah, that makes a lot of sense! I guess I don't even understand how caching works for a Data node, or what the purpose of that would even be? I thought of caching as always tied to Processes?
Data nodes are not really cached in that sense; they just have a hash, which is used in the computation of the process' hash.
> Any specific reason for this to be a string? An `int` seems simpler and more straightforward to think about: evicting the cache would simply be bumping it by +1.
No particular reason. We could even accept either. For now I haven't really thought about limitations/validation of this class attribute; I first wanted to see if the generic concept would be possible.
> If I understand correctly, there will be no hash recalculation: if the plugin updates the cache version, new `ProcessNode`s will simply have a different hash, and the old `ProcessNode`s become unreachable for new calculations. If the plugin version is rolled back, the old `ProcessNode`s become available again.
@unkcpz The problem here, as Daniel has pointed out, is that the public API (and the CLI as well) have methods to clear the hash from the extras. On top of that, the hash is just stored in the extras, which can be (accidentally) deleted. If at that point you have to recalculate the hash, you'd better make sure you have all the data that went into the original hash. So we need to permanently store this specific cache version (and cannot rely on the extras, as they can be accidentally removed).
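A short illustration of that fragility (the pk is hypothetical; the `_aiida_hash` extra and the `rehash` method are the existing aiida-core machinery):

```python
from aiida.orm import load_node

node = load_node(1234)  # hypothetical pk
node.base.extras.delete('_aiida_hash')  # extras are mutable: the hash can vanish
# Recomputing must reproduce the original hash, so everything that went
# into it (including the cache version) must be stored immutably.
node.base.caching.rehash()
```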
I have updated the implementation to only change the behavior for `Process` plugins. They can set the `CACHE_VERSION` class attribute, which will get set on the `ProcessNode` in the `version.cache` attribute. For now I put it there, as it integrates nicely with the other version information of core and the plugin. This would just force @unkcpz's implementation to change slightly: he can no longer drop the `version` attribute in its entirety from the objects that go into the hash, but the `version.cache` subkey needs to be kept. An alternative would be to not nest it in the `version` attribute, but have it top-level in `version_cache` or something like that.
I briefly went through the implementation and it mostly makes sense to me.
Now it got me thinking: should `aiida-core` use the same mechanism for invalidating caches, instead of forcing a DB migration? E.g., instead of putting the aiida version itself in the hash, have a `CACHE_VERSION` defined somewhere (effectively globally). A DB migration is a chore (I suspect), but the biggest issue is that it is not reversible.
> Now it got me thinking: should `aiida-core` use the same mechanism for invalidating caches, instead of forcing a DB migration? E.g., instead of putting the aiida version itself in the hash, have a `CACHE_VERSION` defined somewhere (effectively globally). A DB migration is a chore (I suspect), but the biggest issue is that it is not reversible.
Very interesting idea. I think this might actually work and would indeed get rid of having to add a database migration to reset the cache in case the implementation changes. I don't see any immediate problems with it, but as always, the devil is in the details with caching. I think the best thing would be to simply try, so I will give it a shot and see what comes up.
Thought the idea over a bit. Although it would properly "invalidate" existing caches, since the hash of future processes with identical inputs will now be different due to the changed `CACHE_VERSION` attribute, the stored hash of existing nodes would remain and be incorrect. It is therefore feasible, albeit very unlikely, that a future process produces a hash identical to an existing one despite having different inputs. So this approach might be pragmatically functional, but theoretically it is incorrect and leaves margin for false positives.
We might say that this theoretical problem is so unlikely that it doesn't really justify the inconvenience that migrations add. However, if we were to accept this solution, we would be back at the previous problem: we would have to store this global cache version somewhere on a node when it gets stored, so that its hash can later be recomputed. But with the current database layout, that could only be stored in the attributes, and that causes problems for the `Dict` class, and potentially other `Data` plugins. So we are back to square one.
I am therefore leaning towards just keeping database migrations to invalidate hashes if `aiida-core` significantly updates the hashing algorithm, and in this PR adding the capability for plugin developers to invalidate the hashes of process plugins (i.e., effectively `CalcJob`s).
pinging @danielhollas
Sorry, I'm at a conference this week. I need to think this through a bit; I'll try to reply on Wednesday if I am able, by Friday at the latest.
> Although it would properly "invalidate" existing caches, since the hash of future processes with identical inputs will now be different due to the changed `CACHE_VERSION` attribute, the stored hash of existing nodes would remain and be incorrect.
Hm, I am not sure I fully follow this scenario. What exactly do you mean by "incorrect" in this case? It seems to me that the hash will be consistent with the `CACHE_VERSION` that it was created with, no?
> It is therefore feasible, albeit very unlikely, that a future process produces a hash identical to an existing one despite having different inputs.
This seems to describe a hash collision? Those should be vanishingly unlikely, right? I mean, you can get a collision regardless of this scheme by pure bad luck? But I might be misunderstanding what you're trying to say.
> However, if we were to accept this solution, we would be back at the previous problem: we would have to store this global cache version somewhere on a node when it gets stored, so that its hash can later be recomputed. But with the current database layout, that could only be stored in the attributes, and that causes problems for the `Dict` class, and potentially other `Data` plugins. So we are back to square one.
I thought that above we decided we don't really have to worry about the Data nodes, since they are, by construction, defined by their hash? I guess you're right that in the case of aiida-core it might not be the same situation, if the core hashing algorithm is somehow changed.
> I am therefore leaning towards just keeping database migrations to invalidate hashes if `aiida-core` significantly updates the hashing algorithm, and in this PR adding the capability for plugin developers to invalidate the hashes of process plugins (i.e., effectively `CalcJob`s).
I think for this PR let's go with that. We can always decide to rethink/improve the aiida-core side later, I guess.
I agree that hash collisions are always a risk, but since they are inherent to hashing algorithms and extremely unlikely, we shouldn't worry about them. However, what I was trying to sketch out is that hash collisions are to be expected and fine as long as your hashing algorithm doesn't change. But here, we were talking about the scenario where the hashing implementation in `aiida-core` changes. It is a bit more complicated in `aiida-core`, because our algorithm is composed of the actual hashing algorithm (currently blake2b) and the selection of the node's data that actually goes into the hash. So far we have mostly, in fact only ever, changed the latter part, but we could imagine changing the actual hashing algorithm as well. What I was thinking in my comment above is that hash collisions are acceptable as long as you stay within the same algorithm. However, when we change the objects to hash, we are changing the algorithm, and then collision risks no longer follow the rules of the underlying hashing algorithm, right?
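A conceptual sketch of those two ingredients (simplified, not the actual aiida-core code):

```python
import hashlib

def node_hash(objects_to_hash: list) -> str:
    """The effective 'algorithm' has two parts: (1) the selection of node
    data that enters the hash, i.e. which objects end up in
    `objects_to_hash`, and (2) the hash function itself (blake2b in
    aiida-core). Changing either one changes the resulting hashes."""
    return hashlib.blake2b(repr(objects_to_hash).encode()).hexdigest()
```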
Anyway, there is a more practical objection, I think. Let's go back to why we want to introduce a mechanism to invalidate hashes. The reason we have been removing hash extras through database migrations is not to prevent false positives, but rather to make sure that caching keeps working (we just didn't think that including the versions would fully undo this work 😅). If the hashing algorithm changes, the computed hash for a data node with attributes `{x}` would be different. A calculation that has that node as input would therefore not be cached, even though it should be, because the data is exactly the same. This is why we were resetting hashes: to make sure that old data nodes are properly matched under the new hashing algorithm. So the reset is not to invalidate the existing data node hashes, but to make sure they are up to date.
This is exactly the opposite motivation from the feature we are discussing adding here, which is to give `CalcJob` plugins control to invalidate their hash if their implementation changes such that the outputs are expected to change even for identical inputs.
Conclusion: I think we should keep using database migrations if `aiida-core` changes the hashing algorithm, to ensure that existing data nodes remain valid cache sources, but also provide a `CACHE_VERSION` attribute for `CalcJob` plugin writers to invalidate existing nodes.
> This is exactly the opposite motivation from the feature we are discussing adding here, which is to give `CalcJob` plugins control to invalidate their hash if their implementation changes such that the outputs are expected to change even for identical inputs.
That's a very good point. :+1: Agreed.
I am not sure what is the state of the current PR, feel free to ping me for review.
> I am not sure what is the state of the current PR, feel free to ping me for review.
I think the current summary of the situation is as follows:

- Including the version (core and plugin) in the hash computation makes caching effectively useless, so we remove it (PR #6215).
- Removing the version, however, runs the risk of false positives when `CalcJob` plugin implementations change significantly.
- Therefore, an attribute is added to the `CalcJob` class that is included in the objects that go into the hash. This attribute can be changed by plugin implementations to effectively reset the cache. (This PR)
- Whenever the hashing algorithm in `aiida-core` is changed, a database migration is necessary to reset all hashes and recompute them. Without it, the existing cache would essentially be unusable.

This is currently implemented by PR #6215 and this PR. So I think this is ready for review.
I just thought of one more potential complication. Adding a cache version to the `CalcJob` class may not be sufficient. The outputs of a calcjob are really the result of the generation of input files from input nodes through the `CalcJob`, as well as the creation of the outputs through the `Parser`. So far we never included the parser version separately, because the argument was that it would typically be part of the same plugin package as the `CalcJob`, and the package version was included in the hash calculation. This is now getting removed, but we are only putting the version of the `CalcJob` back in. I think we should add a similar attribute for the `Parser`.
> I think we should add a similar attribute for the `Parser`.
Yeah, unfortunate but reasonable.
At some point I was also thinking about whether we could somehow detect Parser/CalcJob changes automatically via some code introspection. But even if we were able to do that, it would kind of defeat the purpose, since even non-functional changes would affect the caching behaviour.
@danielhollas @unkcpz I have now implemented the final state of the discussion. It essentially adds the `CACHE_VERSION` class attribute to the `Process` and `Parser` classes. Plugin developers can now change this in their `CalcJob` and `Parser` plugins to reset the cache. If these versions are set, they will be written to the `CalcJobNode` attributes together with the version of `aiida-core` and the plugin package, i.e. `node.base.attributes.get('version')` would return something like:
```python
{
    'core': '2.6.0',
    'plugin': '3.0.0',
    'cache': 'some-value',    # value of `CalcJob.CACHE_VERSION`
    'parser': 'other-value',  # value of `Parser.CACHE_VERSION`
}
```
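For concreteness, a sketch of a plugin setting both values (the class names are hypothetical; the `CACHE_VERSION` attributes are the ones introduced here):

```python
from aiida.engine import CalcJob
from aiida.parsers import Parser

class MyCalcJob(CalcJob):
    CACHE_VERSION = 'some-value'   # ends up under `version.cache`

class MyParser(Parser):
    CACHE_VERSION = 'other-value'  # ends up under `version.parser`
```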
Some open questions/action points:

- The `CACHE_VERSION` is currently added to the `Process` base class, but would really probably only be used by the `CalcJob` class. Technically, calcfunctions are also considered for caching, since they are calculation processes and not workflows, but there the process class `FunctionProcess` is created dynamically, so the developer cannot set the `CACHE_VERSION`. Then again, they wouldn't have to, because the function source code is included in the hash calculation, I think, so changing it would already invalidate the cache anyway. So maybe it would make more sense to move `CACHE_VERSION` to the `CalcJob`? Less "correct" in the generic sense, but it certainly makes a whole lot more sense practically speaking.
- The key `cache` for the `CalcJob.CACHE_VERSION` also makes little sense. If we agree to move the attribute from the `Process` to the `CalcJob` class, then just naming this `calc_job_cache_version` and changing `parser` to `parser_cache_version` would be clearest, right?
- Currently there is no run-time validation of the `CACHE_VERSION`; I have type hinted it as `str | None`. I don't see a reason why we should be restrictive in principle, unless this unrestrictiveness actually makes it more difficult for the user to adopt a sensible versioning strategy? I am not sure what the best approach would be here.
- These changes would still require #6215 to be changed. Currently that drops the entire `version` attribute dictionary from the hash calculation, but it would have to keep the `cache` and `parser` keys. That is easy enough to do, though, once we sign off on the concept.
- Once signed off, I will add documentation, of course.
> So maybe it would make more sense to move `CACHE_VERSION` to the `CalcJob`?
+1 If we realize we need it to be more generic, we can always lift it up the class chain.
> Then just naming this `calc_job_cache_version` and changing `parser` to `parser_cache_version` would be clearest, right?
Seems ok to me.
> Currently there is no run-time validation of the `CACHE_VERSION`; I have type hinted it as `str | None`. I don't see a reason why we should be restrictive in principle, unless this unrestrictiveness actually makes it more difficult for the user to adopt a sensible versioning strategy? I am not sure what the best approach would be here.
I would vote to make it an `int`, to make it clear that any versioning strategy other than "bump this integer by one" doesn't make sense, since the purpose of this version is essentially only to invalidate the cache (e.g. semver doesn't make sense). I don't have a strong opinion about whether we should type-check this at runtime.
> Currently that drops the entire `version` attribute dictionary from the hash calculation, but it would have to keep the `cache` and `parser` keys. That is easy enough to do, though, once we sign off on the concept.
I am wondering if it would be clearer to have a separate attribute, `cache_version` or something like that. That would make it easier to see that we do not hash anything inside `version`, and the current code in #6215 would work as is and wouldn't need to get more complicated.
> I would vote to make it an `int`, to make it clear that any versioning strategy other than "bump this integer by one" doesn't make sense, since the purpose of this version is essentially only to invalidate the cache (e.g. semver doesn't make sense). I don't have a strong opinion about whether we should type-check this at runtime.
Done. For now there is no validation though. I am not sure we need it.
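For illustration, the final form might look as follows in a hypothetical plugin, assuming the `int | None` type just agreed upon:

```python
from aiida.engine import CalcJob

class MyCalcJob(CalcJob):
    # Bump by one whenever the implementation changes such that identical
    # inputs would produce different outputs; this invalidates the cache.
    CACHE_VERSION: int | None = 1
```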
> I am wondering if it would be clearer to have a separate attribute, `cache_version` or something like that. That would make it easier to see that we do not hash anything inside `version`, and the current code in https://github.com/aiidateam/aiida-core/pull/6215 would work as is and wouldn't need to get more complicated.
Very good point, I like it. I have implemented it.
@danielhollas this should be ready for final review. I have added the docs as well.