Gaffer icon indicating copy to clipboard operation
Gaffer copied to clipboard

FederatedGraphStore double-caching causes desync issues in replicated deployment

Open sw96411 opened this issue 3 years ago • 9 comments

FederatedGraphStore maintains both a uk.gov.gchq.gaffer.cache.Cache of the various Graphs managed by a federated store, and it's own private Map<FederatedAccess, Set<Graph>> named 'storage' (the latter storing deserialised objects).

There is no mechanism to ensure that the Map reflects any changes in the Cache. This could be caused, for instance, by a deployment scenario where the customer is running multiple instances of Gaffer and trying to use a shared CacheLoader implementation to share data between them. This is demonstrated (mostly. Sorta) by the unit test in https://github.com/sw96411/Gaffer/commit/15a2cdd60f8aa59bdb16674e2b963d3fe62864f8

The trivial solution is simply to abandon the use of the private Map and always deserialise the objects from the Cache on demand. I'm reluctant to suggest this, though, as it would greatly increase the cost of calling any of the methods on FederatedStore.

Other options available include refreshing the Map from the Cache at a maximum frequency (even once per second would in theory save a lot of deserialisation), writing a serial number to the cache on modification and then testing this serial number on read, or creating a specialisation of Cache which can record the last time the Cache (or part thereof) was modified and then delegating the handling of this issue off to that somehow.

I'd be happy to PR in a solution based on any of the above suggestions, but before I do, I wanted to check with the collective:

  1. Is this scenario (multiple "hot spare" instances of Gaffer collaborating around a shared cache) even meant to be supported?
  2. How does this issue interact with https://github.com/gchq/Gaffer/issues/2435? Is this effectively a duplicate?

Tagging in @GCHQDev404 as I see they have made commits to related issues and @m29827 as they were the last editor on FederatedGraphStorage.

Grateful for thoughts from any of the community, of course!

sw96411 avatar Jul 05 '21 15:07 sw96411

  1. sharing caches is not supported, hasn't been a requirement. I would need some more information in and around why a shared cache would be required between multiple Gaffer Instances.

  2. There is PR for gh-2435 (it passes locally) any work carried out on this branch likely will need to merged.

GCHQDev404 avatar Jul 05 '21 23:07 GCHQDev404

examine commit https://github.com/gchq/Gaffer/commit/1324d92d1ef20d42f2f0f3191c30be81b11e6c75

GCHQDev404 avatar Jul 05 '21 23:07 GCHQDev404

Thanks @GCHQDev404. For clarity, I think instead of "Shared Caches" I should probably have said something like "CacheLoaders with a common persistent store".

This is required whenever you have multiple instances of gaffer running over the same data for resiliency or scalability. I.e. imagine you have at least two instances of Gaffer behind a load balancer, running over the same data. A user creates a graph on instance A, and then another user queries for the list of all graphs on instance B. We want (with maybe some acceptable level of lag) the user querying on B to be aware of the graph created on A.

Again, I'm still not 100% sure this is something Gaffer actually claims to support right now, although it seems like a reasonable requirement. But might make this a feature request rather than a bugfix and hence affect priority.

Re: the included commit. Thanks v much for this, it's the kind of thing I was thinking of. One thought: If a graph has been changed, rather than added or deleted (or a graph has been deleted and then another with the same name added), then will this cause OverWritingExceptions from the checkExisting and validateExisting methods?

Sorry for not giving you a unit test for this, I'm writing this between meetings but will try to drop one in later today if it's still relevant

(ETA clarification on "Shared Caches")

sw96411 avatar Jul 06 '21 08:07 sw96411

@sw96411 Please examine the latest branch. https://github.com/gchq/Gaffer/tree/gh-2457-double-caching-issue It has 1 remaining failing test, that requires fixing. But the solution has removed the inner storage map from FederatedGraphStorage and I think is 99% complete. It also contains the Table renaming branch, this might need to be teased out before merging.

GCHQDev404 avatar Jul 11 '21 23:07 GCHQDev404

Just leaving a quick note based on our offline conversations to say that following an e2e test, this code in itself looks good, but performance issues arise, at least partly due to https://github.com/gchq/Gaffer/issues/2478. I think we suggested looking to minimise the frequency with which we call .getGraph() on GraphSerialisables in this code as part of the solution, but it's the E2E performance the matters rather than any particular solution

sw96411 avatar Jul 16 '21 12:07 sw96411

Branch has been refactored, JobTracker cache has been changed to accept a naming suffix the same as FederatedStorageCache.

Please examine https://github.com/gchq/Gaffer/compare/gh-2457-double-caching-issue

GCHQDev404 avatar Jul 16 '21 17:07 GCHQDev404

This contains many breaking changes and will need to be reviewed before merging into V1 Gaffer. likely will need to be V2.

GCHQDev404 avatar Jul 16 '21 17:07 GCHQDev404

Just a quick note to say that the code looks good and I've had a quick test against my team's dev environment. Although I wasn't able to do a full E2E test due to dependency issues, I was able to validate that the performance issues mentioned above no longer occur. I think this code is ready to enter the formal merge/test pipeline at the author's convenience

sw96411 avatar Aug 13 '21 08:08 sw96411

P.S. Thanks v much @GCHQDev404 for the excellent work on this issue :-)

sw96411 avatar Aug 13 '21 08:08 sw96411

Migration

The Old cache will not work because the name of the cache will not be found/used. Solution. delete the cache directory. turn on the FederatedStore it will be empty, re-connect to stores. Store will continue to point to persistant data. (Proxy and Accumulo.)

GCHQDev404 avatar Nov 01 '22 12:11 GCHQDev404

Closed by #2595

n3101 avatar Nov 02 '22 13:11 n3101