olric
Read Repair vs healing to satisfy ReplicaCount
During an upgrade of my application, which uses olric embedded, all replicas are restarted in quick succession. Most keys will never be read, or will be read very infrequently, which with read repair means that after an upgrade the cache is immediately "empty".
That behavior means olric doesn't actually provide useful functionality for my use case, unless I add code that reads the entire keyspace after startup to repair the cache redundancy before letting Kubernetes know that the Pod started successfully.
I believe that olric could do this internally more efficiently, and that such functionality would be generally useful:
From the documentation it seems that olric could "easily" know that a part of the keyspace doesn't satisfy the requested ReplicaCount, and actively transfer that data to a newly joined member to repair the cache when a node restarts.
So this is a request for:
- when joining a cluster, ask it to transfer some data to the new node to satisfy ReplicaCount
- provide an API to detect when this initial sync is finished so that the embedding application can communicate to the Kubernetes API when it is safe to continue with the rollout
- detect node joins/departures quickly enough to keep such a rollout fast
- useful node identity in the context of a Kubernetes cluster, where IP addresses are basically useless
Having the same issue here. I have to do a full read repair of all keys to trigger data transfer when a new node joins. Do we know which version will address this issue?
Having the same issue.
Hi all,
I'm aware that this is one of the most wanted features among users. I started working on a solution based on a technique called vector clocks. It may be ready for initial tests in a couple of months, and I plan to make it production-ready by the end of this year.
For anyone who is curious about version vectors, here is some info:
- https://riak.com/posts/technical/vector-clocks-revisited/index.html?p=9545.html
- https://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/
- https://en.wikipedia.org/wiki/Vector_clock
- https://github.com/hazelcast/hazelcast/blob/master/docs/design/partitioning/03-fine-grained-anti-entropy-mechanism.md
- https://people.cs.rutgers.edu/~pxk/417/notes/logical-clocks.html
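To make the linked material concrete, here is a minimal version-vector sketch (an illustration of the general technique, not olric's implementation): each replica increments its own counter on a local write, anti-entropy merges two vectors component-wise, and comparing vectors tells you whether one version descends from the other or the two are concurrent.

```go
package main

import "fmt"

// VV is a version vector: replica ID -> count of updates seen from it.
type VV map[string]uint64

// Increment records a local update on replica id.
func (v VV) Increment(id string) { v[id]++ }

// Descends reports whether v dominates other component-wise, i.e.
// v has seen every update that other has seen.
func (v VV) Descends(other VV) bool {
	for id, n := range other {
		if v[id] < n {
			return false
		}
	}
	return true
}

// Merge takes the component-wise maximum: the version that results
// when anti-entropy reconciles two replicas.
func Merge(a, b VV) VV {
	out := VV{}
	for id, n := range a {
		out[id] = n
	}
	for id, n := range b {
		if n > out[id] {
			out[id] = n
		}
	}
	return out
}

func main() {
	a := VV{"node1": 2, "node2": 1}
	b := VV{"node1": 1, "node2": 3}
	// Neither descends from the other: the writes are concurrent,
	// which is the conflict case anti-entropy must resolve.
	fmt.Println(a.Descends(b), b.Descends(a)) // false false
	m := Merge(a, b)
	fmt.Println(m["node1"], m["node2"]) // 2 3
}
```

With per-partition vectors like this, a restarted node can compare its vector against the cluster's and fetch only the entries it is missing, instead of relying on reads to trigger repair.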