
Limited Cache Sizes

Open csviri opened this issue 3 years ago • 4 comments

Currently, informers and the other provided event source implementations cache every resource they receive. This can lead to very high memory consumption. To address it, we could limit the cache size and/or how long a resource stays cached.

Ideally, a cache-limiting approach like the one supported by Caffeine could be quite efficient: for example, evicting objects that have not been accessed (neither read nor written) for a longer time.

To make this transparent, we need a more extended layer over EventSource, with an API for reading the object either from the cache (if the object is there) or from the target API itself.
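As a rough illustration of the idea above, here is a minimal sketch of a bounded read-through cache. It uses only the JDK (`LinkedHashMap` in access order) rather than Caffeine, and `ReadThroughCache` and `apiLookup` are hypothetical names, not part of java-operator-sdk or fabric8.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch: bounded cache that falls back to the target API on a miss. */
class ReadThroughCache<K, V> {
    private final Map<K, V> cache;
    // Assumption: stands in for a call to the target API (e.g. the Kubernetes API server).
    private final Function<K, Optional<V>> apiLookup;

    ReadThroughCache(int maxSize, Function<K, Optional<V>> apiLookup) {
        this.apiLookup = apiLookup;
        // accessOrder=true keeps the least recently accessed entry first,
        // so removeEldestEntry evicts it once the size limit is exceeded.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize;
            }
        };
    }

    /** Returns the cached value if present, otherwise reads from the API and caches the result. */
    synchronized Optional<V> get(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        Optional<V> fetched = apiLookup.apply(key);
        fetched.ifPresent(value -> cache.put(key, value));
        return fetched;
    }

    /** Stores a value, for example when an event delivers a fresh resource. */
    synchronized void put(K key, V value) {
        cache.put(key, value);
    }

    /** Checks presence without counting as an access. */
    synchronized boolean isCached(K key) {
        return cache.containsKey(key);
    }
}
```

Callers only ever use `get`, so whether the object came from memory or from the API stays hidden from them, which is the transparency the extended layer is meant to provide.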

csviri, Jan 30 '22 20:01

TODO: check the cache options in controller-runtime: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/cache#Options

csviri, Feb 01 '22 16:02

If you clean up cache entries, I think you will be in the same state as at startup, that is, with no knowledge of the previously observed state. This may lead to events not being triggered for deleted resources. So the evicted cache entry should be replaced with something lighter that either always triggers a reconcile or is smart enough to stand in for the real value when compared with equals.

For example, when the cache evicts a k8s resource of an informer, the entry can be replaced by just its resource ID and generation. When the informer next receives an event for this resource ID, equals can be called and will compare generations to decide whether an event should be triggered. If so, the value is replaced with the actual one; otherwise the resource has not changed and the cache can keep the lighter version. For external resources, maybe the caching strategy should be created with a Converter<TResource, TLightResource<TResource>>:

interface TLightResource<T> {
    // Returns true if the full resource is unchanged relative to this lightweight entry.
    boolean isSameAs(T real);
}
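A minimal sketch of how such a lightweight entry might look for the ID-plus-generation case described above. `FullResource` and `GenerationMarker` are hypothetical illustration types; a real implementation would read these values from the Kubernetes object metadata.

```java
/** Lightweight stand-in kept in the cache after the full resource is evicted. */
interface LightResource<T> {
    boolean isSameAs(T real);
}

/** Hypothetical full resource, reduced to the fields this sketch needs. */
final class FullResource {
    final String id;
    final long generation;
    final String spec;

    FullResource(String id, long generation, String spec) {
        this.id = id;
        this.generation = generation;
        this.spec = spec;
    }
}

/** Keeps only the ID and generation of an evicted resource. */
final class GenerationMarker implements LightResource<FullResource> {
    private final String id;
    private final long generation;

    private GenerationMarker(String id, long generation) {
        this.id = id;
        this.generation = generation;
    }

    /** Plays the role of the Converter: full resource in, lightweight entry out. */
    static GenerationMarker of(FullResource resource) {
        return new GenerationMarker(resource.id, resource.generation);
    }

    /** True when the incoming resource has the same ID and generation, so no event is needed. */
    @Override
    public boolean isSameAs(FullResource real) {
        return id.equals(real.id) && generation == real.generation;
    }
}
```

One caveat with this sketch: the generation typically changes only when the spec changes, so a comparison on generation alone would not notice status-only or metadata-only updates.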

scrocquesel, Feb 02 '22 22:02

> If you clean up cache entries, I think you will be in the same state as at startup, that is, with no knowledge of the previously observed state. This may lead to events not being triggered for deleted resources. So the evicted cache entry should be replaced with something lighter that either always triggers a reconcile or is smart enough to stand in for the real value when compared with equals.

@scrocquesel right, the plan is that in the future the informers will not handle the caching and will only store, as you mentioned, the minimal information needed to detect whether there was a change (this will be done in the fabric8 client project). This will let us handle caching in a way specific to our domain and customize it for our needs.

(There is already a separate issue discussing pruning in the cache itself, which is another variant of, or complementary to, this issue: https://github.com/java-operator-sdk/java-operator-sdk/issues/892 )

So this should work nicely with K8S objects, and it could be done for PerResourcePollingEventSource basically identically. For PollingEventSource it's a slightly different story: additional mechanisms might be needed (or it simply won't be done, since this is basically just an optimization).

csviri, Feb 03 '22 08:02

To make this efficient, that is, to know whether a resource exists on the API server or not, we should still cache minimal information about whether the resource exists. (The plan is to have this in the informer already in fabric8 v6.)

csviri, Apr 06 '22 08:04

On operator startup all resources are reconciled, so it is actually hard to come up with a strategy for which ones to evict.

csviri, Dec 01 '22 13:12

> On operator startup all resources are reconciled, so it is actually hard to come up with a strategy for which ones to evict.

Maybe evict with jitter, so that in the long run it smooths out cache peaks. At startup, you can't decide unless the cache is persistent.
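One way to read the jitter idea, as a sketch only: give each cache entry a time-to-live that is randomly offset around a base value, so entries that were all loaded at startup do not expire at the same moment. `JitteredTtl` and its parameters are hypothetical names for illustration.

```java
import java.time.Duration;
import java.util.Random;

/** Hypothetical helper: spreads entry expiry around a base TTL. */
class JitteredTtl {
    private final Duration base;
    private final double jitterFraction; // e.g. 0.2 means roughly +/- 20% of the base
    private final Random random;

    JitteredTtl(Duration base, double jitterFraction, Random random) {
        this.base = base;
        this.jitterFraction = jitterFraction;
        this.random = random;
    }

    /** Returns a TTL in [base * (1 - fraction), base * (1 + fraction)]. */
    Duration next() {
        // nextDouble() is in [0, 1), so offset is in [-fraction, +fraction).
        double offset = (random.nextDouble() * 2.0 - 1.0) * jitterFraction;
        return Duration.ofMillis(Math.round(base.toMillis() * (1.0 + offset)));
    }
}
```

Each entry would get its own `next()` value when it is cached, which staggers evictions over time. As noted above, this does not help with deciding what to evict at startup itself.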

Regarding external resources, I have a use case of a REST API that provides ETag/Cache-Control support. There, the per-resource caching poller can evict the actual representation and keep only the id/ETag. When polling, the lighter representation should be provided to the poller implementation. It can then do a GET request with the ETag, and if the server returns 304 Not Modified, it can signal that the resource did not change. When the cache of a secondary resource is accessed from a reconciliation, the lighter representation should be polled.
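A minimal sketch of the conditional request described above, using the JDK's `java.net.http` API (Java 11+). `ConditionalPoll` and its methods are hypothetical helpers; the sketch only builds the request and interprets the status code, and leaves sending it to the poller.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper for polling an external resource by its stored ETag. */
class ConditionalPoll {

    /** Builds a GET that asks the server to skip the body if the ETag still matches. */
    static HttpRequest conditionalGet(URI uri, String etag) {
        return HttpRequest.newBuilder(uri)
                .header("If-None-Match", etag)
                .GET()
                .build();
    }

    /** 304 Not Modified is the standard answer to a conditional request that still matches. */
    static boolean isUnchanged(int statusCode) {
        return statusCode == 304;
    }
}
```

With this, the cache only needs to hold the id and ETag between polls; the full representation is fetched again only when the server answers with something other than 304.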

scrocquesel, Dec 01 '22 19:12