source-controller icon indicating copy to clipboard operation
source-controller copied to clipboard

Cache metric issues

Open darkowlzz opened this issue 1 year ago • 2 comments

This is to highlight the issue with the cache metrics and discuss how to handle it with respect to the new generic typed cache.

At present, in source-controller we have a custom cache implementation for storing the Helm repository index, refer https://github.com/fluxcd/source-controller/tree/v1.4.1/internal/cache. This cache can store any value in the form of an empty interface interface{}. For instrumenting the cache, it exports a metric by the name gotk_cache_events_total with labels event_type, name and namespace for type of cache event (hit or miss), name of the Flux object for which the cache is used and namespace of the object, respectively.

Following is how it appears when exported:

# HELP gotk_cache_events_total Total number of cache retrieval events for a Gitops Toolkit resource reconciliation.
# TYPE gotk_cache_events_total counter
gotk_cache_events_total{event_type="cache_hit",name="podinfo",namespace="default"} 3
gotk_cache_events_total{event_type="cache_miss",name="podinfo",namespace="default"} 1

If it is known that the only cache in source-controller is the Helm index cache, this seems fine. But when we need to add other caches in source-controller for other purposes, this metric gets confusing.

In https://github.com/fluxcd/pkg/tree/main/cache, a new generic typed cache is implemented which will be used for different cache implementations across Flux. See https://github.com/fluxcd/pkg/pull/817 for the latest implementation of the generic typed cache.

The following sections describe the issues with the existing metrics with respect to the new generic typed cache.

Lack of object kind information

When a new cache instance (using the existing cache implementation), say auth token cache, is added for GitRepository, OCIRepository, etc, due to lack of the object kind in the metric label, we can't determine which object a metric belongs to. Two objects of different kinds may have the same name and namespace.

The new generic typed cache also has a similar metric but it also includes the object kind, which would help determine the associated object accurately.

Fixed metric name

The cache metric has hard-coded name, gotk_cache_events_total. If new cache instance and metrics are added, they'll have the same name. The metric registerer panics with the error duplicate metrics collector registration attempted when attempting to register two metrics with the same name.

This is true for the new generic typed cache too. The metric name is hard-coded.

In case of the old/existing cache implementation, the stored value is an empty interface, which can store any value. This enables the usage of the same cache for storing different items, although that may not be ideal, depending on the use case. A single shared cache for everything would work fine.

But the new generic typed cache can't be shared because the cache instance are created with their type and can only store values of specific type. This property of the new cache requires creating separate caches for storing different types of data. A cache for storing the helm index cache of type IndexFile. Another cache for storing auth tokens of type string. The two with hard-coded metric names can't be registered with the prometheus registerer.


Possible solutions

I believe ideally, the new cache would replace the existing cache in source-controller. But for that we need to address the above issues. Using the new generic typed cache will address the first issue as it includes the object kind in the metric. But for addressing the fixed metric name issue, we may need to allow dynamic/variable names.

The constructor of the cache metrics can accept a prefix which can be added before the metric's fixed name. For example, if the fixed name is cache_events_total, a prefix gotk_helm_index could be passed, resulting in a new metric with name gotk_helm_index_cache_events_total. This would make is easier to instrument individual caches and identify them easily. In case of the helm index, all the metrics will belong to HelmRepository, but for auth token cache, the same instance of cache may be shared amongst multiple reconcilers and the object kind would help filter the metric for specific object kinds.

This would also benefit other users of our packages who may like to enable the caching support. They can pass their own prefix, as gotk_cache wouldn't mean anything to their users.

Concerns

  • Although we don't publish any monitoring dashboard that uses the existing cache metric, renaming the metric would break the existing monitoring setup of the users. For just helm index cache metrics, maybe we can find some way to continue exporting the metric with the same name, but we may eventually have to drop it.
  • For the auth token caches, should the cache be shared across the objects kinds/reconcilers? The considerations for this may not be based on the metrics, but the decision will affect the way metrics will be exported. If it's shared, the auth cache metric could be something like gotk_auth_cache_events_total. If it's not shared, it could be something like gotk_git_auth_cache_events_total, gotk_oci_auth_cache_events_total, etc.

The new generic typed cache implementation will be updated accordingly.

darkowlzz avatar Oct 21 '24 15:10 darkowlzz

I'm for adding the cache prefix to the package and have dedicated metrics per kind in source-controller. Given that Git and OCI can't share the same key/token for Azure, I suspect this will be the case for other providers too.

stefanprodan avatar Oct 21 '24 16:10 stefanprodan

I think the cache metrics should have specific names based on use case, and not specifically kind. Each use case will most likely always have a specific kind associated with, but I don't think the kind is what matters the most, as we may have two different use cases that are both associated with the same kind. I can't think of a specific example from the top of my head, but storing Helm indexes and auth tokens are already an example of two different use cases for caching, and the kind they relate to is only crucial for the former.

So I'm also for adding a metric name prefix to the metric/cache constructor. I'm fine with making the kind part of this prefix, as part of the use case name.

matheuscscp avatar Oct 22 '24 11:10 matheuscscp

In the last week's dev meeting, we discussed this and arrived at the following conclusion:

  • The new cache implementation should accept prefix for the metrics to prevent any conflicts.
  • The caches in source-controller will use prefix for cache metrics. The metrics will have enough information in the name to determine what the metric is about and each kind will have their own separate cache so that they can work independently, resulting in metrics like gotk_git_auth_cache_events_total, gotk_oci_auth_cache_events_total.
  • :warning: The existing source-controller gotk_cache_events_total metric for helm index will be replaced with more specific gotk_helm_index_cache_events_total metric. This will be a breaking change for the users of this metric.
  • Although the metric names in this case indicate the kind of the object they are associated with, for consistency with other object based metrics, the metric labels will have kind, name and namespace of the associated object.

https://github.com/fluxcd/pkg/pull/817 has been updated accordingly to support these use cases.

More discussion is required to determine how the new caches will be configured. But that can be discussed in a separate issue related to those new caches.

darkowlzz avatar Nov 07 '24 19:11 darkowlzz

Re-opening to keep track of when this will be implemented, especially for the helm index metrics change which will be a breaking change and need to be mentioned in the release notes.

darkowlzz avatar Nov 07 '24 20:11 darkowlzz