java-operator-sdk

Custom bootstrap and CR lifecycle management

Donnerbart opened this issue 4 months ago • 9 comments

This is mostly a reality check on whether our custom bootstrap uses JOSDK as intended. Depending on the feedback we can close this issue and create more specific follow-up issues if needed. I just wanted to describe the context once.

// configure operator
final var customResourceMetrics = new CustomResourceMetrics(config, meterRegistry);
final var eventSender = new EventSender(config, client);
final var reconciler = new HiveMQPlatformReconciler(config, customResourceMetrics, client, eventSender);

final var dependentResourceFactory = new HiveMQPlatformOperatorDependentResourceFactory<>(config, eventSender);
final var metrics = MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(meterRegistry).build();
final var operator = new Operator(override -> override //
        .withConcurrentReconciliationThreads(config.getConcurrentReconciliationThreads())
        .withConcurrentWorkflowExecutorThreads(config.getConcurrentWorkflowThreads())
        .withReconciliationTerminationTimeout(config.getReconciliationTerminationTimeout())
        .withCacheSyncTimeout(config.getCacheSyncTimeout())
        .withKubernetesClient(client)
        .withCloseClientOnStop(closeClient)
        .withDependentResourceFactory(dependentResourceFactory)
        .withMetrics(metrics)
        .withUseSSAToPatchPrimaryResource(useSSA)
        .withSSABasedCreateUpdateMatchForDependentResources(useSSA));
final var reconcilerConfig = operator.getConfigurationService().getConfigurationFor(reconciler); // (1)
final var overrideConfig = ControllerConfigurationOverrider.override(reconcilerConfig);
if (operatorNamespaces.equals(Constants.WATCH_CURRENT_NAMESPACE)) {
    overrideConfig.watchingOnlyCurrentNamespace();
} else if (operatorNamespaces.equals(Constants.WATCH_ALL_NAMESPACES)) {
    overrideConfig.watchingAllNamespaces();
} else {
    final var namespaces = Arrays.stream(operatorNamespaces.split(",")).collect(Collectors.toSet());
    overrideConfig.settingNamespaces(namespaces);
}
if (!operatorSelector.isBlank()) {
    overrideConfig.withLabelSelector(operatorSelector);
}
overrideConfig.withOnAddFilter(platform -> { // (3)
    customResourceMetrics.register(platform);
    return true;
});
final var controller = (Controller<HiveMQPlatform>) operator.register(reconciler, overrideConfig.build());
customResourceMetrics.setCache(controller.getEventSourceManager().getControllerEventSource()); // (2)
  • HiveMQPlatformOperatorDependentResourceFactory implements DependentResourceFactory (it's the only place where we really had to replace Quarkus ArC; otherwise we don't need CDI).
  • closeClient is true in production and only set to false in tests, so we don't close an injected K8s client (which is still used in the test).
  • useSSA is true in production and only set to false in tests with the K8s mockserver.
  1. Are we creating the configuration and overrides as intended? I tried to reverse engineer how Quarkus and the LocallyRunOperatorExtension are configuring JOSDK, but we still get a warning on the operator.getConfigurationService().getConfigurationFor(reconciler) invocation:
    12:29:37.423 [main] WARN  Default ConfigurationService implementation - Configuration for reconciler
    'hivemq-controller' was not found. Known reconcilers: None.
    
    It feels wrong to get that warning, but I found no other way to create a ControllerConfigurationOverrider.
  2. How do we properly implement custom metrics for all custom resources in an efficient way? For example, in CustomResourceMetrics we have a global Gauge for each state of our operator state machine. To calculate the values we count all custom resources that are in that state:
    cache.list()
        .map(CustomResource::getStatus)
        .filter(Objects::nonNull)
        .filter(status -> status.getState().equals(state))
        .count();
    
    That cache is the ControllerEventSource that we set via customResourceMetrics.setCache(controller.getEventSourceManager().getControllerEventSource()). This feels a bit illegal, but it works very well for us. In the previous implementation we kept a shadow copy of all custom resources in a CHM, but those were the cloned instances from the reconciliation loop. This impacted the GC and required constant updates of the cached instances to retrieve the current state. Using the original instances from the cache solved this problem very elegantly for us, including the lifecycle management. (A sketch of how such cache-backed gauges could look follows after this list.)
  3. The same problem hits us again with a new Gauge for an aggregated health status metric per custom resource. We need to register this Gauge once when the CR is added, deregister it when the CR is removed, and in between have access to the current CustomResource::getStatus to read the health state. So we're back to a Map<String, GaugeHolder> with <namespace>-<name> as the key. For the Gauge creation I'm using overrideConfig.withOnAddFilter(), which again feels very wrong, but seems to work fine. The Gauge removal is done in the Cleaner::cleanup method of our reconciler. Is there a better way to control this lifecycle, and where should that Gauge live? It feels like we cannot put it into the custom resource itself, because of the cloning (the Gauge needs to be a singleton) and the need to update the correct instance. This is how it looks in CustomResourceMetrics right now:
    private final @NotNull Map<String, GaugeHolder> healthMetricGauges = new ConcurrentHashMap<>();
    
    // called from the bootstrap
    public void register(final @NotNull HiveMQPlatform platform) {
        healthMetricGauges.put(getKey(platform), new GaugeHolder(platform));
    }
    
    // called from Cleaner::cleanup
    public void deregister(final @NotNull HiveMQPlatform platform) {
        // remove() instead of get(), so the map entry is cleaned up along with the gauge
        if (healthMetricGauges.remove(getKey(platform)) instanceof GaugeHolder gaugeHolder) {
            meterRegistry.remove(gaugeHolder.gauge);
        }
    }
    
    // called from the main reconciler when the health state has changed
    public void updateHealthMetric(final @NotNull HiveMQPlatform platform) {
        if (healthMetricGauges.get(getKey(platform)) instanceof GaugeHolder gaugeHolder) {
            gaugeHolder.value.set(platform.getStatus().getHealthStatus().getMetricsValue());
        }
    }
    
    private static @NotNull String getKey(final @NotNull HiveMQPlatform platform) {
        return platform.getMetadata().getNamespace() + "-" + platform.getMetadata().getName();
    }
    
    private class GaugeHolder {
    
        private final @NotNull AtomicInteger value = new AtomicInteger(HealthStatus.UNKNOWN.getMetricsValue());
        private final @NotNull Gauge gauge;
    
        public GaugeHolder(final @NotNull HiveMQPlatform platform) {
            this.gauge = Gauge.builder("hivemq.platform.health.system.current", value::get)
                    .tag("namespace", platform.getMetadata().getNamespace())
                    .tag("name", platform.getMetadata().getName())
                    .description("Custom resource health status")
                    .register(meterRegistry);
        }
    }
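
As referenced in (2) above, here is a minimal sketch of binding the per-state gauges directly to that cache, assuming a State enum on the status and using Micrometer's Gauge.builder(name, obj, valueFunction) overload; the metric name is illustrative:

// minimal sketch: one gauge per state, computed from the primary resource cache on each scrape
for (final var state : State.values()) {
    Gauge.builder("hivemq.platform.state.current.count", cache,
                    c -> c.list()
                            .map(CustomResource::getStatus)
                            .filter(Objects::nonNull)
                            .filter(status -> status.getState().equals(state))
                            .count())
            .tag("state", state.name())
            .description("Number of custom resources in this state")
            .register(meterRegistry);
}

Since each gauge reads through the cache on every scrape, there is nothing to update or clean up per custom resource for these global state metrics.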
    

Donnerbart · Aug 05 '25 11:08

Hi @Donnerbart,

  1. I think we should iterate on that logging message; it is surely misleading in some cases. Here is where the operator is registered: https://github.com/operator-framework/java-operator-sdk/blob/3723b1c8b72258583e260a40be298bce148c4daf/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/Operator.java#L240

It surely does not have a configuration at that point, so we should not log this message. Pls @metacosm take a look too. I will create a PR to remove it.

csviri · Aug 08 '25 08:08

Regarding point 2 and 3.

I think implementing the Metrics interface yourself, or extending the current Micrometer-based implementation, should do the job:

https://github.com/operator-framework/java-operator-sdk/blob/3723b1c8b72258583e260a40be298bce148c4daf/micrometer-support/src/main/java/io/javaoperatorsdk/operator/monitoring/micrometer/MicrometerMetrics.java#L27

Note that you get a callback when reconciliation is finished for a CR, or when it is deleted; based on that you can maintain your own index in the metrics implementation.

Alternatively, you have quite strong access rights there, for example direct access to the controller:

https://github.com/operator-framework/java-operator-sdk/blob/3723b1c8b72258583e260a40be298bce148c4daf/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/monitoring/Metrics.java#L26

Through that you can access the EventSourceManager, so basically you can reach all the caches. Maybe not through a nice API, but basically everything you need to create your own metrics is accessible there.

We can extend this interface with additional callbacks if needed. Pls take a look and let us know.
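
A rough sketch of what such an implementation could look like (a sketch only: the callback names are taken from the Metrics interface linked above, exact signatures vary between JOSDK versions, and the deregister(ResourceID) overload is hypothetical):

// sketch: drive the per-CR gauge lifecycle from Metrics callbacks
// instead of onAddFilter() and Cleaner::cleanup
public class GaugeLifecycleMetrics implements Metrics {

    private final CustomResourceMetrics customResourceMetrics;

    public GaugeLifecycleMetrics(final CustomResourceMetrics customResourceMetrics) {
        this.customResourceMetrics = customResourceMetrics;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void controllerRegistered(final Controller<? extends HasMetadata> controller) {
        // the same cache access as in the bootstrap, but obtained via an official callback
        customResourceMetrics.setCache(
                (ControllerEventSource<HiveMQPlatform>) controller.getEventSourceManager().getControllerEventSource());
    }

    @Override
    public void finishedReconciliation(final HasMetadata resource, final Map<String, Object> metadata) {
        if (resource instanceof HiveMQPlatform platform) {
            // register the gauge lazily on the first reconciliation, update it afterwards
            customResourceMetrics.updateHealthMetric(platform);
        }
    }

    @Override
    public void cleanupDoneFor(final ResourceID resourceID, final Map<String, Object> metadata) {
        // hypothetical deregister() overload keyed by ResourceID instead of the resource itself
        customResourceMetrics.deregister(resourceID);
    }
}

Since withMetrics() takes a single instance, such a class would either replace the MicrometerMetrics from the bootstrap or delegate to it.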

csviri · Aug 08 '25 08:08

Added a PR for point 1: https://github.com/operator-framework/java-operator-sdk/pull/2891

csviri · Aug 08 '25 08:08

Do you need to override everything manually? In particular, I don't understand the part where you set the namespaces on the controller: this should be done by default, so I'm not sure why you do it explicitly. Another thing is that you could use register(Reconciler<P> reconciler, Consumer<ControllerConfigurationOverrider<P>> configOverrider) instead of creating the overridden configuration yourself.

metacosm · Aug 08 '25 10:08

Do you need to override everything manually? In particular, I don't understand the part where you set the namespaces on the controller: this should be done by default, so I'm not sure why you do it explicitly.

We are not using Quarkus with QOSDK anymore; we refactored our operator into a plain Java application. We configure the operator with our own Helm chart -> env variables -> custom SmallRye-based configuration. The operatorNamespaces and operatorSelector variables come from that config. So I'm pretty sure we have to configure that manually.

When we still used QOSDK we configured QUARKUS_OPERATOR_SDK_NAMESPACES and QUARKUS_OPERATOR_SDK_CONTROLLERS__CONTROLLERS__SELECTOR in the Helm chart. We replaced this 1:1 with our own env variables and the custom bootstrap I posted here. For a while we even supported the Quarkus env variables for backwards compatibility with custom Helm charts.

The ConfigurationServiceOverrider is mostly used to configure the operator for testing (e.g. disabling SSA for unit tests with the K8s mockserver). The thread and timeout configurations are again for backwards compatibility with QOSDK and come from the corresponding env variables.
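
For context, this is roughly what such a SmallRye-based mapping looks like (a sketch with illustrative names and defaults, not our actual config class):

// illustrative SmallRye Config mapping backed by env variables
// (e.g. OPERATOR_NAMESPACES, OPERATOR_SELECTOR)
@ConfigMapping(prefix = "operator")
public interface OperatorConfig {

    @WithName("namespaces")
    @WithDefault(Constants.WATCH_ALL_NAMESPACES)
    String operatorNamespaces();

    @WithName("selector")
    @WithDefault("")
    String operatorSelector();
}

// bootstrap:
final var operatorConfig = new SmallRyeConfigBuilder()
        .addDefaultSources() // includes env variables and system properties
        .withMapping(OperatorConfig.class)
        .build()
        .getConfigMapping(OperatorConfig.class);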

Another thing is that you could use register(Reconciler<P> reconciler, Consumer<ControllerConfigurationOverrider<P>> configOverrider) instead of creating the overridden configuration yourself.

Oh my, I totally overlooked that register(Reconciler, Consumer) method. This solves (1) for us!
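
For reference, a minimal sketch of the bootstrap above using that overload (same variables; this replaces the manual getConfigurationFor() / ControllerConfigurationOverrider.override() dance, and with it the warning):

final var controller = (Controller<HiveMQPlatform>) operator.register(reconciler, override -> {
    if (operatorNamespaces.equals(Constants.WATCH_CURRENT_NAMESPACE)) {
        override.watchingOnlyCurrentNamespace();
    } else if (operatorNamespaces.equals(Constants.WATCH_ALL_NAMESPACES)) {
        override.watchingAllNamespaces();
    } else {
        override.settingNamespaces(Arrays.stream(operatorNamespaces.split(",")).collect(Collectors.toSet()));
    }
    if (!operatorSelector.isBlank()) {
        override.withLabelSelector(operatorSelector);
    }
    override.withOnAddFilter(platform -> {
        customResourceMetrics.register(platform);
        return true;
    });
});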

I'll look into the metrics recommendations later, thanks so much for that feedback!

Donnerbart · Aug 08 '25 10:08

Any particular reason why you decided to move away from QOSDK?

metacosm · Aug 08 '25 12:08

Added a PR for points 2 and 3: https://github.com/operator-framework/java-operator-sdk/pull/2912

Any particular reason why you decided to move away from QOSDK?

  • Reported OOM kills, where we suspected Quarkus as the root cause (there were at least 3 open issues at the time reporting memory leaks in Quarkus).
  • We had a lot of CVE reports for transitive dependencies of Quarkus and could not safely bump versions in our project without the risk of breaking Quarkus. We actually used implementation(enforcedPlatform(libs.quarkus.bom)) to strictly use the dependency versions from Quarkus, but this delayed CVE fixes until the next Quarkus release that bumped the affected versions.
  • Quarkus doesn't provide a default Gradle runtimeClasspath configuration with all resolved dependencies, due to the extension system. This required us to configure Snyk to look up dependency versions from the JAR files in the lib folder.
  • CVE reports for the used base image were marked as "not affected, won't fix" by Red Hat. Our customers were not happy with that result, so we had to switch to our company default base image eclipse-temurin and reverse engineer how a Quarkus application is started.
  • We had to customize the test setup so much that it felt like we were working against Quarkus more than it was helping us.
    • We have to use a custom K3s container to stay up to date and to load custom images into it. The default dev services are not usable for us (e.g. lagging behind with K8s versions). Using different container configurations didn't work (Quarkus started both containers although we used different @WithTestResource() annotations; we could never figure out how to fix that).
    • The lifecycle of the operator was bound to the whole test suite, even for tests where we don't need or want an operator instance. Also, using test profiles for different configurations resulted in very hard-to-debug test failures.
  • Too many layers to solve issues, e.g. https://github.com/operator-framework/java-operator-sdk/issues/2526
  • Knowledge silos: The operator was our only Quarkus application. We couldn't use our knowledge from our other projects (different webserver, different CDI, different base image, ...) and vice versa.
  • In the end we didn't use any unique Quarkus feature, but paid a high price in complexity and maintenance effort. Our custom bootstrap is much simpler and easier to understand. We can now control the lifecycle of the operator and our K3s container exactly in each test, e.g. not starting an operator at all, starting it with a different configuration, or even starting multiple operators with different configurations (e.g. for an upgrade test with different versions).
  • The simplified bootstrap halved the startup time and loads 2,000 fewer classes. The heap dumps are now readable for us (no more noise of JBoss Jandex classes as dominators etc.).

So all in all we love JOSDK, but QOSDK didn't work out for us. Our code base is now less complex and easier for other engineers to understand, and the application actually performs better.

Donnerbart · Aug 26 '25 13:08

Hi @Donnerbart, thank you for the detailed answer. All of it makes perfect sense, hopefully not all of it was a bad experience… Testing is definitely something that we need to improve in QOSDK, though I have to admit that there isn't much we can do for the rest on our side only… 😓

metacosm · Aug 27 '25 14:08

Added a PR for points 2 and 3: #2912

Any particular reason why you decided to move away from QOSDK?

* Reported OOM kills, where we suspected Quarkus as the root cause (there were at least 3 open issues at the time reporting memory leaks in Quarkus).

Do you recall which issues these were by chance? We try to address this kind of issue as quickly as possible, usually…

* The simplified bootstrap halved the startup time and loads 2,000 fewer classes. The heap dumps are now readable for us (no more noise of JBoss Jandex classes as dominators etc.).

This is actually quite surprising. Could you clarify when / under which conditions Jandex was dominating the heap dumps? Jandex classes should only be present at build time, not runtime.

metacosm · Aug 27 '25 15:08