dotnet-operator-sdk

[bug]: Reconciliation 409-Conflict failing on Modified entities since v10

Open joaope opened this issue 1 month ago • 8 comments

Describe the bug

Reconciliation is failing on Modified events. There have been no code changes since v9 apart from updating namespaces and return values to accommodate the breaking changes. The stack trace looks like this:

fail: KubeOps.Operator.Watcher.ResourceWatcher[0]
      Reconciliation of Modified for CustomObj/mycrd failed.
      k8s.Autorest.HttpOperationException: Operation returned an invalid status code 'Conflict', response body {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on my.custom.objs.io \"mycrd\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"mycrd","group":"my.custom.objs.io","kind":"mycdrs"},"code":409}
      
         at k8s.Kubernetes.SendRequestRaw(String requestContent, HttpRequestMessage httpRequest, CancellationToken cancellationToken)
         at k8s.AbstractKubernetes.ICustomObjectsOperations_ReplaceNamespacedCustomObjectWithHttpMessagesAsync[T](Object body, String group, String version, String namespaceParameter, String plural, String name, String dryRun, String fieldManager, String fieldValidation, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken)
         at k8s.AbstractKubernetes.k8s.ICustomObjectsOperations.ReplaceNamespacedCustomObjectWithHttpMessagesAsync[T](Object body, String group, String version, String namespaceParameter, String plural, String name, String dryRun, String fieldManager, String fieldValidation, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken)
         at k8s.GenericClient.ReplaceNamespacedAsync[T](T obj, String ns, String name, CancellationToken cancel)
         at KubeOps.KubernetesClient.KubernetesClient.UpdateAsync[TEntity](TEntity entity, CancellationToken cancellationToken)
         at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileEntity(TEntity entity, CancellationToken cancellationToken)
         at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileEntity(TEntity entity, CancellationToken cancellationToken)
         at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileModification(ReconciliationContext`1 reconciliationContext, CancellationToken cancellationToken)
         at KubeOps.Operator.Reconciliation.Reconciler`1.Reconcile(ReconciliationContext`1 reconciliationContext, CancellationToken cancellationToken)
         at KubeOps.Operator.Watcher.ResourceWatcher`1.OnEventAsync(WatchEventType eventType, TEntity entity, CancellationToken cancellationToken)
         at KubeOps.Operator.Watcher.ResourceWatcher`1.WatchClientEventsAsync(CancellationToken stoppingToken)

Worth adding that the problem doesn't seem to happen when turning AutoAttachFinalizers off, so I believe the problem lies somewhere around the new Reconciler.cs, in particular this if-statement:

if (operatorSettings.AutoAttachFinalizers)
{
    var finalizers = scope.ServiceProvider.GetKeyedServices<IEntityFinalizer<TEntity>>(KeyedService.AnyKey);

    foreach (var finalizer in finalizers)
    {
        entity.AddFinalizer(finalizer.GetIdentifierName(entity));
    }

    entity = await client.UpdateAsync(entity, cancellationToken);
}

var controller = scope.ServiceProvider.GetRequiredService<IEntityController<TEntity>>();
return await controller.ReconcileAsync(entity, cancellationToken);

I don't have any finalizers but isn't this too late to update an entity with them? And also just before the reconciliation so any changes to the object will generate a 409.

This runs even when I don't have any finalizers, and it runs every time an entity is Added or Modified, because this logic weirdly treats both cases as the same when they have fundamental differences.

public async Task<ReconciliationResult<TEntity>> Reconcile(ReconciliationContext<TEntity> reconciliationContext, CancellationToken cancellationToken)
{
    var result = reconciliationContext.EventType switch
    {
        WatchEventType.Added or WatchEventType.Modified =>
            await ReconcileModification(reconciliationContext, cancellationToken),
        WatchEventType.Deleted =>
            await ReconcileDeletion(reconciliationContext, cancellationToken),
        _ => throw new NotSupportedException($"Reconciliation event type {reconciliationContext.EventType} is not supported!"),
    };
    
    // ...
}

These are some initial findings. I might be able to give some more context later. Thanks!

To reproduce

  1. Create controller that only changes entity status and updates it via client.UpdateStatusAsync()
  2. Add new entity to the cluster
  3. See reconciliation running
  4. Modify the object manually
  5. Fails with above exception

Expected behavior

Does not fail reconciliation

joaope avatar Dec 01 '25 23:12 joaope

Actually, something that might be related and around this area was raised by @buehler on #980.

https://github.com/dotnet/dotnet-operator-sdk/pull/980/files#r2517837131

joaope avatar Dec 02 '25 00:12 joaope

@joaope I will have a look into this - just to get it right:

  • you don't have any finalizers?

could you elaborate a little on this? sorry but I don't get it exactly (no native speaker here)

> I don't have any finalizers but isn't this too late to update an entity with them? And also just before the reconciliation so any changes to the object will generate a 409.

thanks.

p.s.: as a workaround - until this is fixed - as you mentioned please disable auto-attaching/detaching.

kimpenhaus avatar Dec 02 '25 19:12 kimpenhaus

@joaope I've opened up a PR which only updates the entity when at least a single finalizer is attached. I've also extended the existing integration test to cover your scenario:

  • create an instance of a crd
  • during reconciliation of the added event -> change status

with (and without) the optimization this integration test runs successfully. maybe you can re-check once the PR is completed?

thanks for reporting 😄

kimpenhaus avatar Dec 02 '25 21:12 kimpenhaus

> could you elaborate a little on this? sorry but I don't get it exactly (no native speaker here)
>
> > I don't have any finalizers but isn't this too late to update an entity with them? And also just before the reconciliation so any changes to the object will generate a 409.

Sorry, I probably worded it badly (not a native speaker either!). I was just thinking out loud as I'm not that familiar with finalizers. Isn't entity.AddFinalizer() something that should happen only once, on WatchEventType.Added?

My understanding is that this logic is being called for both Added and Modified events so it ends up adding the same finalizers (and pushing an Update() in the process) to the same entity over and over again every time it’s modified.

Maybe that's OK, just my lack of knowledge around it and how kubeops internal reconciliation logic works.

joaope avatar Dec 02 '25 23:12 joaope

Smallest repro I can come up with: https://gist.github.com/joaope/8170b2c25d7429f8d5e4b67773764713

It fails on different scenarios:

  1. Deploy operator without any Widget in the cluster and only then create a Widget resource
info: WidgetsOperator.ControlPlane.WidgetController[0]
      WIDGET (my-widget): Reconciling...
info: WidgetsOperator.Operator.ControlPlane.WidgetController[0]
      WIDGET (my-widget): Reconciled
fail: KubeOps.Operator.Watcher.ResourceWatcher[0]
      Reconciliation of Modified for Widget/my-widget failed.
      k8s.Autorest.HttpOperationException: Operation returned an invalid status code 'Conflict', response body {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on widgets.example.com \"my-widget\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"my-widget","group":"example.com","kind":"widgets"},"code":409}
      
         at k8s.Kubernetes.SendRequestRaw(String requestContent, HttpRequestMessage httpRequest, CancellationToken cancellationToken)
         at k8s.AbstractKubernetes.ICustomObjectsOperations_ReplaceNamespacedCustomObjectWithHttpMessagesAsync[T](Object body, String group, String version, String namespaceParameter, String plural, String name, String dryRun, String fieldManager, String fieldValidation, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken)
         at k8s.AbstractKubernetes.k8s.ICustomObjectsOperations.ReplaceNamespacedCustomObjectWithHttpMessagesAsync[T](Object body, String group, String version, String namespaceParameter, String plural, String name, String dryRun, String fieldManager, String fieldValidation, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken)
         at k8s.GenericClient.ReplaceNamespacedAsync[T](T obj, String ns, String name, CancellationToken cancel)
         at KubeOps.KubernetesClient.KubernetesClient.UpdateAsync[TEntity](TEntity entity, CancellationToken cancellationToken)
         at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileEntity(TEntity entity, CancellationToken cancellationToken)
         at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileEntity(TEntity entity, CancellationToken cancellationToken)
         at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileModification(ReconciliationContext`1 reconciliationContext, CancellationToken cancellationToken)
         at KubeOps.Operator.Reconciliation.Reconciler`1.Reconcile(ReconciliationContext`1 reconciliationContext, CancellationToken cancellationToken)
         at KubeOps.Operator.Watcher.ResourceWatcher`1.OnEventAsync(WatchEventType eventType, TEntity entity, CancellationToken cancellationToken)
         at KubeOps.Operator.Watcher.ResourceWatcher`1.WatchClientEventsAsync(CancellationToken stoppingToken)
  2. Deploy operator with a Widget resource already in the cluster. Meanwhile delete and recreate the resource
dbug: Microsoft.Extensions.Hosting.Internal.Host[2]
      Hosting started
info: WidgetsOperator.ControlPlane.WidgetController[0]
      WIDGET (my-widget): Reconciling...
info: WidgetsOperator.ControlPlane.WidgetController[0]
      WIDGET (my-widget): Reconciled
info: WidgetsOperator.ControlPlane.WidgetController[0]
      WIDGET (my-widget): Deleted
info: WidgetsOperator.ControlPlane.WidgetController[0]
      WIDGET (my-widget): Reconciling...
info: WidgetsOperator.ControlPlane.WidgetController[0]
      WIDGET (my-widget): Reconciled
fail: KubeOps.Operator.Watcher.ResourceWatcher[0]
      Reconciliation of Modified for Widget/my-widget failed.
      k8s.Autorest.HttpOperationException: Operation returned an invalid status code 'Conflict', response body {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on widgets.example.com \"my-widget\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"my-widget","group":"example.com","kind":"widgets"},"code":409}
      
         at k8s.Kubernetes.SendRequestRaw(String requestContent, HttpRequestMessage httpRequest, CancellationToken cancellationToken)
         at k8s.AbstractKubernetes.ICustomObjectsOperations_ReplaceNamespacedCustomObjectWithHttpMessagesAsync[T](Object body, String group, String version, String namespaceParameter, String plural, String name, String dryRun, String fieldManager, String fieldValidation, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken)
         at k8s.AbstractKubernetes.k8s.ICustomObjectsOperations.ReplaceNamespacedCustomObjectWithHttpMessagesAsync[T](Object body, String group, String version, String namespaceParameter, String plural, String name, String dryRun, String fieldManager, String fieldValidation, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken)
         at k8s.GenericClient.ReplaceNamespacedAsync[T](T obj, String ns, String name, CancellationToken cancel)
         at KubeOps.KubernetesClient.KubernetesClient.UpdateAsync[TEntity](TEntity entity, CancellationToken cancellationToken)
         at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileEntity(TEntity entity, CancellationToken cancellationToken)
         at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileEntity(TEntity entity, CancellationToken cancellationToken)
         at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileModification(ReconciliationContext`1 reconciliationContext, CancellationToken cancellationToken)
         at KubeOps.Operator.Reconciliation.Reconciler`1.Reconcile(ReconciliationContext`1 reconciliationContext, CancellationToken cancellationToken)
         at KubeOps.Operator.Watcher.ResourceWatcher`1.OnEventAsync(WatchEventType eventType, TEntity entity, CancellationToken cancellationToken)
         at KubeOps.Operator.Watcher.ResourceWatcher`1.WatchClientEventsAsync(CancellationToken stoppingToken)

So the reconciliation client-side (my controller) looks like it actually happens? The error is internal to the lib?

Obviously, if I turn AutoAttachFinalizers off, none of the above happens.

joaope avatar Dec 03 '25 00:12 joaope

@joaope sorry for delayed answer but I hadn't had the chance to have a closer look.

first of all I think there is a major issue in your code (in the gist class WidgetController line 13)

await client.UpdateStatusAsync(entity, cancellationToken);

the UpdateStatusAsync returns the modified/updated entity - this needs to be:

entity = await client.UpdateStatusAsync(entity, cancellationToken);

when not using the updated entity every further attempt to modify the entity will lead to a 409.

Second I saw is, that the 409 exception was raised after the log entry WIDGET (my-widget): Reconciled

this backs my assumption that the main cause is line 13 in the WidgetController. Maybe you can just quickly fix this and give it a retry.

> My understanding is that this logic is being called for both Added and Modified events so it ends up adding the same finalizers (and pushing an Update() in the process) to the same entity over and over again every time it’s modified.

this behaves actually different - Kubernetes differentiates by resource version and generation. in short the difference is that the resource version is changed with every write on the crd while the generation is only changed when the spec is changed.

as I mentioned in the PR attaching a finalizer changes the resource version but not the generation. the operator only reconciles on new generations as this is the trigger for spec changes.

kimpenhaus avatar Dec 05 '25 07:12 kimpenhaus
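
For readers following along, the resource version vs. generation distinction described above can be sketched with a toy in-memory model. This is an illustration only and is not taken from KubeOps or the Kubernetes client: `Widget`, `FakeApiServer`, and `ConflictException` are hypothetical names, and the model assumes a CRD with the status subresource enabled. It shows that every accepted write changes the resource version, that only a spec change bumps the generation, and that writing with a stale copy is rejected the way a 409 Conflict would be.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a custom resource; not a KubeOps or k8s client type.
public sealed record Widget(string Name, string Spec, string Status, long ResourceVersion, long Generation);

public sealed class ConflictException : Exception
{
    public ConflictException(string name)
        : base($"409 Conflict: \"{name}\" has been modified; apply your changes to the latest version")
    {
    }
}

// Toy model of the API server's optimistic concurrency check.
public sealed class FakeApiServer
{
    private readonly Dictionary<string, Widget> _store = new();

    public Widget Create(string name, string spec)
    {
        var created = new Widget(name, spec, "", 1, 1);
        _store[name] = created;
        return created;
    }

    // Every accepted write bumps ResourceVersion; only a spec change bumps Generation.
    public Widget Replace(Widget incoming)
    {
        var current = _store[incoming.Name];
        if (incoming.ResourceVersion != current.ResourceVersion)
        {
            // The caller's copy is older than what is stored.
            throw new ConflictException(incoming.Name);
        }

        var specChanged = incoming.Spec != current.Spec;
        var stored = incoming with
        {
            ResourceVersion = current.ResourceVersion + 1,
            Generation = specChanged ? current.Generation + 1 : current.Generation
        };
        _store[incoming.Name] = stored;
        return stored;
    }
}

public static class ConcurrencyDemo
{
    public static void Main()
    {
        var server = new FakeApiServer();
        var widget = server.Create("my-widget", "v1");

        // Status-only write: resource version changes, generation does not.
        var updated = server.Replace(widget with { Status = "Ready" });
        Console.WriteLine($"rv={updated.ResourceVersion} gen={updated.Generation}");

        try
        {
            // Writing again with the stale copy is rejected.
            server.Replace(widget with { Status = "Degraded" });
        }
        catch (ConflictException ex)
        {
            Console.WriteLine(ex.Message);
        }

        // Writing with the returned copy succeeds, and the spec change bumps the generation.
        var latest = server.Replace(updated with { Spec = "v2" });
        Console.WriteLine($"rv={latest.ResourceVersion} gen={latest.Generation}");
    }
}
```

The same reasoning is why keeping the entity returned by an update call matters: any later write made with the older copy carries an outdated resource version.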

> the UpdateStatusAsync returns the modified/updated entity - this needs to be:
>
> entity = await client.UpdateStatusAsync(entity, cancellationToken);

You're absolutely right, dumb me totally missed that when migrating to v10.

I was really hoping that would be the issue but unfortunately, even after changing that, the 409 still happens under the same scenarios.

> Second I saw is, that the 409 exception was raised after the log entry WIDGET (my-widget): Reconciled

Correct. All 409 exceptions originate from within the library. The custom controller reconciliations all run to completion. You can see that the stack trace doesn't have any controller code in it.

Worth saying that the widget status actually changes, so from a controller perspective, it's really all fine. I think I only saw exceptions happening after WidgetController.ReconcileAsync() successfully returns.

> this behaves actually different - Kubernetes differentiates by resource version and generation. in short the difference is that the resource version is changed with every write on the crd while the generation is only changed when the spec is changed.
>
> as I mentioned in the PR attaching a finalizer changes the resource version but not the generation. the operator only reconciles on new generations as this is the trigger for spec changes.

Appreciate the explanation. This is very good info.

joaope avatar Dec 05 '25 23:12 joaope

Anyway, I was building the lib locally and doing some debugging.

I'm 90% convinced this is the race condition already raised in #977. The fact that the finalizers are being attached in between the operator's reconciliations just made it more prominent.

Your #1003 might make it slightly better, as it won't trigger resource replacements and reconciliations so often. I would probably go a step further and check whether the entity already has each finalizer with entity.HasFinalizer("id"), adding it only when that returns false. At that point I would probably also auto-attach only on the Added event; as I mentioned previously, I'm not sure why it needs to happen on Modified.

Attaching finalizers is probably something that belongs to operators. I'm not sure folks will want auto-attach in most of their use cases. Making it the default behaviour is a stretch, especially the way it works right now.

joaope avatar Dec 06 '25 03:12 joaope
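
The "only add when missing, only write when something changed" idea from the comment above could look roughly like the sketch below. It is a self-contained illustration, not the KubeOps implementation and not the change made in #1003: `Entity`, `FinalizerHelper.AttachMissing`, and the finalizer identifier are hypothetical, and the write is simulated with a counter instead of a real client call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal entity; stands in for a resource's metadata.finalizers list.
public sealed class Entity
{
    public List<string> Finalizers { get; } = new();
}

public static class FinalizerHelper
{
    // Adds only the identifiers that are missing and reports whether anything changed,
    // so the caller can skip the update call (and the resource version bump) otherwise.
    public static bool AttachMissing(Entity entity, IEnumerable<string> identifiers)
    {
        var missing = identifiers
            .Where(id => !entity.Finalizers.Contains(id))
            .Distinct()
            .ToList();

        entity.Finalizers.AddRange(missing);
        return missing.Count > 0;
    }
}

public static class FinalizerDemo
{
    public static void Main()
    {
        var entity = new Entity();
        var writes = 0;

        // Simulate the same reconcile step running three times for one entity.
        for (var i = 0; i < 3; i++)
        {
            if (FinalizerHelper.AttachMissing(entity, new[] { "example.com/widgetfinalizer" }))
            {
                writes++; // a real operator would call its client's update method here
            }
        }

        Console.WriteLine($"finalizers={entity.Finalizers.Count} writes={writes}");

        // With no registered finalizers, nothing changes and no write is needed.
        var untouched = FinalizerHelper.AttachMissing(new Entity(), Array.Empty<string>());
        Console.WriteLine($"changed={untouched}");
    }
}
```

Returning a "changed" flag keeps the decision to write with the caller, which also covers the case reported in this thread where no finalizers are registered at all.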

Just want to add that I'm having precisely this same issue since the v10 upgrade. I also don't have any finalizers, so would really appreciate a fix for this.

badgeratu avatar Dec 18 '25 09:12 badgeratu

I will have an in-depth look into this next week. It must be an issue when there is no finalizer - we have been running this code in production since June with no issues (but we do have finalizers).

As a workaround you could just disable the automated finalizer handling:

settings.AutoAttachFinalizers = false;
settings.AutoDetachFinalizers = false;

like described here: https://dotnet.github.io/dotnet-operator-sdk/docs/operator/advanced-configuration#finalizer-management

sorry if this is causing any trouble on your side!

kimpenhaus avatar Dec 18 '25 09:12 kimpenhaus

That workaround works great for me, thank you very much!

And no need to apologize - this project is incredible, the feature set and documentation all excellent. Compared to others in our company attempting to build operators in Go, this is a whole different developer and deployment experience. Keep up the great work!

badgeratu avatar Dec 18 '25 09:12 badgeratu