[bug]: Reconciliation fails with 409 Conflict on Modified entities since v10
Describe the bug
Reconciliation fails on Modified events. There have been no code changes since v9 apart from updating namespaces and return values to accommodate the breaking changes. The stack trace looks like this:
fail: KubeOps.Operator.Watcher.ResourceWatcher[0]
Reconciliation of Modified for CustomObj/mycrd failed.
k8s.Autorest.HttpOperationException: Operation returned an invalid status code 'Conflict', response body {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on my.custom.objs.io \"mycrd\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"mycrd","group":"my.custom.objs.io","kind":"mycdrs"},"code":409}
at k8s.Kubernetes.SendRequestRaw(String requestContent, HttpRequestMessage httpRequest, CancellationToken cancellationToken)
at k8s.AbstractKubernetes.ICustomObjectsOperations_ReplaceNamespacedCustomObjectWithHttpMessagesAsync[T](Object body, String group, String version, String namespaceParameter, String plural, String name, String dryRun, String fieldManager, String fieldValidation, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken)
at k8s.AbstractKubernetes.k8s.ICustomObjectsOperations.ReplaceNamespacedCustomObjectWithHttpMessagesAsync[T](Object body, String group, String version, String namespaceParameter, String plural, String name, String dryRun, String fieldManager, String fieldValidation, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken)
at k8s.GenericClient.ReplaceNamespacedAsync[T](T obj, String ns, String name, CancellationToken cancel)
at KubeOps.KubernetesClient.KubernetesClient.UpdateAsync[TEntity](TEntity entity, CancellationToken cancellationToken)
at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileEntity(TEntity entity, CancellationToken cancellationToken)
at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileEntity(TEntity entity, CancellationToken cancellationToken)
at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileModification(ReconciliationContext`1 reconciliationContext, CancellationToken cancellationToken)
at KubeOps.Operator.Reconciliation.Reconciler`1.Reconcile(ReconciliationContext`1 reconciliationContext, CancellationToken cancellationToken)
at KubeOps.Operator.Watcher.ResourceWatcher`1.OnEventAsync(WatchEventType eventType, TEntity entity, CancellationToken cancellationToken)
at KubeOps.Operator.Watcher.ResourceWatcher`1.WatchClientEventsAsync(CancellationToken stoppingToken)
It's worth adding that the problem doesn't seem to occur when AutoAttachFinalizers is turned off, so I believe it lies somewhere in the new Reconciler.cs, in particular in this if-statement:
if (operatorSettings.AutoAttachFinalizers)
{
    var finalizers = scope.ServiceProvider.GetKeyedServices<IEntityFinalizer<TEntity>>(KeyedService.AnyKey);
    foreach (var finalizer in finalizers)
    {
        entity.AddFinalizer(finalizer.GetIdentifierName(entity));
    }

    entity = await client.UpdateAsync(entity, cancellationToken);
}

var controller = scope.ServiceProvider.GetRequiredService<IEntityController<TEntity>>();
return await controller.ReconcileAsync(entity, cancellationToken);
I don't have any finalizers but isn't this too late to update an entity with them? And also just before the reconciliation so any changes to the object will generate a 409.
This runs not only when I have no finalizers, but also every time an entity is Added or Modified, because this logic oddly treats both cases the same even though they are fundamentally different:
public async Task<ReconciliationResult<TEntity>> Reconcile(ReconciliationContext<TEntity> reconciliationContext, CancellationToken cancellationToken)
{
    var result = reconciliationContext.EventType switch
    {
        WatchEventType.Added or WatchEventType.Modified =>
            await ReconcileModification(reconciliationContext, cancellationToken),
        WatchEventType.Deleted =>
            await ReconcileDeletion(reconciliationContext, cancellationToken),
        _ => throw new NotSupportedException($"Reconciliation event type {reconciliationContext.EventType} is not supported!"),
    };

    // ...
}
These are some initial findings; I may be able to give more context later. Thanks!
To reproduce
- Create a controller that only changes the entity status and updates it via client.UpdateStatusAsync()
- Add a new entity to the cluster
- See reconciliation running
- Modify the object manually
- Reconciliation fails with the exception above
Expected behavior
Reconciliation does not fail.
Something that might be related, in the same area, was also raised by @buehler on #980:
https://github.com/dotnet/dotnet-operator-sdk/pull/980/files#r2517837131
@joaope I'll have a look into this. Just to make sure I get it right:
- you don't have any finalizers?
Could you elaborate a little on this? Sorry, but I don't quite get it (not a native speaker here).
I don't have any finalizers but isn't this too late to update an entity with them? And also just before the reconciliation so any changes to the object will generate a 409.
Thanks.
P.S.: As a workaround until this is fixed, please disable auto-attaching/detaching, as you mentioned.
@joaope I've opened a PR that only updates the entity when at least one finalizer is attached. I've also extended the existing integration test to cover your scenario:
- create an instance of a CRD
- during reconciliation of the Added event, change the status
With (and without) the optimization, this integration test runs successfully. Maybe you can re-check once the PR is completed?
thanks for reporting 😄
Could you elaborate a little on this? Sorry, but I don't quite get it (not a native speaker here).
I don't have any finalizers but isn't this too late to update an entity with them? And also just before the reconciliation so any changes to the object will generate a 409.
Sorry, I probably worded it badly (not a native speaker either!). I was just thinking out loud, as I'm not that familiar with finalizers. Isn't entity.AddFinalizer() something that should happen only once, on WatchEventType.Added?
My understanding is that this logic is being called for both Added and Modified events so it ends up adding the same finalizers (and pushing an Update() in the process) to the same entity over and over again every time it’s modified.
Maybe that's fine, and it's just my lack of knowledge about finalizers and how KubeOps' internal reconciliation logic works.
The smallest repro I can come up with: https://gist.github.com/joaope/8170b2c25d7429f8d5e4b67773764713
It fails in different scenarios:
- Deploy the operator without any Widget in the cluster, and only then create a Widget resource:
info: WidgetsOperator.ControlPlane.WidgetController[0]
WIDGET (my-widget): Reconciling...
info: WidgetsOperator.Operator.ControlPlane.WidgetController[0]
WIDGET (my-widget): Reconciled
fail: KubeOps.Operator.Watcher.ResourceWatcher[0]
Reconciliation of Modified for Widget/my-widget failed.
k8s.Autorest.HttpOperationException: Operation returned an invalid status code 'Conflict', response body {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on widgets.example.com \"my-widget\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"my-widget","group":"example.com","kind":"widgets"},"code":409}
at k8s.Kubernetes.SendRequestRaw(String requestContent, HttpRequestMessage httpRequest, CancellationToken cancellationToken)
at k8s.AbstractKubernetes.ICustomObjectsOperations_ReplaceNamespacedCustomObjectWithHttpMessagesAsync[T](Object body, String group, String version, String namespaceParameter, String plural, String name, String dryRun, String fieldManager, String fieldValidation, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken)
at k8s.AbstractKubernetes.k8s.ICustomObjectsOperations.ReplaceNamespacedCustomObjectWithHttpMessagesAsync[T](Object body, String group, String version, String namespaceParameter, String plural, String name, String dryRun, String fieldManager, String fieldValidation, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken)
at k8s.GenericClient.ReplaceNamespacedAsync[T](T obj, String ns, String name, CancellationToken cancel)
at KubeOps.KubernetesClient.KubernetesClient.UpdateAsync[TEntity](TEntity entity, CancellationToken cancellationToken)
at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileEntity(TEntity entity, CancellationToken cancellationToken)
at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileEntity(TEntity entity, CancellationToken cancellationToken)
at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileModification(ReconciliationContext`1 reconciliationContext, CancellationToken cancellationToken)
at KubeOps.Operator.Reconciliation.Reconciler`1.Reconcile(ReconciliationContext`1 reconciliationContext, CancellationToken cancellationToken)
at KubeOps.Operator.Watcher.ResourceWatcher`1.OnEventAsync(WatchEventType eventType, TEntity entity, CancellationToken cancellationToken)
at KubeOps.Operator.Watcher.ResourceWatcher`1.WatchClientEventsAsync(CancellationToken stoppingToken)
- Deploy the operator with a Widget resource already in the cluster, then delete and recreate the resource while the operator is running:
dbug: Microsoft.Extensions.Hosting.Internal.Host[2]
Hosting started
info: WidgetsOperator.ControlPlane.WidgetController[0]
WIDGET (my-widget): Reconciling...
info: WidgetsOperator.ControlPlane.WidgetController[0]
WIDGET (my-widget): Reconciled
info: WidgetsOperator.ControlPlane.WidgetController[0]
WIDGET (my-widget): Deleted
info: WidgetsOperator.ControlPlane.WidgetController[0]
WIDGET (my-widget): Reconciling...
info: WidgetsOperator.ControlPlane.WidgetController[0]
WIDGET (my-widget): Reconciled
fail: KubeOps.Operator.Watcher.ResourceWatcher[0]
Reconciliation of Modified for Widget/my-widget failed.
k8s.Autorest.HttpOperationException: Operation returned an invalid status code 'Conflict', response body {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on widgets.example.com \"my-widget\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"my-widget","group":"example.com","kind":"widgets"},"code":409}
at k8s.Kubernetes.SendRequestRaw(String requestContent, HttpRequestMessage httpRequest, CancellationToken cancellationToken)
at k8s.AbstractKubernetes.ICustomObjectsOperations_ReplaceNamespacedCustomObjectWithHttpMessagesAsync[T](Object body, String group, String version, String namespaceParameter, String plural, String name, String dryRun, String fieldManager, String fieldValidation, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken)
at k8s.AbstractKubernetes.k8s.ICustomObjectsOperations.ReplaceNamespacedCustomObjectWithHttpMessagesAsync[T](Object body, String group, String version, String namespaceParameter, String plural, String name, String dryRun, String fieldManager, String fieldValidation, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken)
at k8s.GenericClient.ReplaceNamespacedAsync[T](T obj, String ns, String name, CancellationToken cancel)
at KubeOps.KubernetesClient.KubernetesClient.UpdateAsync[TEntity](TEntity entity, CancellationToken cancellationToken)
at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileEntity(TEntity entity, CancellationToken cancellationToken)
at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileEntity(TEntity entity, CancellationToken cancellationToken)
at KubeOps.Operator.Reconciliation.Reconciler`1.ReconcileModification(ReconciliationContext`1 reconciliationContext, CancellationToken cancellationToken)
at KubeOps.Operator.Reconciliation.Reconciler`1.Reconcile(ReconciliationContext`1 reconciliationContext, CancellationToken cancellationToken)
at KubeOps.Operator.Watcher.ResourceWatcher`1.OnEventAsync(WatchEventType eventType, TEntity entity, CancellationToken cancellationToken)
at KubeOps.Operator.Watcher.ResourceWatcher`1.WatchClientEventsAsync(CancellationToken stoppingToken)
So the reconciliation on the client side (my controller) does seem to happen, and the error is internal to the library?
Obviously, if I turn the automatic finalizer handling off, none of the above happens.
@joaope sorry for the delayed answer, I hadn't had the chance to take a closer look.
First of all, I think there is a major issue in your code (the WidgetController class in the gist, line 13):
await client.UpdateStatusAsync(entity, cancellationToken);
UpdateStatusAsync returns the modified/updated entity, so this needs to be:
entity = await client.UpdateStatusAsync(entity, cancellationToken);
When the updated entity is not used, every further attempt to modify the entity will lead to a 409.
The second thing I noticed is that the 409 exception was raised after the log entry WIDGET (my-widget): Reconciled
This supports my assumption that the main cause is line 13 in WidgetController. Maybe you can quickly fix this and give it another try.
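To illustrate why discarding the returned entity ends in a 409, here is a minimal, self-contained sketch of optimistic concurrency on resourceVersion. FakeEntity and FakeApiServer are hypothetical types written only for this illustration; they are not KubeOps or k8s client APIs, and they only approximate the API server, which rejects a replace whose resourceVersion no longer matches the stored object.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for a custom resource: only the fields
// needed to show optimistic concurrency (resourceVersion) are modelled.
public sealed record FakeEntity(string Name, long ResourceVersion, string Status);

// Hypothetical in-memory stand-in for the API server (not a KubeOps or k8s type).
// Like a real replace call, it rejects writes carrying a stale resourceVersion.
public sealed class FakeApiServer
{
    private readonly Dictionary<string, FakeEntity> _store = new();

    public FakeEntity Create(string name)
    {
        var created = new FakeEntity(name, 1, "");
        _store[name] = created;
        return created;
    }

    public FakeEntity Replace(FakeEntity entity)
    {
        var current = _store[entity.Name];
        if (current.ResourceVersion != entity.ResourceVersion)
        {
            // Mirrors the API server's "409 Conflict: the object has been modified".
            throw new InvalidOperationException(
                $"409 Conflict: stale resourceVersion {entity.ResourceVersion}, current is {current.ResourceVersion}");
        }

        // Every successful write bumps the version; callers must keep using this copy.
        var stored = entity with { ResourceVersion = current.ResourceVersion + 1 };
        _store[entity.Name] = stored;
        return stored;
    }
}
```

Discarding the result of the first Replace leaves the caller holding version 1, so a second Replace with that copy throws; reassigning the variable (entity = server.Replace(...)) avoids it, which is the same pattern as entity = await client.UpdateStatusAsync(entity, cancellationToken).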
My understanding is that this logic is being called for both Added and Modified events so it ends up adding the same finalizers (and pushing an Update() in the process) to the same entity over and over again every time it’s modified.
This actually behaves differently: Kubernetes distinguishes between resource version and generation. In short, the resource version changes with every write to the custom resource, while the generation only changes when the spec changes.
As I mentioned in the PR, attaching a finalizer changes the resource version but not the generation. The operator only reconciles on new generations, as these signal spec changes.
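A minimal sketch of the generation check described above, assuming the operator remembers the last generation it reconciled per object UID. MetaSnapshot and GenerationFilter are hypothetical names for illustration only; this is not how KubeOps implements it internally.

```csharp
using System.Collections.Generic;

// Hypothetical snapshot of the two metadata fields discussed above:
// resourceVersion changes on every write, generation only when the spec changes.
public sealed record MetaSnapshot(string Uid, string ResourceVersion, long Generation);

// Hypothetical filter, not the KubeOps implementation: a watch event only
// triggers reconciliation if its generation is newer than the last one seen.
// ResourceVersion is deliberately ignored.
public sealed class GenerationFilter
{
    private readonly Dictionary<string, long> _lastGeneration = new();

    public bool ShouldReconcile(MetaSnapshot meta)
    {
        if (_lastGeneration.TryGetValue(meta.Uid, out var seen) && meta.Generation <= seen)
        {
            // Same generation: e.g. a finalizer or status write bumped only resourceVersion.
            return false;
        }

        _lastGeneration[meta.Uid] = meta.Generation;
        return true;
    }
}
```

With this filter, a write that only attaches a finalizer (new resourceVersion, same generation) is skipped, while a spec change (new generation) is reconciled.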
UpdateStatusAsync returns the modified/updated entity, so this needs to be:
entity = await client.UpdateStatusAsync(entity, cancellationToken);
You're absolutely right, dumb me totally missed that when migrating to v10.
I was really hoping that was the issue, but unfortunately, even after changing it, the 409 still happens in the same scenarios.
The second thing I noticed is that the 409 exception was raised after the log entry WIDGET (my-widget): Reconciled
Correct. All the 409 exceptions originate from within the library; the custom controller reconciliations all run to completion. You can see that the stack trace doesn't contain any controller code.
It's worth saying that the widget status actually changes, so from the controller's perspective it's really all fine. As far as I can tell, the exceptions only happen after WidgetController.ReconcileAsync() successfully returns.
This actually behaves differently: Kubernetes distinguishes between resource version and generation. In short, the resource version changes with every write to the custom resource, while the generation only changes when the spec changes.
As I mentioned in the PR, attaching a finalizer changes the resource version but not the generation. The operator only reconciles on new generations, as these signal spec changes.
Appreciate the explanation. This is very good info.
Anyway, I built the library locally and did some debugging.
I'm 90% convinced this is the race condition already raised in #977. The fact that finalizers are attached between the operator's reconciliations just makes it more prominent.
Your #1003 might improve things slightly, as it won't trigger resource replacements and reconciliations as often. I would go a step further and check whether the entity already has the finalizers (entity.HasFinalizer("id")), adding them only when it doesn't. At that point I would probably also auto-attach only on the Added event; as I mentioned before, I'm not sure why it needs to happen on Modified.
Attaching finalizers is probably something that belongs in the operator's own code. I'm not sure people will want auto-attach for most of their use cases; making it the default behaviour is a stretch, especially the way it works right now.
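The idempotent attach step suggested above could look roughly like this: check each finalizer with HasFinalizer first and report whether anything was added, so the caller only issues an update when the entity actually changed. EntitySketch and FinalizerAttach.AttachMissing are hypothetical names written for this sketch; it illustrates the idea and is not the code in #1003.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for an entity's metadata.finalizers list
// (not the KubeOps entity type).
public sealed class EntitySketch
{
    public List<string> Finalizers { get; } = new();

    public bool HasFinalizer(string name) => Finalizers.Contains(name);
}

public static class FinalizerAttach
{
    // Adds only the finalizers that are missing and reports whether the entity changed.
    // A caller would skip the update call (and the resourceVersion bump it causes)
    // when this returns false: no registered finalizers, or all already present.
    public static bool AttachMissing(EntitySketch entity, IEnumerable<string> finalizerNames)
    {
        var changed = false;
        foreach (var name in finalizerNames)
        {
            if (entity.HasFinalizer(name))
            {
                continue;
            }

            entity.Finalizers.Add(name);
            changed = true;
        }

        return changed;
    }
}
```

In the reconciler, the client.UpdateAsync(entity, cancellationToken) call would then only run when AttachMissing returns true, so an entity with no registered finalizers, or with all of them already present, is not replaced before the controller runs.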
I just want to add that I've had exactly the same issue since the v10 upgrade. I also don't have any finalizers, so I would really appreciate a fix for this.
I will look into this in depth next week. It must be an issue that occurs when there is no finalizer; we have been running this code in production since June with no issues (but we do have finalizers).
As a workaround, you can simply disable the automated finalizer handling:
settings.AutoAttachFinalizers = false;
settings.AutoDetachFinalizers = false;
as described here: https://dotnet.github.io/dotnet-operator-sdk/docs/operator/advanced-configuration#finalizer-management
Sorry if this is causing any trouble on your side!
That workaround works great for me, thank you very much!
And no need to apologize - this project is incredible, the feature set and documentation all excellent. Compared to others in our company attempting to build operators in Go, this is a whole different developer and deployment experience. Keep up the great work!