dotnet-operator-sdk
dotnet-operator-sdk copied to clipboard
ArgumentNullException in cache timer
Describe the bug I'm running with an operator using a forked repo with the v6.5.3 tag checked out so debugging will be easier.
After letting operator run for several minutes it crashes with an unhandled ArgumentNullException
being thrown from a timer doing something with your resource cache. It's failing here:
// ResourceCache{TEntity}.cs: 104
private bool Exists(TEntity resource) => _cache.ContainsKey(resource.Metadata.Uid);
I hacked around this by adding a string.IsNullOrEmpty()
check:
private bool Exists(TEntity resource) => !string.IsNullOrEmpty(resource.Metadata.Uid) && _cache.ContainsKey(resource.Metadata.Uid);
To Reproduce
This isn't super reproducable but this seems to happen after the controller's StatusModifiedAsync()
method is called. I created 20 custom resources with status and then patch the status when each resource is reconciled. Sometimes an ArgumentException
is thrown.
Expected behavior Expected no crash.
Screenshots If applicable, add screenshots to help explain your problem.
Additional context Here's the stack trace:
Unhandled exception. System.ArgumentNullException: Value cannot be null. (Parameter 'key')
at System.ThrowHelper.ThrowArgumentNullException(String name)
at System.Collections.Concurrent.ConcurrentDictionary`2.ContainsKey(TKey key)
at KubeOps.Operator.Caching.ResourceCache`1.Exists(TEntity resource) in C:\src\neonKUBE\Lib-forked\KubeOps\src\KubeOps\Operator\Caching\ResourceCache{TEntity}.cs:line 104
at KubeOps.Operator.Caching.ResourceCache`1.CompareCache(TEntity resource) in C:\src\neonKUBE\Lib-forked\KubeOps\src\KubeOps\Operator\Caching\ResourceCache{TEntity}.cs:line 79
at KubeOps.Operator.Caching.ResourceCache`1.Upsert(TEntity resource, CacheComparisonResult& result) in C:\src\neonKUBE\Lib-forked\KubeOps\src\KubeOps\Operator\Caching\ResourceCache{TEntity}.cs:line 36
at KubeOps.Operator.Controller.EventQueue`1.<>c__DisplayClass19_0.<<UpdateResourceData>b__0>d.MoveNext() in C:\src\neonKUBE\Lib-forked\KubeOps\src\KubeOps\Operator\Controller\EventQueue.cs:line 223
--- End of stack trace from previous location ---
at System.Reactive.PlatformServices.ExceptionServicesImpl.Rethrow(Exception exception) in /_/Rx.NET/Source/src/System.Reactive/Internal/ExceptionServicesImpl.cs:line 19
at System.Reactive.ExceptionHelpers.Throw(Exception exception) in /_/Rx.NET/Source/src/System.Reactive/Internal/ExceptionServices.cs:line 16
at System.Reactive.Stubs.<>c.<.cctor>b__2_1(Exception ex) in /_/Rx.NET/Source/src/System.Reactive/Internal/Stubs.cs:line 16
at System.Reactive.AnonymousObserver`1.OnErrorCore(Exception error) in /_/Rx.NET/Source/src/System.Reactive/AnonymousObserver.cs:line 73
at System.Reactive.ObserverBase`1.OnError(Exception error) in /_/Rx.NET/Source/src/System.Reactive/ObserverBase.cs:line 59
at System.Reactive.Linq.ObservableImpl.ConcatMany`1.ConcatManyOuterObserver.Drain() in /_/Rx.NET/Source/src/System.Reactive/Linq/Observable/ConcatMany.cs:line 149
at System.Reactive.Linq.ObservableImpl.ConcatMany`1.ConcatManyOuterObserver.InnerComplete() in /_/Rx.NET/Source/src/System.Reactive/Linq/Observable/ConcatMany.cs:line 119
at System.Reactive.Linq.ObservableImpl.ConcatMany`1.ConcatManyOuterObserver.InnerObserver.OnCompleted() in /_/Rx.NET/Source/src/System.Reactive/Linq/Observable/ConcatMany.cs:line 214
at System.Reactive.Sink`1.ForwardOnCompleted() in /_/Rx.NET/Source/src/System.Reactive/Internal/Sink.cs:line 54
at System.Reactive.Sink`2.OnCompleted() in /_/Rx.NET/Source/src/System.Reactive/Internal/Sink.cs:line 96
at System.Reactive.Threading.Tasks.TaskObservableExtensions.EmitTaskResult(Task task, IObserver`1 subject) in /_/Rx.NET/Source/src/System.Reactive/Threading/Tasks/TaskObservableExtensions.cs:line 161
at System.Reactive.Threading.Tasks.TaskObservableExtensions.SlowTaskObservable.<>c.<Subscribe>b__3_2(ValueTuple`2 tuple2) in /_/Rx.NET/Source/src/System.Reactive/Threading/Tasks/TaskObservableExtensions.cs:line 49
at System.Reactive.Concurrency.Scheduler.<>c__75`1.<ScheduleAction>b__75_0(IScheduler _, ValueTuple`2 tuple) in /_/Rx.NET/Source/src/System.Reactive/Concurrency/Scheduler.Simple.cs:line 65
at System.Reactive.Concurrency.UserWorkItem`1.Run() in /_/Rx.NET/Source/src/System.Reactive/Concurrency/UserWorkItem.cs:line 29
at System.Reactive.Concurrency.ThreadPoolScheduler.<>c__5`1.<Schedule>b__5_0(Object closureWorkItem) in /_/Rx.NET/Source/src/System.Reactive/Concurrency/ThreadPoolScheduler.cs:line 47
at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
at System.Threading.Thread.StartCallback()
My hack above didn't actually mitigate this: I'm seeing ArgumentNullException
exceptions thrown in other places afterwards.
I'll see if I can come up with a repro project.
It looks like this is also a problem for v7.0.0-preview.2, but it seems a bit more resilient.
I created several hundred custom resources, modified their status, and it seemed to run longer but it still eventually results in an ArgumentException
at:
// ResourceCache{TEntity}.cs: 104
private bool Exists(TEntity resource) => _cache.ContainsKey(resource.Metadata.Uid);
In theory, the condition that causes this is not possible. This is during re-processing an entity that had its event error or be requeued manually, which means the K8S API sent the entity over. In that case, it should always have a UID...
I'd appreciate if you could put a sample together because it almost sounds like an issue with K8S and I'm curious if I can reproduce the behavior.
So I've been playing with this some more and I noticed that my controller's ReconcileAsync()
method was returning a:
ResourceControllerResult.RequeueEvent(TimeSpan.FromSeconds(15));
for some reason. Removing this and just returning null
seems to fix this. So this may be what happening:
- KubeOps calls
ReconcileAsync()
with a resource -
ReconcileAsync()
patches theStatus
property - Then
ReconcileAsync()
returns aResourceControllerResult(TimeSpan.FromSeconds(15))
Does it even make sense for a controller to ask to requeue after changing the resource? (probably programmer error)
That workflow seems perfectly valid to me. Think of a situation where your controller is provisioning an external resource (say a VM on a hypervisor), if it comes time to reconcile the resource and you see the VM isn't ready (yet), you would want to check that again, thus telling the SDK to reconcile again in a short amount of time.
To me, having the SDK take issue with that is a concerning bug and I would be interested in a code sample where it can be reproduced. I'll try and piece it together tonight off of what you've given me, but I don't have a ton of time right now (and the issue could prove to be elusive).
OK: I'll try to create a simple repro project tomorrow morning (Sunday). It's actually sunny here in Seattle today and I'm going out for a drive. (I've been in the house all week)
FYI: I've been working on a simple repro without success for far. I'm going to investigate some more.
So, I'm not seeing this problem any more in my actual operator. I'm not sure what changed.
I'm going to keep this issue open for a few days in case I see this again, otherwise I'll close this.
Hey @jefflill, would you mind linking the forked repo here? Because this repository does not have a v6.5.3 official release. There is a 6.5.2 and 7.0.0-prerelease.2 tag right now.
My fork is here: https://github.com/neonforge-forks/dotnet-operator-sdk. I haven't made any changes to the fork.
So here's what happened:
- We were on an older release (not sure which) and we wanted to upgrade to pick up a later version of
KubernetesClient
. - I tried upgrading to the official v6.5.2 nuget package but was having trouble building. I couldn't figure this out, but there appeared to be a conflict between the new KubernetesClient Classic, KubeOps and DotnetKubernetesClient packages. I tried the usual: clearning nuget caches, deleting bin/obj folders, and clean rebuilds to no avail.
- Then I noticed that you have a v6.5.3 tag so I pulled that (via my fork) and rebuilt to try that locally and seemed to have some problems.
- I figured I'd try the v7.0.0-preview.2 package and that's working better.
I'm working on an extension class that allows me write applications that implement multiple control loops, each with their own leader election (like the Kubernetes controller manager does. This also maintains a collection of known resources for operators that need to perform operations on more than one resource at a time.
Frankly, I'm still wrapping my head around how operators work and my code is in progress, so any problems could have easily been my fault. Here's a link to what I have so far (including DEBUG logging):
https://github.com/nforgeio/neonKUBE/blob/master/Lib/Neon.Kube.Operator/ResourceManager.cs
I think it's pretty close now, just need to test after more refactoring.
Crap! I just started seeing this ArgumentNullException
again in my services using my extension.
I spent some time trying to repro this yesterday in a standalone project: https://github.com/nforgeio/support/tree/main/KubeOps/2022-06-12 without success (which seems to be working OK today). I'll have another look today.
WOHOO!!! I finally replicated this problem in a stand-alone test.
The problem happens when an exception is thrown in a controller's StatusModifiedAsync()
method. You can replicate this by:
- Configuring the current Kubernetes config for a running cluster
- Cloning my repo: https://github.com/nforgeio/support
- Opening the solution at:
$/KubeOps/22-06-12/TestKubeOps.sln
- Setting this command in your launch profile:
CreateModifyStatusException
- Configure
System.ArgumentNullException
to break into the debugger - Run the program
The program creates a new resource, modifies its status when reconciled and then throws an exception when StatusModifiedAsync()
is called. The program runs for a while and then ArgumentNullException
is thrown by the ContainsKey()
call as shown in the stack trace above.
NOTE: The test is referencing: KubeOps v7.0.0-preview.2
@buehler: this looks like an actual bug
That helps a lot; I'll see why it's doing that
I tried your directions and cannot reproduce. There is no solution in KubeOps/22-06-12 so I tried Kubeops/2022-06-12 and get to where an exception is thrown in StatusModified and then nothing happens.
...sorry, that was a typo
@erin-allison does that make sense? The concurrent directory should be able to handle this right? or did we screw up while refactoring the event queue?
I still cannot figure this issue out.
The source of the exception occurs at L223 in EventQueue
, because ResourceCache
cannot handle an entity passed with a null UID.
https://github.com/buehler/dotnet-operator-sdk/blob/704c99c52e6c1d5f0a8bfe08b5f0438e6b92cd00/src/KubeOps/Operator/Controller/EventQueue.cs#L208-L225
I can't find this as an issue with the SDK, however, as immediately prior to that call, the entity is returned from Kubernetes and null-checked. To me, this looks like a misbehaving (K8S) API server that is returning faulty objects for comparison, especially seeing as I cannot reproduce the behavior in any manner on a clean v1.24.1 installation.
Furthermore, I am hesitant to modify the cache logic as described in the original post as I am concerned that could end up masking other issues in the future by silently adjusting behavior when an invalid object is supplied. If anything, I would be inclined to more strictly check for a valid UID and throw a more descriptive error.
I guess to proceed from here, I'd like to see a dump of the value of resourceEvent and newResource in that lambda function at the time the exception is thrown.
... also maybe a log set to trace so we can see all of the behavior leading up to the exception, in case any of it turns out to be relevant.