dotnet-operator-sdk icon indicating copy to clipboard operation
dotnet-operator-sdk copied to clipboard

ArgumentNullException in cache timer

Open jefflill opened this issue 2 years ago • 20 comments

Describe the bug I'm running with an operator using a forked repo with the v6.5.3 tag checked out so debugging will be easier.

After letting operator run for several minutes it crashes with an unhandled ArgumentNullException being thrown from a timer doing something with your resource cache. It's failing here:

// ResourceCache{TEntity}.cs: 104

private bool Exists(TEntity resource) => _cache.ContainsKey(resource.Metadata.Uid);

I hacked around this by adding a string.IsNullOrEmpty() check:

private bool Exists(TEntity resource) => !string.IsNullOrEmpty(resource.Metadata.Uid) && _cache.ContainsKey(resource.Metadata.Uid);

To Reproduce This isn't super reproducable but this seems to happen after the controller's StatusModifiedAsync() method is called. I created 20 custom resources with status and then patch the status when each resource is reconciled. Sometimes an ArgumentException is thrown.

Expected behavior Expected no crash.

Screenshots If applicable, add screenshots to help explain your problem.

Additional context Here's the stack trace:

Unhandled exception. System.ArgumentNullException: Value cannot be null. (Parameter 'key')
   at System.ThrowHelper.ThrowArgumentNullException(String name)
   at System.Collections.Concurrent.ConcurrentDictionary`2.ContainsKey(TKey key)
   at KubeOps.Operator.Caching.ResourceCache`1.Exists(TEntity resource) in C:\src\neonKUBE\Lib-forked\KubeOps\src\KubeOps\Operator\Caching\ResourceCache{TEntity}.cs:line 104
   at KubeOps.Operator.Caching.ResourceCache`1.CompareCache(TEntity resource) in C:\src\neonKUBE\Lib-forked\KubeOps\src\KubeOps\Operator\Caching\ResourceCache{TEntity}.cs:line 79
   at KubeOps.Operator.Caching.ResourceCache`1.Upsert(TEntity resource, CacheComparisonResult& result) in C:\src\neonKUBE\Lib-forked\KubeOps\src\KubeOps\Operator\Caching\ResourceCache{TEntity}.cs:line 36
   at KubeOps.Operator.Controller.EventQueue`1.<>c__DisplayClass19_0.<<UpdateResourceData>b__0>d.MoveNext() in C:\src\neonKUBE\Lib-forked\KubeOps\src\KubeOps\Operator\Controller\EventQueue.cs:line 223
--- End of stack trace from previous location ---
   at System.Reactive.PlatformServices.ExceptionServicesImpl.Rethrow(Exception exception) in /_/Rx.NET/Source/src/System.Reactive/Internal/ExceptionServicesImpl.cs:line 19
   at System.Reactive.ExceptionHelpers.Throw(Exception exception) in /_/Rx.NET/Source/src/System.Reactive/Internal/ExceptionServices.cs:line 16
   at System.Reactive.Stubs.<>c.<.cctor>b__2_1(Exception ex) in /_/Rx.NET/Source/src/System.Reactive/Internal/Stubs.cs:line 16
   at System.Reactive.AnonymousObserver`1.OnErrorCore(Exception error) in /_/Rx.NET/Source/src/System.Reactive/AnonymousObserver.cs:line 73
   at System.Reactive.ObserverBase`1.OnError(Exception error) in /_/Rx.NET/Source/src/System.Reactive/ObserverBase.cs:line 59
   at System.Reactive.Linq.ObservableImpl.ConcatMany`1.ConcatManyOuterObserver.Drain() in /_/Rx.NET/Source/src/System.Reactive/Linq/Observable/ConcatMany.cs:line 149
   at System.Reactive.Linq.ObservableImpl.ConcatMany`1.ConcatManyOuterObserver.InnerComplete() in /_/Rx.NET/Source/src/System.Reactive/Linq/Observable/ConcatMany.cs:line 119
   at System.Reactive.Linq.ObservableImpl.ConcatMany`1.ConcatManyOuterObserver.InnerObserver.OnCompleted() in /_/Rx.NET/Source/src/System.Reactive/Linq/Observable/ConcatMany.cs:line 214
   at System.Reactive.Sink`1.ForwardOnCompleted() in /_/Rx.NET/Source/src/System.Reactive/Internal/Sink.cs:line 54
   at System.Reactive.Sink`2.OnCompleted() in /_/Rx.NET/Source/src/System.Reactive/Internal/Sink.cs:line 96
   at System.Reactive.Threading.Tasks.TaskObservableExtensions.EmitTaskResult(Task task, IObserver`1 subject) in /_/Rx.NET/Source/src/System.Reactive/Threading/Tasks/TaskObservableExtensions.cs:line 161
   at System.Reactive.Threading.Tasks.TaskObservableExtensions.SlowTaskObservable.<>c.<Subscribe>b__3_2(ValueTuple`2 tuple2) in /_/Rx.NET/Source/src/System.Reactive/Threading/Tasks/TaskObservableExtensions.cs:line 49
   at System.Reactive.Concurrency.Scheduler.<>c__75`1.<ScheduleAction>b__75_0(IScheduler _, ValueTuple`2 tuple) in /_/Rx.NET/Source/src/System.Reactive/Concurrency/Scheduler.Simple.cs:line 65
   at System.Reactive.Concurrency.UserWorkItem`1.Run() in /_/Rx.NET/Source/src/System.Reactive/Concurrency/UserWorkItem.cs:line 29
   at System.Reactive.Concurrency.ThreadPoolScheduler.<>c__5`1.<Schedule>b__5_0(Object closureWorkItem) in /_/Rx.NET/Source/src/System.Reactive/Concurrency/ThreadPoolScheduler.cs:line 47
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

jefflill avatar Jun 11 '22 20:06 jefflill

My hack above didn't actually mitigate this: I'm seeing ArgumentNullException exceptions thrown in other places afterwards.

I'll see if I can come up with a repro project.

jefflill avatar Jun 11 '22 21:06 jefflill

It looks like this is also a problem for v7.0.0-preview.2, but it seems a bit more resilient.

I created several hundred custom resources, modified their status, and it seemed to run longer but it still eventually results in an ArgumentException at:

// ResourceCache{TEntity}.cs: 104

private bool Exists(TEntity resource) => _cache.ContainsKey(resource.Metadata.Uid);

jefflill avatar Jun 11 '22 21:06 jefflill

In theory, the condition that causes this is not possible. This is during re-processing an entity that had its event error or be requeued manually, which means the K8S API sent the entity over. In that case, it should always have a UID...

erin-allison avatar Jun 11 '22 22:06 erin-allison

I'd appreciate if you could put a sample together because it almost sounds like an issue with K8S and I'm curious if I can reproduce the behavior.

erin-allison avatar Jun 11 '22 22:06 erin-allison

So I've been playing with this some more and I noticed that my controller's ReconcileAsync() method was returning a:

ResourceControllerResult.RequeueEvent(TimeSpan.FromSeconds(15));

for some reason. Removing this and just returning null seems to fix this. So this may be what happening:

  1. KubeOps calls ReconcileAsync() with a resource
  2. ReconcileAsync() patches the Status property
  3. Then ReconcileAsync() returns a ResourceControllerResult(TimeSpan.FromSeconds(15))

Does it even make sense for a controller to ask to requeue after changing the resource? (probably programmer error)

jefflill avatar Jun 11 '22 22:06 jefflill

That workflow seems perfectly valid to me. Think of a situation where your controller is provisioning an external resource (say a VM on a hypervisor), if it comes time to reconcile the resource and you see the VM isn't ready (yet), you would want to check that again, thus telling the SDK to reconcile again in a short amount of time.

To me, having the SDK take issue with that is a concerning bug and I would be interested in a code sample where it can be reproduced. I'll try and piece it together tonight off of what you've given me, but I don't have a ton of time right now (and the issue could prove to be elusive).

erin-allison avatar Jun 11 '22 22:06 erin-allison

OK: I'll try to create a simple repro project tomorrow morning (Sunday). It's actually sunny here in Seattle today and I'm going out for a drive. (I've been in the house all week)

jefflill avatar Jun 11 '22 22:06 jefflill

FYI: I've been working on a simple repro without success for far. I'm going to investigate some more.

jefflill avatar Jun 12 '22 23:06 jefflill

So, I'm not seeing this problem any more in my actual operator. I'm not sure what changed.

I'm going to keep this issue open for a few days in case I see this again, otherwise I'll close this.

jefflill avatar Jun 13 '22 02:06 jefflill

Hey @jefflill, would you mind linking the forked repo here? Because this repository does not have a v6.5.3 official release. There is a 6.5.2 and 7.0.0-prerelease.2 tag right now.

buehler avatar Jun 13 '22 07:06 buehler

My fork is here: https://github.com/neonforge-forks/dotnet-operator-sdk. I haven't made any changes to the fork.

So here's what happened:

  1. We were on an older release (not sure which) and we wanted to upgrade to pick up a later version of KubernetesClient.
  2. I tried upgrading to the official v6.5.2 nuget package but was having trouble building. I couldn't figure this out, but there appeared to be a conflict between the new KubernetesClient Classic, KubeOps and DotnetKubernetesClient packages. I tried the usual: clearning nuget caches, deleting bin/obj folders, and clean rebuilds to no avail.
  3. Then I noticed that you have a v6.5.3 tag so I pulled that (via my fork) and rebuilt to try that locally and seemed to have some problems.
  4. I figured I'd try the v7.0.0-preview.2 package and that's working better.

I'm working on an extension class that allows me write applications that implement multiple control loops, each with their own leader election (like the Kubernetes controller manager does. This also maintains a collection of known resources for operators that need to perform operations on more than one resource at a time.

Frankly, I'm still wrapping my head around how operators work and my code is in progress, so any problems could have easily been my fault. Here's a link to what I have so far (including DEBUG logging):

https://github.com/nforgeio/neonKUBE/blob/master/Lib/Neon.Kube.Operator/ResourceManager.cs

I think it's pretty close now, just need to test after more refactoring.

jefflill avatar Jun 13 '22 18:06 jefflill

Crap! I just started seeing this ArgumentNullException again in my services using my extension.

I spent some time trying to repro this yesterday in a standalone project: https://github.com/nforgeio/support/tree/main/KubeOps/2022-06-12 without success (which seems to be working OK today). I'll have another look today.

jefflill avatar Jun 13 '22 19:06 jefflill

WOHOO!!! I finally replicated this problem in a stand-alone test.

The problem happens when an exception is thrown in a controller's StatusModifiedAsync() method. You can replicate this by:

  1. Configuring the current Kubernetes config for a running cluster
  2. Cloning my repo: https://github.com/nforgeio/support
  3. Opening the solution at: $/KubeOps/22-06-12/TestKubeOps.sln
  4. Setting this command in your launch profile: CreateModifyStatusException
  5. Configure System.ArgumentNullException to break into the debugger
  6. Run the program

The program creates a new resource, modifies its status when reconciled and then throws an exception when StatusModifiedAsync() is called. The program runs for a while and then ArgumentNullException is thrown by the ContainsKey() call as shown in the stack trace above.

NOTE: The test is referencing: KubeOps v7.0.0-preview.2

jefflill avatar Jun 13 '22 20:06 jefflill

@buehler: this looks like an actual bug

jefflill avatar Jun 14 '22 21:06 jefflill

That helps a lot; I'll see why it's doing that

erin-allison avatar Jun 14 '22 21:06 erin-allison

I tried your directions and cannot reproduce. There is no solution in KubeOps/22-06-12 so I tried Kubeops/2022-06-12 and get to where an exception is thrown in StatusModified and then nothing happens.

erin-allison avatar Jun 14 '22 22:06 erin-allison

...sorry, that was a typo

jefflill avatar Jun 15 '22 21:06 jefflill

@erin-allison does that make sense? The concurrent directory should be able to handle this right? or did we screw up while refactoring the event queue?

buehler avatar Jun 16 '22 06:06 buehler

I still cannot figure this issue out.

The source of the exception occurs at L223 in EventQueue, because ResourceCache cannot handle an entity passed with a null UID.

https://github.com/buehler/dotnet-operator-sdk/blob/704c99c52e6c1d5f0a8bfe08b5f0438e6b92cd00/src/KubeOps/Operator/Controller/EventQueue.cs#L208-L225

I can't find this as an issue with the SDK, however, as immediately prior to that call, the entity is returned from Kubernetes and null-checked. To me, this looks like a misbehaving (K8S) API server that is returning faulty objects for comparison, especially seeing as I cannot reproduce the behavior in any manner on a clean v1.24.1 installation.

Furthermore, I am hesitant to modify the cache logic as described in the original post as I am concerned that could end up masking other issues in the future by silently adjusting behavior when an invalid object is supplied. If anything, I would be inclined to more strictly check for a valid UID and throw a more descriptive error.

erin-allison avatar Jun 16 '22 12:06 erin-allison

I guess to proceed from here, I'd like to see a dump of the value of resourceEvent and newResource in that lambda function at the time the exception is thrown.

... also maybe a log set to trace so we can see all of the behavior leading up to the exception, in case any of it turns out to be relevant.

erin-allison avatar Jun 16 '22 12:06 erin-allison