
System.IO.IOException: The request was aborted in the KubeOps.Operator.Kubernetes.ResourceWatcher

Open · alexander-klang opened this issue 2 years ago

Describe the bug
The app stops watching the designated resources after encountering a System.IO.IOException in KubeOps.Operator.Kubernetes.ResourceWatcher.

To Reproduce
It happens after running a pod for a while. Conditions are not clear to me.

Expected behavior
The pod handles transient exceptions and continues to watch the designated resources.

Additional context
The call stack:

fail: KubeOps.Operator.Kubernetes.ResourceWatcher[0]
      There was an error while watching the resource "MyResource".
      System.IO.IOException: The request was aborted.
       ---> System.IO.IOException: The response ended prematurely while waiting for the next frame from the server.
         at System.Net.Http.Http2Connection.<ReadFrameAsync>g__ThrowMissingFrame|57_1()
         at System.Net.Http.Http2Connection.ReadFrameAsync(Boolean initialFrame)
         at System.Net.Http.Http2Connection.ProcessIncomingFramesAsync()
         --- End of inner exception stack trace ---
         at System.Net.Http.Http2Connection.ThrowRequestAborted(Exception innerException)
         at System.Net.Http.Http2Connection.Http2Stream.CheckResponseBodyState()
         at System.Net.Http.Http2Connection.Http2Stream.TryReadFromBuffer(Span`1 buffer, Boolean partOfSyncRead)
         at System.Net.Http.Http2Connection.Http2Stream.ReadDataAsync(Memory`1 buffer, HttpResponseMessage responseMessage, CancellationToken cancellationToken)
         at k8s.LineSeparatedHttpContent.CancelableStream.ReadAsync(Byte[] buffer, Int32 offset, Int32 count, CancellationToken cancellationToken)
         at System.IO.StreamReader.ReadBufferAsync(CancellationToken cancellationToken)
         at System.IO.StreamReader.ReadLineAsyncInternal()
         at k8s.Watcher`1.<>c.<CreateWatchEventEnumerator>b__21_1[TR](Task`1 t)
         at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
         at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
      --- End of stack trace from previous location ---
         at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
      --- End of stack trace from previous location ---
         at k8s.Watcher`1.CreateWatchEventEnumerator(Func`1 streamReaderCreator, Action`1 onError, CancellationToken cancellationToken)+MoveNext()
         at k8s.Watcher`1.CreateWatchEventEnumerator(Func`1 streamReaderCreator, Action`1 onError, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
         at k8s.Watcher`1.WatcherLoop(CancellationToken cancellationToken)
         at k8s.Watcher`1.WatcherLoop(CancellationToken cancellationToken)

Edit: KubeOps 6.6.0, k8s 1.21.14-gke.700

alexander-klang avatar Oct 05 '22 10:10 alexander-klang

It is happening for me as well. In my case I'm running KubeOps 6.0.18 on K8s 1.22.9.

I've double-checked, and this issue is only happening on my AKS clusters. Please take a look at the issue Watch Method stops watching in the official k8s dotnet client repo. A workaround is suggested in that thread; I'll try it myself and come back with the results.

@alexander-klang Is your environment by any chance running on AKS as well?

Crespalvaro avatar Oct 17 '22 15:10 Crespalvaro

@Crespalvaro I'm running on GKE (I've also added version info to the original post). I've taken a look at the workaround - please let us know if it works!

ghost avatar Oct 18 '22 08:10 ghost

Hi @alexander-klang.

I've been doing multiple tests regarding the subject and unfortunately I cannot come up with a definitive answer. However, I will still post the results of my research here, so maybe we can work out a solution together.

First of all, the hack suggested in the official C# client GitHub repo is only a mitigation strategy; it won't completely fix this issue, and it didn't for me. I've also seen that this hack was integrated in version 7.0.13 and removed in version 7.1.x, apparently in favour of a better strategy using k8s.Autorest as a replacement for Microsoft.Rest. Thus, I don't think we should keep looking further in that direction.

I've also played a bit with the timeout configuration and the reconcile requeue interval; it seems to have helped, although I'm still facing the issue.

Finally, I dug into this repository's code and found out that it already deals not only with the server closing the connection but also with any exception thrown, in ResourceWatcher.OnException. This disposes the k8s Watcher and recreates it. I'm confused, because OnException reconnects with a backoff strategy and OnClose does essentially the same. The only difference I see in the code is that _reconnectHandler awaits WatchResource, whereas OnClose calls RestartWatcher without await, so it won't wait, even though RestartWatcher itself calls await WatchResource.
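To illustrate what I mean by that difference, here is a generic, self-contained sketch (this is not the actual KubeOps code; RestartWatcherAsync is just a made-up stand-in): an awaited restart surfaces a failure, while a fire-and-forget restart drops it.

using System;
using System.Threading.Tasks;

class WatchRestartSketch
{
    static async Task Main()
    {
        // Awaited restart: a failure propagates to the caller and can be handled or retried.
        try
        {
            await RestartWatcherAsync();
        }
        catch (Exception e)
        {
            Console.WriteLine($"Awaited restart failed and was observed: {e.Message}");
        }

        // Fire-and-forget restart: the returned task is discarded, so the same failure
        // is never observed here and the watch would silently stay down.
        _ = RestartWatcherAsync();

        await Task.Delay(100); // give the discarded task time to fail unnoticed
    }

    static async Task RestartWatcherAsync()
    {
        await Task.Delay(10); // stand-in for re-creating the watch connection
        throw new InvalidOperationException("watch could not be re-established");
    }
}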

Below are some traces for the two cases (keep in mind the traces were captured in Grafana for in-depth tracking, so they appear in reverse chronological order):

Server Close Connection

2022-11-02 16:23:40 | Execute/Reconcile event "Reconcile" on resource "*****".
2022-11-02 16:23:40 | dbug: KubeOps.Operator.Controller.ManagedResourceController[0]
2022-11-02 16:23:40 | Mapped watch event to "Reconcile" for "*****"
2022-11-02 16:23:40 | trce: KubeOps.Operator.Controller.ManagedResourceController[0]
2022-11-02 16:23:40 | Mapping cache result "Other" for "******".
2022-11-02 16:23:40 | trce: KubeOps.Operator.Controller.ManagedResourceController[0]
2022-11-02 16:23:40 | Mapping watcher event "Added" for "*******".
2022-11-02 16:23:40 | trce: KubeOps.Operator.Controller.ManagedResourceController[0]
2022-11-02 16:23:40 | Received watch event "Added" for "*******".
2022-11-02 16:23:40 | trce: KubeOps.Operator.Kubernetes.ResourceWatcher[0]
2022-11-02 16:23:40 | Restarting resource watcher for type "******".
2022-11-02 16:23:40 | trce: KubeOps.Operator.Kubernetes.ResourceWatcher[0]
2022-11-02 16:23:40 | The server closed the connection. Trying to reconnect.

Connection Reset Exception Caught

2022-11-01 09:19:44 | Connection id "0HMLR4QKDK1P0" accepted.
2022-11-01 09:19:44 | dbug: Microsoft.AspNetCore.Server.Kestrel[39]
2022-11-01 09:19:40 | Received watch event "Added" for "******".
2022-11-01 09:19:40 | trce: KubeOps.Operator.Kubernetes.ResourceWatcher[0]
2022-11-01 09:19:37 | Trying to reconnect with exponential backoff 00:00:02.9860000.
2022-11-01 09:19:37 | info: KubeOps.Operator.Kubernetes.ResourceWatcher[0]
2022-11-01 09:19:37 | at k8s.Watcher`1.WatcherLoop(CancellationToken cancellationToken)
2022-11-01 09:19:37 | at k8s.Watcher`1.WatcherLoop(CancellationToken cancellationToken)
2022-11-01 09:19:37 | at k8s.Watcher`1.CreateWatchEventEnumerator(Func`1 streamReaderCreator, Action`1 onError, CancellationToken cancellationToken)+MoveNext()
2022-11-01 09:19:37 | --- End of stack trace from previous location ---
2022-11-01 09:19:37 | at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
2022-11-01 09:19:37 | --- End of stack trace from previous location ---
2022-11-01 09:19:37 | at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
2022-11-01 09:19:37 | at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
2022-11-01 09:19:37 | at k8s.Watcher`1.<>c.<CreateWatchEventEnumerator>b__21_1[TR](Task`1 t)
2022-11-01 09:19:37 | at System.IO.StreamReader.ReadLineAsyncInternal()
2022-11-01 09:19:37 | at System.IO.StreamReader.ReadBufferAsync(CancellationToken cancellationToken)
2022-11-01 09:19:37 | at k8s.LineSeparatedHttpContent.CancelableStream.ReadAsync(Byte[] buffer, Int32 offset, Int32 count, CancellationToken cancellationToken)
2022-11-01 09:19:37 | at System.Net.Http.Http2Connection.Http2Stream.ReadDataAsync(Memory`1 buffer, HttpResponseMessage responseMessage, CancellationToken cancellationToken)
2022-11-01 09:19:37 | at System.Net.Http.Http2Connection.Http2Stream.TryReadFromBuffer(Span`1 buffer, Boolean partOfSyncRead)
2022-11-01 09:19:37 | at System.Net.Http.Http2Connection.Http2Stream.CheckResponseBodyState()
2022-11-01 09:19:37 | at System.Net.Http.Http2Connection.ThrowRequestAborted(Exception innerException)
2022-11-01 09:19:37 | --- End of inner exception stack trace ---
2022-11-01 09:19:37 | at System.Net.Http.Http2Connection.ProcessIncomingFramesAsync()
2022-11-01 09:19:37 | at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](TIOAdapter adapter, Memory`1 buffer)
2022-11-01 09:19:37 | at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
2022-11-01 09:19:37 | at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
2022-11-01 09:19:37 | --- End of inner exception stack trace ---
2022-11-01 09:19:37 | ---> System.Net.Sockets.SocketException (104): Connection reset by peer
2022-11-01 09:19:37 | ---> System.IO.IOException: Unable to read data from the transport connection: Connection reset by peer.
2022-11-01 09:19:37 | System.IO.IOException: The request was aborted.
2022-11-01 09:19:37 | There was an error while watching the resource "*******".
2022-11-01 09:19:37 | fail: KubeOps.Operator.Kubernetes.ResourceWatcher[0]

As you can see, OnClose successfully recovers, whereas OnException successfully restarts the Watcher and catches a new "Received watch event "Added"" without propagating it to the EventQueue event mapping.

It is frustrating, because it seems like we are properly dealing with the error, but somehow there is something else that remains broken, and I could not identify it. @buehler Does this investigation give you an idea of what could be wrong?

I hope this is helpful and that I could explain myself in an understandable way.

Crespalvaro avatar Nov 02 '22 16:11 Crespalvaro

We observe the same problem in our operator with KubeOps 5.4.1 as well as the latest version, KubeOps 6.6.1, on Kubernetes 1.22. It is really difficult to debug because of its sporadic nature, but it is a real show-stopper for us, as the users of our operator rely on its stability without manual intervention such as restarts. Is there any progress on the issue? Thank you very much to all who have contributed to the investigation already!

anekdoti avatar Jan 11 '23 12:01 anekdoti

This is virtually impossible to retrace. When Kubernetes (or its API) closes the connection, you can do nothing. However, maybe we should actively let the server crash after some retries and trigger a redeployment of the pod... but that seems to be a rather harsh treatment...

buehler avatar Jan 11 '23 14:01 buehler

I wonder how this is done in, e.g., the Golang Operator SDK? I agree it sounds like a harsh treatment, but it would still be better than having an operator that does not reconcile anymore - especially in cases like ours, where we provide the operator to be used by others on their Kubernetes clusters.

anekdoti avatar Jan 11 '23 14:01 anekdoti

@anekdoti The best solution for us was to implement readiness and liveness probes. Basically, when the server closes the connection, the watcher remains broken and reconciles stop occurring.

Everything in my prior investigation helps mitigate the occurrences; still, these health checks will automatically restart the operator whenever it happens, fixing the issue.

An implementation example could be as follows (it is also the one I implemented): set a timestamp on the status of your custom resources on every reconcile; the health check then gets the resources and checks whether they have been idle (no updates) for more than x time (in my case 10 minutes). From there you can play with failureThreshold (max retries), periodSeconds (execution frequency), etc. - check the docs if needed: Liveness, Readiness Probes.
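A rough sketch of that health check (the HeartbeatMonitor type and its members are placeholders I made up for the example; in my actual setup the timestamp lives on the custom resources' status instead of in memory):

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Placeholder singleton that a controller would update at the end of every reconcile.
public class HeartbeatMonitor
{
    public DateTimeOffset LastReconcile { get; set; } = DateTimeOffset.UtcNow;
}

// Liveness check: report unhealthy when no reconcile has happened for too long,
// so Kubernetes restarts the operator pod and the watcher gets recreated.
public class ReconcileLivenessCheck : IHealthCheck
{
    private static readonly TimeSpan MaxIdleTime = TimeSpan.FromMinutes(10);
    private readonly HeartbeatMonitor _monitor;

    public ReconcileLivenessCheck(HeartbeatMonitor monitor) => _monitor = monitor;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var idle = DateTimeOffset.UtcNow - _monitor.LastReconcile;

        return Task.FromResult(idle <= MaxIdleTime
            ? HealthCheckResult.Healthy($"Last reconcile {idle.TotalSeconds:F0}s ago.")
            : HealthCheckResult.Unhealthy($"No reconcile for {idle.TotalMinutes:F0} minutes."));
    }
}

How the check gets registered depends on the KubeOps version and your hosting setup; the sketch only assumes the standard Microsoft.Extensions.Diagnostics.HealthChecks abstractions and a liveness probe pointed at the corresponding endpoint.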

I hope this helps you and anyone facing this issue now or in the future.

Crespalvaro avatar Jan 11 '23 15:01 Crespalvaro

Hi, I think that I have really found out what "terminates" the reconnect subscription. More precisely, the publisher (the IObserver<TimeSpan> part of the ISubject<TimeSpan>) stops emitting new values into the pipeline, and the IObservable<TimeSpan> remains idle. The resetting logic of the _reconnectAttempts variable resets it only in case of successful reconnections.

Occasionally (or is it rather exactly after 40 consecutively failed reconnection attempts?), _reconnectAttempts reaches the threshold of 40 retries, and then, as depicted below, an unhandled System.OverflowException is thrown from the _settings.ErrorBackoffStrategy (ExponentialBackoffStrategy) delegate:

(screenshot: unhandled System.OverflowException thrown from the ExponentialBackoffStrategy delegate)
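To make the overflow concrete, here is a small standalone sketch, assuming the backoff is roughly 2^retries seconds (the actual ExponentialBackoffStrategy implementation may differ):

using System;

class BackoffOverflowDemo
{
    static void Main()
    {
        // Assumed backoff of 2^retries seconds; TimeSpan.MaxValue is roughly
        // 9.2e11 seconds (~29,000 years), so the 40th attempt no longer fits.
        for (var retries = 38; retries <= 40; retries++)
        {
            try
            {
                var backoff = TimeSpan.FromSeconds(Math.Pow(2, retries));
                Console.WriteLine($"Retry {retries}: backoff {backoff.TotalDays:N0} days");
            }
            catch (OverflowException)
            {
                Console.WriteLine($"Retry {retries}: OverflowException (backoff exceeds TimeSpan.MaxValue)");
            }
        }
    }
}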

After this, _reconnectHandler.OnNext(backoff) is no longer called to retrigger a new Observable.Timer with the backoff. Here is an illustrative fix:

// Cap the attempt count so the exponential backoff can no longer overflow.
const int MaxRetriesAttempts = 39;

++_reconnectAttempts;

_logger.LogError(e, @"There was an error while watching the resource ""{resource}"".", typeof(TEntity));

// Feed the capped attempt count into the backoff strategy and clamp the resulting delay.
var backoff = _settings.ErrorBackoffStrategy(_reconnectAttempts > MaxRetriesAttempts ? MaxRetriesAttempts : _reconnectAttempts);
if (backoff.TotalSeconds > _settings.WatcherMaxRetrySeconds)
{
    backoff = TimeSpan.FromSeconds(_settings.WatcherMaxRetrySeconds);
}

Of course, adjust it to best fit the library. Would it also make sense to surround some parts of the OnException method with a try/catch block to ensure that OnNext is called?

try
{
    //...
}
catch (Exception exception)
{
    _logger.LogError(exception, "Failed...");
}
finally
{
    // Always re-emit the backoff so the reconnect pipeline keeps running.
    _reconnectHandler.OnNext(backoff);
}

The end result of the above-described issue is that the operator stops watching a resource of type TEntity.

Thank you in advance!

tomasfabian avatar Jan 13 '23 11:01 tomasfabian

Hello @Crespalvaro and @tomasfabian !

First of all, thank you very much, @Crespalvaro, for sharing your workaround. We were actually about to implement our interpretation of it, until @tomasfabian shared his analysis with me (thank you!). We had a call about it, and I think @tomasfabian is on to something - and the problem described is so subtle that I could easily imagine how it was not caught earlier on.

What do you think (also @buehler)?

anekdoti avatar Jan 13 '23 12:01 anekdoti

Wow..! Thank you all for the investigative work!

I think the issue with the backoff strategy seems legit.

buehler avatar Jan 13 '23 15:01 buehler

Hi @buehler @anekdoti @alexander-klang @Crespalvaro and everyone else!

I prepared a PR which should fix the issue regarding the stopped reconciliations.

Thank you @anekdoti for leading me toward the provided fix during our call!

tomasfabian avatar Jan 16 '23 11:01 tomasfabian