
watcher connection stops receiving events after some time

jkryl opened this issue 3 years ago · 33 comments

This is a known issue that some people have worked around by using a pull rather than a push method for receiving updates about changes. Another workaround is to restart the watcher every n minutes. As @brendandburns pointed out in my earlier PR #576, the C# k8s client suffers from the same problem: https://github.com/kubernetes-client/csharp/issues/533 .

My experience shows that it happens when the connection is idle for a long time. The connection is dropped without being closed, so the client keeps waiting for events and never receives any. I have seen it on Azure and on Google Cloud with their managed k8s services.

The C# issue suggests that it happens because keep-alives are not enabled on the underlying connection. And indeed, I found that this is the case for the JS k8s client too. It could be fixed by simply adding the keep-alive option to the "request" options, if it weren't for a bug in the request library. I have created a new ticket for it: https://github.com/request/request/issues/3367 . The request library has been deprecated and cannot be fixed, but I was able to work around the bug in the watcher's code, so with my fix the connections are kept alive. My experience shows that a TCP ACK is exchanged between client and server every three minutes. I would like the keep-alive to happen more often, to detect dead watcher connections in a more timely fashion, but it does not seem to be possible to tweak the keep-alive interval for a connection in nodejs: https://github.com/nodejs/node-v0.x-archive/issues/4109 .

The fix I have does not seem to solve the problem in all cases. That might be because a keep-alive interval of 3 minutes is not always sufficient. I will test the fix more thoroughly and update the ticket with the results.
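For context, the workaround amounts to enabling TCP keep-alive on the socket once the watch request has one. A rough sketch of the idea in plain Node (not the actual patch; the API server URL is made up and authentication is omitted):

import * as https from 'https';

// Illustrative only: a raw HTTPS watch request against an API server
// (authentication omitted). The real client goes through the deprecated
// "request" library, which is why its keep-alive option could not simply be
// switched on (see the request bug linked above).
const req = https.request('https://my-apiserver:6443/api/v1/pods?watch=true');

req.on('socket', (socket) => {
  // Enable TCP keep-alive probes on the underlying socket. Node only lets us
  // set the initial delay (here an arbitrary 30 seconds), not the probe
  // interval, which matches the limitation mentioned above.
  socket.setKeepAlive(true, 30_000);
});

req.on('response', (res) => {
  res.on('data', (chunk: Buffer) => process.stdout.write(chunk));
});

req.end();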

jkryl · Feb 25 '21

Let me correct myself. The keep-alives shown by tcpdump were not sent by the watcher; they are sent by the API server every 3 minutes:

...
11:32:59.239617 IP 227.114.246.35.bc.googleusercontent.com.https > gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832: Flags [.], ack 1, win 501, options [nop,nop,TS val 2612916151 ecr 3766451319], length 0
11:32:59.239753 IP gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832 > 227.114.246.35.bc.googleusercontent.com.https: Flags [.], ack 19557, win 331, options [nop,nop,TS val 3766631543 ecr 2611652359], length 0
11:35:59.463662 IP 227.114.246.35.bc.googleusercontent.com.https > gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832: Flags [.], ack 1, win 501, options [nop,nop,TS val 2613096375 ecr 3766631543], length 0
11:35:59.463762 IP gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832 > 227.114.246.35.bc.googleusercontent.com.https: Flags [.], ack 19557, win 331, options [nop,nop,TS val 3766811767 ecr 2611652359], length 0

Now comes the breaking point, roughly half an hour after the connection was established. The watcher receives a couple of bytes from the server and immediately starts to send keep-alives every second. That's when the connection goes dead and no longer receives any new events.

11:38:09.258810 IP 227.114.246.35.bc.googleusercontent.com.https > gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832: Flags [P.], seq 19557:19584, ack 1, win 501, options [nop,nop,TS val 2613226170 ecr 3766811767], length 27
11:38:09.258890 IP gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832 > 227.114.246.35.bc.googleusercontent.com.https: Flags [.], ack 19584, win 331, options [nop,nop,TS val 3766941562 ecr 2613226170], length 0
11:38:10.273184 IP gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832 > 227.114.246.35.bc.googleusercontent.com.https: Flags [.], ack 19584, win 331, options [nop,nop,TS val 3766942576 ecr 2613226170], length 0
11:38:10.273612 IP 227.114.246.35.bc.googleusercontent.com.https > gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832: Flags [.], ack 1, win 501, options [nop,nop,TS val 2613227185 ecr 3766941562], length 0
11:38:11.297176 IP gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832 > 227.114.246.35.bc.googleusercontent.com.https: Flags [.], ack 19584, win 331, options [nop,nop,TS val 3766943600 ecr 2613227185], length 0
11:38:11.297533 IP 227.114.246.35.bc.googleusercontent.com.https > gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832: Flags [.], ack 1, win 501, options [nop,nop,TS val 2613228209 ecr 3766941562], length 0
...

My keep-alive patch does not seem to change this behaviour. Either it's not the way to go, or it is incomplete and still missing something. For reference, this is the commit I have been testing: https://github.com/jkryl/javascript/commit/0fef4de1e7c1dacfa7da04f1e4f6e2047525c9f3#diff-8af1000c89b03ced37f439c61c5696c45e1e83a70cc07182feef6595123f0bad

jkryl · Feb 25 '21

In general, I think relying on watches to stick around for more than N minutes (where N is fairly small) is probably not the best bet. I think adding a timeout if an event hasn't been received in N minutes (you can tune N depending on your workload) is probably the right approach.

The network is just too weird a place to expect that HTTP/TCP connections will stay alive for a long time.

brendandburns · Feb 26 '21

Yes, I tend to agree. It would be interesting to know what the golang watcher does, given that it's more or less the reference implementation for k8s watchers. I decided to work around the problem by doing exactly what you have suggested 🤞

jkryl · Mar 02 '21

Can you point out how to set the timeout? Would that be on the list function, e.g.

listNamespacedPod(
  namespace, 
  undefined, // pretty?: string, 
  undefined, // allowWatchBookmarks?: boolean, 
  undefined, // _continue?: string, 
  undefined, // fieldSelector?: string,
  undefined, // labelSelector?: string, 
  undefined, // limit?: number, 
  undefined, // resourceVersion?: string, 
  undefined, // resourceVersionMatch?: string,
  300, // timeoutSeconds?: number, 
  true, // watch?: boolean
)

DocX · Mar 04 '21

@DocX I would place a timer in your own code, e.g. something like:

function maybeRestartInformer() {
    if (noUpdates) {
        restartInformer();
    }
    setTimeout(maybeRestartInformer, <timeout>);
}

setTimeout(maybeRestartInformer, <timeout>);

or something like that.
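Spelled out a little more, the same idea could look like this (a sketch only; the informer shape, the restartInformer callback, and the five-minute window are placeholders for whatever your code uses):

// Sketch only: restart the informer if no event has arrived within the window.
const IDLE_WINDOW_MS = 5 * 60 * 1000; // tune to your workload

function restartWhenIdle(
  informer: { on(verb: string, cb: () => void): void },
  restartInformer: () => void,
) {
  let lastEventAt = Date.now();

  // Any event you already subscribe to should bump the timestamp.
  for (const verb of ['add', 'update', 'delete']) {
    informer.on(verb, () => { lastEventAt = Date.now(); });
  }

  const maybeRestartInformer = () => {
    if (Date.now() - lastEventAt > IDLE_WINDOW_MS) {
      restartInformer(); // e.g. recreate the informer and call start() again
      lastEventAt = Date.now(); // avoid restarting again on the very next tick
    }
    setTimeout(maybeRestartInformer, IDLE_WINDOW_MS);
  };
  setTimeout(maybeRestartInformer, IDLE_WINDOW_MS);
}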

brendandburns · Mar 06 '21

@jkryl the relevant code starts here:

https://github.com/kubernetes/client-go/blob/master/rest/request.go#L674

There's also a big discussion here:

https://github.com/kubernetes/client-go/issues/374

and here

https://github.com/kubernetes/kubernetes/issues/65012#issuecomment-410320028

The consensus seems to be to switch to HTTP/2 and send pings from the client.
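For reference, the HTTP/2 ping idea looks roughly like this in Node (a minimal sketch, not what client-go or this client actually does; the endpoint, path, and 30-second interval are made up, and TLS/auth options are omitted):

import * as http2 from 'http2';

// Hypothetical endpoint; a real client would also pass TLS options and a
// bearer-token header here.
const session = http2.connect('https://my-apiserver:6443');

// Periodically ping the server; if a ping fails, tear the session down so
// the caller can re-establish the watch.
const pingTimer = setInterval(() => {
  session.ping((err) => {
    if (err) {
      clearInterval(pingTimer);
      session.destroy(err);
    }
  });
}, 30_000);

const req = session.request({ ':path': '/api/v1/pods?watch=true' });
req.on('data', (chunk: Buffer) => process.stdout.write(chunk));
req.on('close', () => {
  clearInterval(pingTimer);
  session.close();
});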

brendandburns · Mar 06 '21

@brendandburns This might be slightly off the topic of this issue, but can you give any advice on how to properly restart an informer? Informers don't seem to have a stop() function, and I'm worried that just re-calling start() is going to cause some problems.

jmickey · Mar 07 '21

@DocX: As @brendandburns suggests, I have special code in my watcher wrapper that restarts the watcher after it has been idle for too long. A snippet from my current code:

_setIdleTimeout() {
  if (this.idleTimeout > 0) {
    this._clearTimer();
    this.timer = setTimeout(() => {
      this.stop();
      this.start();
    }, this.idleTimeout);
  }
}

_onEvent(event: string, cr: CustomResource) {
  ...
  this._setIdleTimeout();
  ...
}

The full implementation, specific to my use case, is here: https://github.com/openebs/Mayastor/blob/develop/csi/moac/watcher.ts#L225

jkryl · Mar 08 '21

Informers don't seem to have a stop() function, and I'm worried that just re-calling start() is going to cause some problems.

~~makeInformer returns a ListWatch which does have a stop() method, which just works in JS world?~~

~~Quite possibly the interface Informer<T> should be updated to expose that in TS world.~~

It seems that stop() just makes the request abort, which in turn just restarts it.

Yes, there is a need for a proper stop method 👍

dominykas · Mar 09 '21

@dominykas that's a fair request. I filed https://github.com/kubernetes-client/javascript/issues/604 to track.

A PR would be welcome, or I will get to it eventually.

brendandburns · Mar 10 '21

Thanks 🙇

I'll see if I can catch a break to PR this soon enough. Guidelines on how you'd approach it are most welcome; I have my ideas, but I'm not sure about alternatives.

dominykas · Mar 11 '21

I would add some sort of intercept/break in the timeout handler code path.

e.g.

if (aborted) { return; }

Or somesuch.
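A rough sketch of that guard (hypothetical names; the real watch/cache code is structured differently):

// Hypothetical wrapper, not the library's actual classes: a restartable
// watch with a real stop(). startWatch is assumed to kick off the underlying
// request and return a function that aborts it; it calls onDone when the
// connection closes for any reason.
class RestartableWatch {
  private aborted = false;
  private abortRequest?: () => void;

  constructor(
    private startWatch: (onDone: (err?: Error) => void) => () => void,
  ) {}

  start() {
    this.aborted = false;
    this.abortRequest = this.startWatch((err) => this.onDone(err));
  }

  stop() {
    this.aborted = true;
    this.abortRequest?.();
  }

  private onDone(_err?: Error) {
    // The intercept suggested above: if stop() was requested, do not restart.
    if (this.aborted) {
      return;
    }
    this.start();
  }
}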

brendandburns · Mar 11 '21

Not sure if this is relevant, but just FYI: specifying the timeout seconds on the list function, as shared above, seems to have gotten rid of the problem for us.

DocX · Mar 12 '21

@brendandburns We've seen a similar issue with the Informer API in AKS. After a short period of time, depending on the size of the cluster, the Informer stops receiving events, goes into an ECONNRESET loop, and eventually becomes unresponsive.

I've raised a ticket with Azure support. They came back with some findings that might be helpful:

  • Interestingly, from the traces we saw a lot of RSTs coming directly from the API server without even any handshake.
  • It seems there is no keepalive? I.e., the client TCP socket would normally send a keepalive packet from time to time to make sure this is an active TCP stream. Servers usually discard inactive TCP streams with an RST.

Indeed, there is no keepalive in Informer requests.

Hope this helps :)

dkontorovskyy · Mar 17 '21

Can you check and test whether https://github.com/kubernetes-client/javascript/pull/630 fixes your issue?

bacongobbler · Apr 08 '21

@dkontorovskyy I pushed a 0.14.2 release which contains @bacongobbler's keep-alive fix.

brendandburns · Apr 08 '21

I got this error after exactly 10 minutes (v0.14.3, OpenShift version 4.5.35):

watch on resource pods.v1 failed: {"code":"ECONNRESET",
"name":"Error","message":"aborted","stack":"Error: aborted\n    
at connResetException (node:internal/errors:642:14)\n    
at TLSSocket.socketCloseListener (node:_http_client:424:27)\n    
at TLSSocket.emit (node:events:381:22)\n    at node:net:666:12\n    
at TCP.done (node:_tls_wrap:577:7)"

and also for the informer

Error: aborted
    at connResetException (node:internal/errors:642:14)
    at TLSSocket.socketCloseListener (node:_http_client:424:27)
    at TLSSocket.emit (node:events:381:22)
    at node:net:666:12
    at TCP.done (node:_tls_wrap:577:7) {
  code: 'ECONNRESET'
}

node v15.9.0

huineng · Apr 12 '21

We are experiencing the same issue after about an hour (v0.14.3). setKeepAlive is probably not working due to https://github.com/request/request/issues/3367 not being merged yet. Does anyone have a suitable workaround until we get HTTP/2 pinging working?

(screenshot attached)

mariusziemke · Apr 30 '21

We noticed a similar issue and verified that it was solved by adding the close handler to the request instead of the stream in Watch.watch: https://github.com/kubernetes-client/javascript/blob/master/src/watch.ts#L120 . Alternatively, changing close to end on that line also fixed it, which we tried because of this code in request.js. As for the cause of the request closing, we thought it could be the --min-request-timeout flag of kube-apiserver.
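To illustrate the difference, here is a simplified sketch using Node's https module rather than the deprecated request library the client actually uses (names and structure are illustrative, not the real watch.ts):

import * as https from 'https';
import { PassThrough } from 'stream';

// Simplified stand-in for the watch implementation: "done" is whatever
// callback the watcher uses to notice that the watch ended, so the caller
// can restart it.
function openWatch(url: string, done: (err?: Error) => void): void {
  const lines = new PassThrough();
  lines.on('data', (chunk: Buffer) => process.stdout.write(chunk));

  let finished = false;
  const finish = (err?: Error) => {
    if (!finished) {
      finished = true;
      done(err);
    }
  };

  const req = https.request(url, (res) => {
    res.pipe(lines);
  });
  // Listening for 'close' on the request itself (rather than on the piped
  // "lines" stream) means done() still fires when the server drops the
  // connection, even if the parsed stream is never destroyed.
  req.on('close', () => finish());
  req.on('error', (err) => finish(err));
  req.end();
}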

hermansje · May 03 '21

We upgraded node from version 12 to 15 and the issue seems to be gone. Not sure why, but maybe this helps :-)

mariusziemke · May 03 '21

Interesting find @mariusziemke. Can anyone else confirm upgrading to node 15 works for them as well?

bacongobbler · May 11 '21

Interesting find @mariusziemke. Can anyone else confirm upgrading to node 15 works for them as well?

We tried Node 16 and it also works...

ivanstanev · May 13 '21

We use Node 14.16.0 and have no issue with kubernetes-client 0.14.3 (also using watch, and with no additional timeout logic).

DocX · May 17 '21

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Aug 15 '21

/remove-lifecycle stale

entropitor · Aug 15 '21

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Nov 13 '21

/remove-lifecycle stale

entropitor · Nov 14 '21

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Feb 12 '22

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Mar 14 '22

/remove-lifecycle rotten

chengjianhua · Mar 31 '22