watcher connection stops receiving events after some time
It is a known issue that some people have worked around by using a pull rather than a push method for receiving updates about changes. Another workaround is to restart the watcher every n minutes. As @brendandburns pointed out in my earlier PR576, the C# k8s client suffers from the same problem: https://github.com/kubernetes-client/csharp/issues/533 .
My experience shows that it happens when the connection is idle for a long time. The connection is dropped without being closed, so the client keeps waiting for events and never receives any. I have seen it on Azure and Google Cloud with their managed k8s services.
The C# issue suggests that it happens because keep-alives are not enabled on the underlying connection. And indeed, I found that this is the case for the JS k8s client too. It could be fixed by adding the keep-alive option to the "request" options, if there weren't a bug in the request library. I have created a new ticket for it: https://github.com/request/request/issues/3367 . The request library has been deprecated, so that bug will not be fixed, but I was able to work around it in the watcher's code, so with my fix the connections are kept alive. My experience shows that a TCP ACK is exchanged between client and server every three minutes. I would like the keep-alive to happen more often, to detect dead watcher connections in a more timely fashion, however it does not seem to be possible to tweak the keep-alive interval for a connection in nodejs: https://github.com/nodejs/node-v0.x-archive/issues/4109 .
The fix I have does not seem to solve the problem in every case. That might be because a keep-alive interval of 3 minutes is not always sufficient. I will test the fix more thoroughly and update the ticket with the results.
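For readers who want the gist of the technique: a minimal, stand-alone sketch of enabling TCP keep-alive on the socket behind an outgoing HTTPS request in Node (the URL is made up, auth is omitted, and this is not the actual patch):
// Ask the kernel to probe the peer when the socket behind an outgoing HTTPS
// request goes idle, instead of letting an idle connection silently die.
import * as https from 'https';

const req = https.request('https://my-apiserver.example:6443/api/v1/pods?watch=true', (res) => {
  res.on('data', (chunk) => console.log(chunk.toString()));
});

req.on('socket', (socket) => {
  // Send the first keep-alive probe after 30s of idleness. Node only exposes
  // this initial delay; the probe interval itself is an OS-level setting,
  // which is why the interval cannot be tweaked per connection.
  socket.setKeepAlive(true, 30000);
});

req.on('error', (err) => console.error('watch request failed:', err));
req.end();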
Let me correct myself. The keep-alives shown by tcpdump were not keep-alives sent by the watcher; they are sent by the API server every 3 minutes:
...
11:32:59.239617 IP 227.114.246.35.bc.googleusercontent.com.https > gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832: Flags [.], ack 1, win 501, options [nop,nop,TS val 2612916151 ecr 3766451319], length 0
11:32:59.239753 IP gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832 > 227.114.246.35.bc.googleusercontent.com.https: Flags [.], ack 19557, win 331, options [nop,nop,TS val 3766631543 ecr 2611652359], length 0
11:35:59.463662 IP 227.114.246.35.bc.googleusercontent.com.https > gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832: Flags [.], ack 1, win 501, options [nop,nop,TS val 2613096375 ecr 3766631543], length 0
11:35:59.463762 IP gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832 > 227.114.246.35.bc.googleusercontent.com.https: Flags [.], ack 19557, win 331, options [nop,nop,TS val 3766811767 ecr 2611652359], length 0
Now comes the breaking point, roughly half an hour after the connection was established. The watcher receives a couple of bytes from the server and immediately starts to send keep-alives every second. That's when the connection becomes dead and does not receive any new events.
11:38:09.258810 IP 227.114.246.35.bc.googleusercontent.com.https > gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832: Flags [P.], seq 19557:19584, ack 1, win 501, options [nop,nop,TS val 2613226170 ecr 3766811767], length 27
11:38:09.258890 IP gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832 > 227.114.246.35.bc.googleusercontent.com.https: Flags [.], ack 19584, win 331, options [nop,nop,TS val 3766941562 ecr 2613226170], length 0
11:38:10.273184 IP gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832 > 227.114.246.35.bc.googleusercontent.com.https: Flags [.], ack 19584, win 331, options [nop,nop,TS val 3766942576 ecr 2613226170], length 0
11:38:10.273612 IP 227.114.246.35.bc.googleusercontent.com.https > gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832: Flags [.], ack 1, win 501, options [nop,nop,TS val 2613227185 ecr 3766941562], length 0
11:38:11.297176 IP gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832 > 227.114.246.35.bc.googleusercontent.com.https: Flags [.], ack 19584, win 331, options [nop,nop,TS val 3766943600 ecr 2613227185], length 0
11:38:11.297533 IP 227.114.246.35.bc.googleusercontent.com.https > gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832: Flags [.], ack 1, win 501, options [nop,nop,TS val 2613228209 ecr 3766941562], length 0
...
My keep-alive patch does not seem to change this behaviour. Either it's not the way to go, or it is not complete and still missing something. For reference, this is the commit I have been testing: https://github.com/jkryl/javascript/commit/0fef4de1e7c1dacfa7da04f1e4f6e2047525c9f3#diff-8af1000c89b03ced37f439c61c5696c45e1e83a70cc07182feef6595123f0bad
In general, I think relying on watches to stick around for greater than N minutes (where N is fairly small) is probably not the best bet. I think adding a timeout if an event hasn't been received in N minutes (you can tune N depending on your workload) is probably the right approach.
The network is just too weird a place to expect that HTTP/TCP connections will stay alive for a long time.
Yes, I tend to agree. It would be interesting to know what the golang watcher does, given that it's kind of the reference implementation for k8s watchers. I decided to work around the problem by doing exactly what you have suggested 🤞
Can you point out how to set the timeout? Would that be on the list function, e.g.
listNamespacedPod(
  namespace,
  undefined, // pretty?: string,
  undefined, // allowWatchBookmarks?: boolean,
  undefined, // _continue?: string,
  undefined, // fieldSelector?: string,
  undefined, // labelSelector?: string,
  undefined, // limit?: number,
  undefined, // resourceVersion?: string,
  undefined, // resourceVersionMatch?: string,
  300,       // timeoutSeconds?: number,
  true,      // watch?: boolean
)
@DocX I would place a timer in your own code, e.g. something like:
function maybeRestartInformer() {
  if (noUpdates) {
    restartInformer();
  }
  setTimeout(maybeRestartInformer, <timeout>);
}
setTimeout(maybeRestartInformer, <timeout>);
or something like that.
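A slightly more fleshed-out sketch of that timer idea, assuming the Watch API documented for this version of the client (where watch() resolves to a request object exposing abort()); the path, namespace, and timings are only examples:
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();

const watch = new k8s.Watch(kc);
const idleTimeoutMs = 5 * 60 * 1000; // restart if no event arrives for 5 minutes

async function startWatch(): Promise<void> {
  let idleTimer: NodeJS.Timeout | undefined;

  const req = await watch.watch(
    '/api/v1/namespaces/default/pods',
    {},
    (type, obj) => {
      // every received event resets the idle timer
      resetIdleTimer();
      console.log(`${type}: ${obj?.metadata?.name}`);
    },
    () => {
      // the watch ended (error, abort, or server-side timeout): restart it
      if (idleTimer) clearTimeout(idleTimer);
      setTimeout(startWatch, 1000);
    }
  );

  function resetIdleTimer() {
    if (idleTimer) clearTimeout(idleTimer);
    // no events for too long: assume the connection is dead and abort,
    // which fires the done callback above and restarts the watch
    idleTimer = setTimeout(() => req.abort(), idleTimeoutMs);
  }

  resetIdleTimer();
}

startWatch().catch((err) => console.error('failed to start watch:', err));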
@jkryl the relevant code starts here:
https://github.com/kubernetes/client-go/blob/master/rest/request.go#L674
There's also a big discussion here:
https://github.com/kubernetes/client-go/issues/374
and here
https://github.com/kubernetes/kubernetes/issues/65012#issuecomment-410320028
The consensus seems to be switching to HTTP/2 and sending Pings from the client.
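For illustration only (the JS client does not do this today), client-side HTTP/2 pings with Node's built-in http2 module look roughly like this; the endpoint and interval are made up:
import * as http2 from 'http2';

const session = http2.connect('https://my-apiserver.example:6443');

const pinger = setInterval(() => {
  session.ping((err, durationMs) => {
    if (err) {
      // no PING ack: treat the connection as dead so the caller can
      // re-establish the watch on a fresh session
      clearInterval(pinger);
      session.destroy(err);
    } else {
      console.log(`ping acked in ${durationMs}ms`);
    }
  });
}, 30000);

session.on('error', (err) => {
  clearInterval(pinger);
  console.error('http2 session error:', err);
});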
@brendandburns This might be slightly off the topic of this issue, but can you give any advice on how to properly restart an informer? Informers don't seem to have a stop() function, and I'm worried that just re-calling start() is going to cause some problems.
@DocX : As @brendandburns suggests, I have special code in my watcher wrapper that restarts the watcher after it has been idle for too long. Snippet from my current code:
_setIdleTimeout() {
  if (this.idleTimeout > 0) {
    this._clearTimer();
    this.timer = setTimeout(() => {
      this.stop();
      this.start();
    }, this.idleTimeout);
  }
}

_onEvent(event: string, cr: CustomResource) {
  ...
  this._setIdleTimeout();
  ...
}
Full implementation specific for my use case is here: https://github.com/openebs/Mayastor/blob/develop/csi/moac/watcher.ts#L225
Informers don't seem to have a stop() function, and I'm worried that just re-calling start() is going to cause some problems.
~~makeInformer returns a ListWatch which does have a stop() method, which just works in JS world?~~
~~Quite possibly the interface Informer<T> should be updated to expose that in TS world.~~
It seems that stop() just makes the request abort, which in turn just restarts it.
Yes, there is a need for a proper stop method 👍
@dominykas that's a fair request. I filed https://github.com/kubernetes-client/javascript/issues/604 to track.
A PR would be welcome, or I will get to it eventually.
Thanks 🙇
I'll see if I can catch a break to PR this soon enough. Guidelines on how you'd approach it are most welcome - I have my ideas, but I'm not sure about alternatives.
I would add some sort of intercept/break in the timeout handler code path.
e.g.
if (aborted) { return; }
Or somesuch.
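To make that concrete, a hypothetical shape of such a guard (the class and member names here are made up, not the library's API):
class RestartingWatch {
  private aborted = false;
  private req?: { abort(): void };

  async start(): Promise<void> {
    this.aborted = false;
    // ... issue the watch request, keep `this.req`, and arrange for
    // this.onDone() to run when the request ends or times out ...
  }

  stop(): void {
    this.aborted = true;
    this.req?.abort();
  }

  private onDone(): void {
    if (this.aborted) {
      return; // stop() was called: do not restart
    }
    this.start(); // otherwise transparently re-establish the watch
  }
}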
Not sure if this is relevant, but just FYI: specifying the timeout seconds on the list function, as shared above, seems to have gotten rid of the problem for us.
@brendandburns We've seen a similar issue with the Informer API in AKS. After a short period of time, depending on how big the cluster was, the Informer stops receiving events, goes into an ECONNRESET loop, and then eventually becomes unresponsive.
I've raised a ticket with Azure support. They came back with some findings, which might be helpful:
- From the traces, interestingly, we saw a lot of RSTs from the API server directly, without even having any handshake.
- It seems there is no keepalive? I.e. the client TCP socket would send a keepalive packet from time to time to make sure this is an active TCP stream; servers usually discard inactive TCP streams with RST.
Indeed, there is no keepalive in Informer requests.
Hope this helps :)
Can you check and test whether https://github.com/kubernetes-client/javascript/pull/630 fixes your issue?
@dkontorovskyy I pushed a 0.14.2 release which contains @bacongobbler's keep-alive fix.
I got this error after exactly 10 minutes (v0.14.3) (OpenShift version 4.5.35):
watch on resource pods.v1 failed: {"code":"ECONNRESET",
"name":"Error","message":"aborted","stack":"Error: aborted\n
at connResetException (node:internal/errors:642:14)\n
at TLSSocket.socketCloseListener (node:_http_client:424:27)\n
at TLSSocket.emit (node:events:381:22)\n at node:net:666:12\n
at TCP.done (node:_tls_wrap:577:7)"
and also for the informer
Error: aborted
at connResetException (node:internal/errors:642:14)
at TLSSocket.socketCloseListener (node:_http_client:424:27)
at TLSSocket.emit (node:events:381:22)
at node:net:666:12
at TCP.done (node:_tls_wrap:577:7) {
code: 'ECONNRESET'
}
node v15.9.0
We are experiencing the same issue after about ~1 hour (v0.14.3). setKeepalive is probably not working due to https://github.com/request/request/issues/3367 not being merged yet. Does anyone have a suitable workaround until we get http2 pinging working?

We noticed a similar issue and verified it was solved by adding the close handler to the request instead of the stream in Watch.watch:
https://github.com/kubernetes-client/javascript/blob/master/src/watch.ts#L120
Alternatively, changing close to end on that line also fixed it, which we tried because of this code in request.js.
As a cause for the request closing, we thought it could be the --min-request-timeout flag from kube-apiserver.
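For readers unfamiliar with the distinction being made, a toy stand-alone illustration with plain Node (not the actual watch.ts code):
// 'end' fires only when the server finishes the body cleanly; 'close' fires
// whenever the stream or request is torn down, cleanly or not.
import * as https from 'https';

const req = https.get('https://example.com/', (res) => {
  res.on('data', () => { /* consume the body */ });
  res.on('end', () => console.log('response stream ended'));
  res.on('close', () => console.log('response stream closed'));
});

// listening on the request object instead, as suggested above
req.on('close', () => console.log('request closed'));
req.on('error', (err) => console.error(err));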
We upgraded node from version 12 to 15 and the issue seems to be gone. Not sure why, but maybe this helps :-)
Interesting find @mariusziemke. Can anyone else confirm upgrading to node 15 works for them as well?
We tried Node 16 and it also works...
We use Node 14.16.0 and have no issue with kubernetes-client 0.14.3 (also using watch, and with no additional timeout logic).
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten