ECONNRESET error from Kubernetes watch after a few minutes.

jimjaeger opened this issue 2 years ago · 7 comments

**Describe the bug** If I use the Kubernetes watch to listen for resource changes, I get an ECONNRESET error and the watch stops. Is there any chance that the watch could handle underlying connection errors and restart on its own?

**Client Version** 0.20.0

**To Reproduce** Steps to reproduce the behavior:

  1. Start a watch and wait longer than the `setTimeout` or `setKeepAlive` setting in the Watch config.

**Expected behavior** A watch runs without connection issues.

**Example Code**

```typescript
import { KubeConfig, V1Pod, Watch } from '@kubernetes/client-node';

// `Context` is an application type from the surrounding project (not shown here).
function waitForPodCompletion(log: Context['log'], k8sConfig: KubeConfig, podNamespace: string, resourceVersion?: string, jobName?: string): Promise<V1Pod> {
  let lastResourceVersion = resourceVersion;
  return new Promise<V1Pod>((resolve, reject) => {
    const watch = new Watch(k8sConfig);
    const queryParams: { labelSelector: string, resourceVersion?: string } = { labelSelector: `job-name=${jobName}` };
    if (resourceVersion) {
      queryParams.resourceVersion = resourceVersion;
    }

    watch.watch(`/api/v1/namespaces/${podNamespace}/pods`, queryParams, (eventType, pod: V1Pod) => {
      // Track the latest resourceVersion so a restarted watch can resume from it.
      lastResourceVersion = pod.metadata?.resourceVersion;
      if (eventType === 'ADDED' && pod.metadata?.name) {
        log.info(`Job pod ${pod.metadata.name} ${pod.metadata?.resourceVersion} added.`);
      }
      if (eventType === 'MODIFIED' && pod.metadata?.name) {
        log.info(`Job pod ${pod.metadata.name} status: ${pod.status?.phase}, resourceVersion: ${pod.metadata?.resourceVersion}.`);
        if (pod.status?.phase === 'Succeeded') {
          resolve(pod);
        } else if (pod.status?.phase === 'Failed') {
          reject(new Error(`Job failed. Pod ${pod.metadata.name} status: ${pod.status.phase} startTime: ${pod.status.startTime}.`));
        }
      }
    }, (error: { code: string, message: string, stack: string }) => {
      // Strange: this "done" callback is also invoked with null shortly after the ECONNRESET.
      if (error) {
        reject(error);
      }
    });
  }).catch((reason) => {
    // On a connection reset, restart the watch from the last seen resourceVersion.
    if (reason && reason.code === 'ECONNRESET') {
      log.info(`Restart Watch with ${lastResourceVersion}.`);
      return waitForPodCompletion(log, k8sConfig, podNamespace, lastResourceVersion, jobName);
    } else {
      throw reason;
    }
  });
}
```

**Environment (please complete the following information):**

  • OS: Windows
  • NodeJS Version: v20.10.0
  • Cloud runtime: Red Hat OpenShift

jimjaeger · Jan 02 '24 10:01

A watch is tied to a single TCP stream, so when that stream breaks you need to start a new watch (and you also need to re-list, in case you missed something).

The informer class encapsulates this logic and is probably what you are looking for: https://github.com/kubernetes-client/javascript/blob/master/src/informer.ts

(fwiw, wrt the "informer" name, I think it's confusing, but it got established as the standard name within the go client library, so we use it here too for consistency.)
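
For reference, here is a minimal sketch of the informer pattern, assuming the 0.20.x API of @kubernetes/client-node; the namespace and the 5-second restart delay are illustrative, not prescribed:

```typescript
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();

const coreApi = kc.makeApiClient(k8s.CoreV1Api);

// The informer re-lists on every start, so events missed while
// the connection was down are reconciled automatically.
const listFn = () => coreApi.listNamespacedPod('default');
const informer = k8s.makeInformer(kc, '/api/v1/namespaces/default/pods', listFn);

informer.on('add', (pod: k8s.V1Pod) => console.log(`added: ${pod.metadata?.name}`));
informer.on('update', (pod: k8s.V1Pod) => console.log(`updated: ${pod.metadata?.name}`));
informer.on('delete', (pod: k8s.V1Pod) => console.log(`deleted: ${pod.metadata?.name}`));
informer.on('error', (err: k8s.V1Pod) => {
  console.error(err);
  // When the underlying watch breaks, restart the informer after a delay.
  setTimeout(() => informer.start(), 5000);
});

informer.start();
```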

brendandburns · Jan 02 '24 17:01

Thanks for the information, but the informer class has the same problem: it also throws the underlying connection errors.

jimjaeger · Jan 02 '24 17:01

Same issue here with the informer. I tried the workaround of periodically restarting the informer, as suggested in https://github.com/kubernetes-client/javascript/issues/596. Nonetheless, that hit a new issue (see https://github.com/kubernetes-client/javascript/issues/1598).
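
A rough sketch of that periodic-restart workaround, assuming an `informer` constructed as in the earlier example (the interval length is arbitrary; `start()` and `stop()` both return Promises in 0.20.x):

```typescript
// Bounce the informer on a timer so a silently dead watch never
// outlives the interval. Stopping before starting avoids stacking
// duplicate watch connections.
const RESTART_INTERVAL_MS = 5 * 60 * 1000;

setInterval(async () => {
  await informer.stop();
  await informer.start();
}, RESTART_INTERVAL_MS);
```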

jobcespedes · Mar 06 '24 04:03

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jun 04 '24 04:06

/remove-lifecycle stale

jimjaeger · Jun 04 '24 16:06

/lifecycle stale

k8s-triage-robot · Sep 02 '24 16:09

/remove-lifecycle stale

jimjaeger · Sep 06 '24 16:09

/lifecycle stale

k8s-triage-robot · Dec 05 '24 17:12

/lifecycle rotten

k8s-triage-robot · Jan 04 '25 18:01

/remove-lifecycle rotten

jimjaeger · Jan 04 '25 21:01

/lifecycle stale

k8s-triage-robot · Apr 04 '25 22:04

/lifecycle rotten

k8s-triage-robot · May 04 '25 22:05

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot · Jun 03 '25 23:06

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · Jun 03 '25 23:06