Informer silently stops watching in case of network failure

Open sdsingh07 opened this issue 3 years ago • 0 comments

Describe the bug

I am using an informer to watch Kubernetes events, and my service is deployed on the client side. When a network failure occurs on the client side, the informer dies silently without throwing any exception.

callGeneratorParams.timeoutSeconds is 5 minutes by default. If the network comes back within 5 minutes, the informer keeps working. But if the network is down for more than 5 minutes, the informer dies silently, and even after the network comes back it no longer watches any Kubernetes events (it cannot recover from the network failure).

Client Version 13.0.1

Kubernetes Version 1.21.12-gke.1700

Java Version openjdk version "11.0.15"

To Reproduce

Steps to reproduce the behavior:

Main.java

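import java.io.IOException;
import io.kubernetes.client.informer.SharedInformerFactory;
import io.kubernetes.client.openapi.ApiClient;
import io.kubernetes.client.openapi.ApiException;
import io.kubernetes.client.util.Config;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    private static final Logger logger = LoggerFactory.getLogger(Main.class); // assumes SLF4J
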
    public static void main(String... args) throws IOException, InterruptedException, ApiException {
        ApiClient apiClient = Config.defaultClient();
        SharedInformerFactory factory = new SharedInformerFactory();

        new NodeWatcher(apiClient, factory);

        Thread.sleep(20 * 60 * 1000L);

        logger.info("Done");
    }
}

NodeWatcher.java

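import io.kubernetes.client.informer.ResourceEventHandler;
import io.kubernetes.client.informer.SharedInformerFactory;
import io.kubernetes.client.openapi.ApiClient;
import io.kubernetes.client.openapi.ApiException;
import io.kubernetes.client.openapi.apis.CoreV1Api;
import io.kubernetes.client.openapi.models.V1Node;
import io.kubernetes.client.openapi.models.V1NodeList;
import io.kubernetes.client.util.CallGeneratorParams;
import lombok.SneakyThrows;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class NodeWatcher implements ResourceEventHandler<V1Node> {

  private static final Logger logger = LoggerFactory.getLogger(NodeWatcher.class); // assumes SLF4J

  public final SharedInformerFactory factory;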

  @SneakyThrows
  public NodeWatcher(ApiClient client,  SharedInformerFactory factory) {
    CoreV1Api coreV1Api = new CoreV1Api(client);

    this.factory = factory;

    this.factory.sharedIndexInformerFor(
            (CallGeneratorParams callGeneratorParams)
                -> {
              try {
                return coreV1Api.listNodeCall(null, null, null, null, null, null, callGeneratorParams.resourceVersion,
                    null, callGeneratorParams.timeoutSeconds, callGeneratorParams.watch, null);
              } catch (ApiException e) {
                logger.error("Unknown exception occurred", e);
                throw e;
              }
            },
            V1Node.class, V1NodeList.class)
        .addEventHandler(this);
    this.factory.startAllRegisteredInformers();
  }

  @Override
  public void onAdd(V1Node obj) {
    logger.info("Added: " + obj.getMetadata().getUid() + " "+obj.getMetadata().getResourceVersion());
  }

  @Override
  public void onUpdate(V1Node oldObj, V1Node newObj) {
    logger.info("update to: " + newObj.getMetadata().getUid()+" resourceVersion: "+newObj.getMetadata().getResourceVersion());
  }

  @Override
  public void onDelete(V1Node obj, boolean deletedFinalStateUnknown) {
    logger.info("Deleted: " + obj.getMetadata().getUid());
  }
}

Expected behavior

  • Receive the initial list once, through onAdd.
  • Then receive each subsequent update, with increasing resourceVersion, through the watch call.

Issue

If there is a network failure, the Controller code cannot execute the watch call and exits every time. This results in a list call every second, which in turn increases the heap size. Also, if the network doesn't come back within 5 minutes, the informer silently stops and is unable to recover.

KubeConfig

If applicable, add a KubeConfig file with secrets redacted.

- name: tempName
  user:
    auth-provider:
      config:
        cmd-args: config config-helper --format=json
        cmd-path: /Users/username/tempPath
        expiry-key: '{.credential.token_expiry}'
        token-key: '{.credential.access_token}'
      name: gcp

Server (please complete the following information):

  • OS: [e.g. Linux]
  • Environment [e.g. container]
  • Cloud: GCP

Additional context

If I create the informer and watcher again after a network failure, it works fine. But it seems that the previous informers and watchers are still in memory, and hence the heap size increases after every network failure. Is there any way I can stop the watchers?
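One way to keep old informers from piling up when recreating them is to stop the previous set before building its replacement. The sketch below is not the client's API: `InformerHolder` and `Handle` are hypothetical names, and `Handle` stands in for whatever owns the informers (for example a `SharedInformerFactory`). It only shows the stop-then-replace pattern.

```java
import java.util.function.Supplier;

/**
 * Holds at most one live set of informers at a time.
 * InformerHolder and Handle are hypothetical names used for illustration.
 */
final class InformerHolder {

  /** Hypothetical abstraction over "something that owns informers and can be stopped". */
  interface Handle {
    void stop();
  }

  private final Supplier<Handle> creator;
  private Handle current; // guarded by "this"

  InformerHolder(Supplier<Handle> creator) {
    this.creator = creator;
  }

  /** Stops the previous handle, if any, and only then creates its replacement. */
  synchronized Handle restart() {
    if (current != null) {
      current.stop();
    }
    current = creator.get();
    return current;
  }

  /** Stops the current handle, if any. Safe to call more than once. */
  synchronized void shutdown() {
    if (current != null) {
      current.stop();
      current = null;
    }
  }
}
```

With the Java client, `stop()` could delegate to `factory.stopAllRegisteredInformers()`, assuming that method is available in the client version in use. Whether this releases everything the dead watchers hold is exactly the open question in this issue, so it is worth confirming with a heap dump.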

Another problem is that I don't know when to create the informer and watcher again. As long as the watcher is dead, I am unable to receive Kubernetes events.
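Since no exception surfaces in this scenario, one possible way to decide when to recreate the informer is a staleness watchdog: record the time of the last handler callback and treat a long silence as a sign that the watch may be dead. This is a minimal sketch, not part of the client library. `StalenessWatchdog` is a hypothetical name, and the clock is injected so the logic can be tested without waiting.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Tracks how long it has been since the last informer callback.
 * StalenessWatchdog is a hypothetical helper, not a client class.
 */
final class StalenessWatchdog {

  private final long maxSilenceNanos;
  private final LongSupplier nanoClock;
  private final AtomicLong lastEventNanos;

  StalenessWatchdog(Duration maxSilence, LongSupplier nanoClock) {
    this.maxSilenceNanos = maxSilence.toNanos();
    this.nanoClock = nanoClock;
    // Start the timer at construction so a fresh watchdog is not stale.
    this.lastEventNanos = new AtomicLong(nanoClock.getAsLong());
  }

  /** Call from onAdd, onUpdate and onDelete. */
  void touch() {
    lastEventNanos.set(nanoClock.getAsLong());
  }

  /** True when strictly more than maxSilence has passed since the last touch. */
  boolean isStale() {
    return nanoClock.getAsLong() - lastEventNanos.get() > maxSilenceNanos;
  }
}
```

In production the watchdog could be built as `new StalenessWatchdog(Duration.ofMinutes(10), System::nanoTime)` and polled from a `ScheduledExecutorService`, with the informers recreated whenever `isStale()` returns true. The 10 minute threshold is an assumption. A quiet cluster with no node changes also looks stale, so the threshold should sit above the longest expected gap between events, and a false positive should cost nothing more than a harmless restart.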

sdsingh07, Aug 02 '22 06:08