Informer silently stops watching in case of network failure
Describe the bug I am using an informer to watch k8s events, with my service deployed on the client side. When a network failure happens on the client side, the informer dies silently without throwing any exception.
The callGeneratorParams.timeoutSeconds is 5 minutes by default. If the network comes back within 5 minutes, everything works fine. But if the network is out for more than 5 minutes, the informer dies silently, and even once the network comes back it no longer watches any k8s events (it is unable to recover from the network failure).
Client Version
13.0.1
Kubernetes Version
1.21.12-gke.1700
Java Version openjdk version "11.0.15"
To Reproduce Steps to reproduce the behavior:
Main.java
import io.kubernetes.client.informer.SharedInformerFactory;
import io.kubernetes.client.openapi.ApiClient;
import io.kubernetes.client.openapi.ApiException;
import io.kubernetes.client.util.Config;
import java.io.IOException;
import lombok.extern.slf4j.Slf4j;

@Slf4j
public class Main {
  public static void main(String... args) throws IOException, InterruptedException, ApiException {
    ApiClient apiClient = Config.defaultClient();
    SharedInformerFactory factory = new SharedInformerFactory();
    new NodeWatcher(apiClient, factory);
    Thread.sleep(20 * 60 * 1000L); // keep the process alive for 20 minutes
    log.info("Done");
  }
}
NodeWatcher.java
import io.kubernetes.client.informer.ResourceEventHandler;
import io.kubernetes.client.informer.SharedInformerFactory;
import io.kubernetes.client.openapi.ApiClient;
import io.kubernetes.client.openapi.ApiException;
import io.kubernetes.client.openapi.apis.CoreV1Api;
import io.kubernetes.client.openapi.models.V1Node;
import io.kubernetes.client.openapi.models.V1NodeList;
import io.kubernetes.client.util.CallGeneratorParams;
import lombok.SneakyThrows;
import lombok.extern.slf4j.Slf4j;

@Slf4j
public class NodeWatcher implements ResourceEventHandler<V1Node> {
  public final SharedInformerFactory factory;

  @SneakyThrows
  public NodeWatcher(ApiClient client, SharedInformerFactory factory) {
    CoreV1Api coreV1Api = new CoreV1Api(client);
    this.factory = factory;
    this.factory
        .sharedIndexInformerFor(
            (CallGeneratorParams callGeneratorParams) -> {
              try {
                return coreV1Api.listNodeCall(null, null, null, null, null, null,
                    callGeneratorParams.resourceVersion, null,
                    callGeneratorParams.timeoutSeconds, callGeneratorParams.watch, null);
              } catch (ApiException e) {
                log.error("Unknown exception occurred", e);
                throw e;
              }
            },
            V1Node.class, V1NodeList.class)
        .addEventHandler(this);
    this.factory.startAllRegisteredInformers();
  }

  @Override
  public void onAdd(V1Node obj) {
    log.info("Added: " + obj.getMetadata().getUid() + " " + obj.getMetadata().getResourceVersion());
  }

  @Override
  public void onUpdate(V1Node oldObj, V1Node newObj) {
    log.info("Updated: " + newObj.getMetadata().getUid() + " resourceVersion: "
        + newObj.getMetadata().getResourceVersion());
  }

  @Override
  public void onDelete(V1Node obj, boolean deletedFinalStateUnknown) {
    log.info("Deleted: " + obj.getMetadata().getUid());
  }
}
Expected behavior
- Receive the initial list once, via onAdd.
- Then receive each subsequent change, with an increasing resourceVersion, via the watch call.
Issue But if there is a network failure, the controller code cannot execute the watch call and exits every time, resulting in a list call every 1 second, which grows the heap. And if the network does not come back within 5 minutes, the informer silently stops and is unable to recover.
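One way to at least detect the silent stop from the outside is to poll the informer's lastSyncResourceVersion() (a real method on the official client's SharedIndexInformer) and flag it as dead if it stops advancing while the cluster is known to be active. A minimal sketch; InformerLivenessMonitor is a hypothetical helper of my own, not part of the kubernetes java client:

```java
import java.util.Objects;
import java.util.function.Supplier;

// Sketch of a liveness probe for an informer: if the resource version
// reported by the informer stops changing across several polls, the
// watch has probably died silently. The supplier would typically be
// informer::lastSyncResourceVersion when wired to a real informer.
class InformerLivenessMonitor {
  private final Supplier<String> resourceVersionSupplier;
  private String lastSeen;
  private int stalePolls;

  InformerLivenessMonitor(Supplier<String> resourceVersionSupplier) {
    this.resourceVersionSupplier = resourceVersionSupplier;
  }

  // Call periodically; returns the number of consecutive polls
  // during which the resource version made no progress.
  int poll() {
    String current = resourceVersionSupplier.get();
    if (Objects.equals(current, lastSeen)) {
      stalePolls++;
    } else {
      stalePolls = 0;
      lastSeen = current;
    }
    return stalePolls;
  }

  // True once the informer has been stale for at least `threshold` polls.
  boolean looksDead(int threshold) {
    return stalePolls >= threshold;
  }
}
```

This only detects the problem (so you know when to recreate the informer); it does not fix the underlying silent exit.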
KubeConfig If applicable, add a KubeConfig file with secrets redacted.
- name: tempName
  user:
    auth-provider:
      config:
        cmd-args: config config-helper --format=json
        cmd-path: /Users/username/tempPath
        expiry-key: '{.credential.token_expiry}'
        token-key: '{.credential.access_token}'
      name: gcp
Server (please complete the following information):
- OS: [e.g. Linux]
- Environment [e.g. container]
- Cloud: GCP
Additional context If I create the informer and watcher again after a network failure, it works fine. But it seems the previous informers and watchers stay in memory, so the heap size grows after every network failure. Is there any way I can stop the old watchers?
Another problem is that I don't know when to create the informer and watcher again, so as long as the watcher is dead I am unable to receive k8s events.
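On stopping the old watchers: SharedInformerFactory does expose stopAllRegisteredInformers(), which should shut down the old watch threads before a new factory is built (this addresses the heap-growth side). To avoid hammering the API server with 1-second relists while the network is still down, the restarts could be spaced with exponential backoff. A library-free sketch of that backoff schedule, assuming base/cap values of my own choosing; RestartPolicy is an illustration, not a client API:

```java
// Sketch: exponential backoff between informer restart attempts.
// Intended use (hedged, based on the reporter's own setup):
//   oldFactory.stopAllRegisteredInformers();   // real SharedInformerFactory method
//   Thread.sleep(policy.delayMillis(attempt));
//   new NodeWatcher(apiClient, new SharedInformerFactory());
class RestartPolicy {
  private final long baseDelayMillis;
  private final long maxDelayMillis;

  RestartPolicy(long baseDelayMillis, long maxDelayMillis) {
    this.baseDelayMillis = baseDelayMillis;
    this.maxDelayMillis = maxDelayMillis;
  }

  // Delay before the attempt-th restart (attempt starts at 0):
  // base * 2^attempt, capped at maxDelayMillis. The shift is clamped
  // to 30 so the doubling cannot overflow a long for realistic bases.
  long delayMillis(int attempt) {
    long delay = baseDelayMillis << Math.min(attempt, 30);
    return Math.min(delay, maxDelayMillis);
  }
}
```

With a base of 1 s and a cap of 30 s this yields 1 s, 2 s, 4 s, ... up to 30 s between rebuilds, instead of a fixed 1-second relist storm.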