Watch stream missing job completion events
What happened (please include outputs or screenshots):
Sometimes the watch stream seems to miss job completion events. This is not easy to reproduce, as two executions of the same code in a row can produce different results.
Here is the code, which watches the job status and prints the pod logs:
```python
import logging
import sys

from kubernetes import client, config, watch

# Setup shown for completeness; the script runs inside the cluster.
config.load_incluster_config()
batchV1 = client.BatchV1Api()
coreV1 = client.CoreV1Api()

# namespace and jobName are set earlier in the script
w = watch.Watch()
for event in w.stream(func=batchV1.list_namespaced_job, namespace=namespace, timeout_seconds=0):
    if event['object'].metadata.name == jobName:
        logging.info(event)
        if event['type'] == "ADDED":
            logging.info("job %s created, waiting for pod to be running...", jobName)
        if event["object"].status.ready:
            pods = coreV1.list_namespaced_pod(namespace=namespace, label_selector="job-name={}".format(jobName))
            logging.info("pod %s is ready", pods.items[0].metadata.name)
            # Follow the pod logs until the log stream ends
            for line in coreV1.read_namespaced_pod_log(name=pods.items[0].metadata.name, namespace=namespace, follow=True, _preload_content=False).stream():
                print(line.decode(), end='')
        if event["object"].status.succeeded:
            logging.info("Finished pod stream.")
            w.stop()
        if not event["object"].status.active and event["object"].status.failed:
            w.stop()
            logging.error("Job Failed")
            sys.exit(1)
```
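Side note: the watch could likely be narrowed server-side with a field selector instead of filtering on the job name inside the loop; a sketch (not the code that exhibits the bug):
```python
# Sketch: let the API server filter so only this job's events arrive.
for event in w.stream(func=batchV1.list_namespaced_job,
                      namespace=namespace,
                      field_selector="metadata.name={}".format(jobName),
                      timeout_seconds=0):
    ...
```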
Sometimes the script never ends, even when the watched job is completed. The script itself is executed in the same Kubernetes cluster but in a different namespace. I tried multiple values for `timeout_seconds`, but it doesn't help; the last event received is the one where the job becomes active:
```
[INFO] {'type': 'ADDED', 'object': {'api_version': 'batch/v1', [...] 'job-name': 'my-job-1716468085', [...] 'status': {'active': None, [...], 'ready': None, 'start_time': None [...]
[INFO] job my-job-1716468085 created, waiting for pod to be running...
[INFO] {'type': 'MODIFIED', 'object': {'api_version': 'batch/v1', [...] 'job-name': 'my-job-1716468085', [...] 'status': {'active': 1, [...], 'ready': 0, 'start_time': datetime.datetime(2024, 5, 23, 12, 41, 25, tzinfo=tzlocal()), [...]
```
The events are correctly updated on the Kubernetes side, as shown in k9s:
```
Events:
  Type    Reason            Age  From            Message
  ----    ------            ---- ----            -------
  Normal  SuccessfulCreate  35m  job-controller  Created pod: my-job-1716468085-9dq8d
  Normal  Completed         32m  job-controller  Job completed
```
What you expected to happen:
The job completion event should be caught and sent.
How to reproduce it (as minimally and precisely as possible):
Just run the above code in the python:3.12-slim Docker image. As said above, the problem seems to be sporadic. I wasn't able to reproduce it another way yet, but I will update this ticket if I do.
Anything else we need to know?:
Environment:
- Kubernetes version (`kubectl version`): v1.29 (EKS)
- OS (e.g., MacOS 10.13.6): N/A
- Python version (`python --version`): python:3.12-slim official Docker image: https://hub.docker.com/_/python
- Python client version (`pip list | grep kubernetes`): 29.0.0
Thanks for reporting the issue; please update the ticket when you can reproduce it reliably.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
I realize this isn't a very helpful comment, but I've been encountering the same issue. It's exactly as described earlier: the events are being reported on the Kubernetes side, but the watch seems to be stuck. :/
It mostly happens with long-running jobs (though is 20 minutes really that long?). The issue is quite sporadic and difficult to reproduce consistently...
I'm starting to think that resetting the watcher every 5 minutes or so might be the way to go. I believe the events will still be captured, even if they're sent during the restart. It's not the ideal solution, but if anyone has a better suggestion, I'm open to it.
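For what it's worth, a minimal sketch of that periodic-reset idea (untested; it assumes the same `batchV1`, `namespace`, and `jobName` as the snippet above, and relies on `timeout_seconds` ending each `stream()` call so the watch gets re-established):
```python
from kubernetes import watch

done = False
while not done:
    w = watch.Watch()
    # stream() returns after ~5 minutes, so the watch is re-established
    # even if the connection has gone silent.
    for event in w.stream(func=batchV1.list_namespaced_job,
                          namespace=namespace,
                          timeout_seconds=300):
        obj = event["object"]
        if obj.metadata.name != jobName:
            continue
        if obj.status.succeeded or obj.status.failed:
            done = True
            w.stop()
```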
@headyj did you get past that problem? Thanks!
Might be related to https://github.com/kubernetes-client/python/issues/869. I had to re-establish the watch each time I caught:
```
"Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
```
With that, I don't have any hanging watchers anymore.
@leseb we are currently using a workaround by adding a retry:
```python
w = watch.Watch()
timedOut = True
for i in range(maxRetry):
    for event in w.stream([...]):  # same arguments as before, plus timeout_seconds=timeoutSeconds
        [...]
    if not timedOut:
        break
if timedOut:
    logging.error("Watch stream timed out after %s seconds", maxRetry * timeoutSeconds)
    sys.exit(1)
```
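Roughly, the full shape, with the event handling filled in along the lines of the snippet from the issue description (`maxRetry` and `timeoutSeconds` are config values on our side):
```python
timedOut = True
for i in range(maxRetry):
    w = watch.Watch()
    for event in w.stream(func=batchV1.list_namespaced_job,
                          namespace=namespace,
                          timeout_seconds=timeoutSeconds):
        obj = event["object"]
        if obj.metadata.name != jobName:
            continue
        if obj.status.succeeded or (not obj.status.active and obj.status.failed):
            timedOut = False  # terminal state reached before the timeout
            w.stop()
    if not timedOut:
        break
if timedOut:
    logging.error("Watch stream timed out after %s seconds", maxRetry * timeoutSeconds)
    sys.exit(1)
```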
@headyj thanks for your response, here's what I do:
```python
exit_flag = False
while not exit_flag:
    try:
        # stream() func and remaining arguments elided
        for event in w.stream(timeout_seconds=60):
            [...]
            if success:
                exit_flag = True
    # Catches the following error:
    # urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength
    except urllib3.exceptions.ProtocolError as e:
        logger.warning("Connection broken, reconnecting the watcher %s", str(e))
        time.sleep(5)  # Backoff before retrying
    finally:
        w.stop()
```
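One caveat with reconnect loops like this (my understanding of the watch semantics, worth double-checking): a fresh watch started without `resource_version` begins from the current state, so transitions that happened while the connection was down are not replayed, and a `resource_version` that has expired is rejected with 410 Gone. If the in-between events matter, the last seen resourceVersion can be carried across reconnects, roughly:
```python
import time

import urllib3
from kubernetes import watch

# Resume sketch; batchV1, namespace, and the event handling (which sets
# exit_flag) are the same as in the snippets above.
w = watch.Watch()
last_rv = None
exit_flag = False
while not exit_flag:
    try:
        kwargs = dict(namespace=namespace, timeout_seconds=60)
        if last_rv is not None:
            kwargs["resource_version"] = last_rv  # resume after the last event seen
        for event in w.stream(batchV1.list_namespaced_job, **kwargs):
            last_rv = event["object"].metadata.resource_version
            # ... same event handling as above ...
    except urllib3.exceptions.ProtocolError:
        time.sleep(5)  # back off, then re-establish the watch
    # On a 410 Gone (ApiException), clear last_rv to fall back to a fresh watch.
```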
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.