Watch stream missing job completion events

Open headyj opened this issue 1 year ago • 7 comments

What happened (please include outputs or screenshots):

Sometimes the watch stream seems to be missing job completion events. This is not easy to reproduce, as two executions of the same code in a row may give different results.

Here is the code, which watches a job's status and prints the pod logs:

import logging
import sys

from kubernetes import client, config, watch

# Client setup; the script runs inside the cluster. namespace and jobName
# are defined elsewhere in the real script.
config.load_incluster_config()
batchV1 = client.BatchV1Api()
coreV1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(func=batchV1.list_namespaced_job, namespace=namespace, timeout_seconds=0):
    if event['object'].metadata.name == jobName:
        logging.info(event)
        if event['type'] == "ADDED":
            logging.info("job %s created, waiting for pod to be running...", jobName)
        if event['object'].status.ready:
            pods = coreV1.list_namespaced_pod(namespace=namespace, label_selector="job-name={}".format(jobName))
            logging.info("pod %s is ready", pods.items[0].metadata.name)
            # Follow the pod logs until the container exits.
            for line in coreV1.read_namespaced_pod_log(name=pods.items[0].metadata.name, namespace=namespace, follow=True, _preload_content=False).stream():
                print(line.decode(), end='')
        if event['object'].status.succeeded:
            logging.info("Finished pod stream.")
            w.stop()
        if not event['object'].status.active and event['object'].status.failed:
            w.stop()
            logging.error("Job Failed")
            sys.exit(1)

Sometimes the script never ends, even when the watched job has completed. The script itself is executed in the same Kubernetes cluster, but in a different namespace. I tried multiple values for timeout_seconds, but it doesn't help; the last event received is the one where the job becomes active:

[INFO] {'type': 'ADDED', 'object': {'api_version': 'batch/v1', [...] 'job-name': 'my-job-1716468085', [...] 'status': {'active': None, [...], 'ready': None, 'start_time': None [...]
[INFO] job my-job-1716468085 created, waiting for pod to be running...
[INFO] {'type': 'MODIFIED', 'object': {'api_version': 'batch/v1', [...] 'job-name': 'my-job-1716468085', [...] 'status': {'active': 1, [...], 'ready': 0, 'start_time': datetime.datetime(2024, 5, 23, 12, 41, 25, tzinfo=tzlocal()), [...]

The events are correctly recorded on the Kubernetes side, as seen in k9s:

Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  35m   job-controller  Created pod: my-job-1716468085-9dq8d
  Normal  Completed         32m   job-controller  Job completed

What you expected to happen:

The job completion event should be caught and delivered by the watch stream.

How to reproduce it (as minimally and precisely as possible):

Just run the above code in the python:3.12-slim Docker image. As said above, the problem seems to be sporadic. I haven't been able to reproduce it any other way yet, but I will update this ticket if I do.

Anything else we need to know?:

Environment:

  • Kubernetes version (kubectl version): v1.29 (EKS)
  • OS (e.g., MacOS 10.13.6): N/A
  • Python version (python --version): python:3.12-slim official Docker image: https://hub.docker.com/_/python
  • Python client version (pip list | grep kubernetes): 29.0.0

headyj avatar May 23 '24 13:05 headyj

Thanks for reporting the issue. Please update the ticket when you can reproduce it reliably.

yliaog avatar Jun 05 '24 20:06 yliaog

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 03 '24 20:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Oct 03 '24 21:10 k8s-triage-robot

I realize this isn't a very helpful comment, but I've been encountering the same issue. It's exactly as described earlier: the events are being reported on the Kubernetes side, but the watch seems to be stuck. :/

It mostly happens with long-running jobs (though is 20 minutes really that long?). The issue is quite sporadic and difficult to reproduce consistently...

I'm starting to think that resetting the watcher every 5 minutes or so might be the way to go. I believe the events will still be captured, even if they're sent during the restart. It's not the ideal solution, but if anyone has a better suggestion, I'm open to it.
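For what it's worth, here is a minimal sketch of that periodic-restart idea, assuming the batchV1, namespace, and jobName names from the original snippet: with timeout_seconds set, the API server closes the stream after roughly that interval, and the outer loop re-establishes the watch. Because each stream() call starts with a fresh list, a completion that happens during the restart should still show up in the next pass.

from kubernetes import watch

# Sketch only: batchV1, namespace and jobName are assumed from the original snippet.
job_done = False
while not job_done:
    w = watch.Watch()
    # The server ends the stream after ~300s; the while loop then reconnects.
    for event in w.stream(func=batchV1.list_namespaced_job, namespace=namespace, timeout_seconds=300):
        if event['object'].metadata.name != jobName:
            continue
        status = event['object'].status
        # Stop on either terminal state of the job.
        if status.succeeded or (not status.active and status.failed):
            job_done = True
            w.stop()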

@headyj did you get past that problem? Thanks!

leseb avatar Oct 16 '24 08:10 leseb

Might be related to https://github.com/kubernetes-client/python/issues/869. I had to re-establish the watch each time I caught:

"Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

With that, I don't have any hanging watchers anymore.

leseb avatar Oct 16 '24 12:10 leseb

@leseb we are currently using a workaround by adding a retry:

w = watch.Watch()
timedOut = True
for i in range(maxRetry):
    for event in w.stream(func=batchV1.list_namespaced_job, namespace=namespace, timeout_seconds=timeoutSeconds):
        [...]  # event handling as in the original snippet; set timedOut = False on completion
    if not timedOut:
        break
if timedOut:
    logging.error("Watch stream timed out after %s seconds", maxRetry * timeoutSeconds)
    sys.exit(1)
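
For reference, a runnable sketch of that wrapper with the [...] filled in using the event handling from the first snippet (the maxRetry and timeoutSeconds values here are illustrative):

import logging
import sys

from kubernetes import watch

maxRetry = 5
timeoutSeconds = 300

w = watch.Watch()
timedOut = True
for i in range(maxRetry):
    for event in w.stream(func=batchV1.list_namespaced_job, namespace=namespace, timeout_seconds=timeoutSeconds):
        if event['object'].metadata.name != jobName:
            continue
        if event['object'].status.succeeded:
            logging.info("Finished pod stream.")
            timedOut = False
            w.stop()
        if not event['object'].status.active and event['object'].status.failed:
            logging.error("Job Failed")
            w.stop()
            sys.exit(1)
    # If no terminal event arrived before the stream timed out, retry.
    if not timedOut:
        break
if timedOut:
    logging.error("Watch stream timed out after %s seconds", maxRetry * timeoutSeconds)
    sys.exit(1)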

headyj avatar Oct 18 '24 08:10 headyj

@headyj thanks for your response, here's what I do:

import logging
import time

import urllib3
from kubernetes import watch

logger = logging.getLogger(__name__)

exit_flag = False
w = watch.Watch()
while not exit_flag:
    try:
        for event in w.stream(func=batchV1.list_namespaced_job, namespace=namespace, timeout_seconds=60):
            [...]  # event handling; set success when the job completes
            if success:
                exit_flag = True

    # Catches the following error:
    # urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength
    except urllib3.exceptions.ProtocolError as e:
        logger.warning("Connection broken, reconnecting the watcher: %s", str(e))
        time.sleep(5)  # Backoff before retrying
    finally:
        w.stop()

leseb avatar Oct 18 '24 09:10 leseb

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Nov 17 '24 09:11 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Nov 17 '24 09:11 k8s-ci-robot