pipelines [backend] Metadata writer pod always restarting

Environment

How did you deploy Kubeflow Pipelines (KFP)? Manifests in k8s
K8S Version: 1.21
KFP version: 1.8.1/1.8.2/1.8.3/1.8.4

Steps to reproduce

Hi.

Since release 1.8.1 (we can't be sure about older versions), our metadata-writer pod has been restarting continuously with the following error:

metadata-writer-78fc7d5bb8-ph9kj                         2/2     Running   299        78d

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 697, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 438, in _error_catcher
    yield
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 764, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 701, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/kfp/metadata_writer/metadata_writer.py", line 157, in <module>
    for event in pod_stream:
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 793, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
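
For context, the first exception in the chain can be reproduced without a cluster. In HTTP chunked transfer encoding, each chunk starts with its length in hexadecimal. The traceback shows urllib3 reading an empty line (b'') where it expected that length, which is consistent with the connection being closed mid-stream. The sketch below is a simplified stand-in for that parsing step, not urllib3's actual code; `parse_chunk_length` is a hypothetical helper name.

```python
def parse_chunk_length(line: bytes) -> int:
    """Parse the size line of an HTTP chunk (hex digits, optional ';ext')."""
    # Chunk extensions, if any, follow a semicolon and are ignored here.
    size_field = line.split(b";", 1)[0].strip()
    return int(size_field, 16)


if __name__ == "__main__":
    print(parse_chunk_length(b"1a\r\n"))  # a well-formed size line
    try:
        # An empty read, as in the traceback, cannot be parsed as hex.
        parse_chunk_length(b"")
    except ValueError as exc:
        print(f"ValueError: {exc}")
```

So the error is probably a symptom of the watch connection ending unexpectedly, rather than of a malformed response body.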

We have already tried the most recent 1.8.x releases (we have not tried 2.0.0). The pipelines themselves work well and this has not caused us any problems so far; only this pod is affected.

This happens across multiple clusters and multiple installations, so it does not look like an issue with a specific cluster.

Expected result

The pod should stop restarting.

Impacted by this bug? Give it a 👍.

Aug 26 '22 07:08 andre-lx

Hello @andre-lx , does this issue happen during start-up time, or does it happen when you are running specific pipeline? If there is more information about how to reproduce this issue, it will help us to investigate the problem.

/assign @chensun

Sep 08 '22 22:09 zijianjoy

Hey @zijianjoy .

I completely missed this message, sorry.

This happens on all our clusters: as soon as we start Kubeflow Pipelines, the metadata-writer starts restarting with this error.

It is still happening today, with k8s 1.24.

I'm not sure I can give you more information, but I have some more logs:

bash-5.1# kubectl logs metadata-writer-76675f9f9-tjr7j -n kubeflow
Connected to the metadata store
Start watching Kubernetes Pods created by Argo
bash-5.1# kubectl logs metadata-writer-76675f9f9-tjr7j -n kubeflow --previous
Connected to the metadata store
Start watching Kubernetes Pods created by Argo
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 697, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 438, in _error_catcher
    yield
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 764, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 701, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/kfp/metadata_writer/metadata_writer.py", line 157, in <module>
    for event in pod_stream:
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 793, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
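
Since the process exits on the first ProtocolError raised by the pod watch, one mitigation to consider is reopening the watch in-process instead of letting the container crash. The sketch below is an assumption about how that could look, not the metadata-writer's actual code. `resilient_events` and `open_stream` are hypothetical names, and the demo uses a fake stream so it runs without a cluster. In the real loop, `open_stream` would wrap the `kubernetes.watch.Watch().stream(...)` call and `retry_on` would be `(urllib3.exceptions.ProtocolError,)`.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, Type


def resilient_events(
    open_stream: Callable[[], Iterable],
    retry_on: Tuple[Type[BaseException], ...],
    max_restarts: int = 5,
    backoff_seconds: float = 1.0,
) -> Iterator:
    """Yield events from open_stream(), reopening it when it breaks."""
    restarts = 0
    while True:
        try:
            for event in open_stream():
                restarts = 0  # a delivered event means the stream is healthy
                yield event
            return  # the stream ended normally
        except retry_on:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up and let the caller (or the kubelet) handle it
            time.sleep(backoff_seconds * restarts)  # linear backoff


if __name__ == "__main__":
    attempts = {"count": 0}

    def fake_stream():
        # Stand-in for the pod watch: breaks once, then ends cleanly.
        attempts["count"] += 1
        yield f"event-from-attempt-{attempts['count']}"
        if attempts["count"] == 1:
            raise ConnectionError("Connection broken")

    for ev in resilient_events(fake_stream, (ConnectionError,), backoff_seconds=0):
        print(ev)
```

A reopened watch may replay events that were already seen, so the consumer would need to tolerate duplicates. This only hides the restarts; it does not explain why the connection is being dropped.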

May 16 '23 20:05 andre-lx

I'm not sure of the reason or the details, but the issue disappeared after I rebooted one of the control-plane nodes.

Dec 25 '23 07:12 shaotingcheng

I am getting the same error:

> kubectl version
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.3

manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: metadata-writer
    application-crd-id: kubeflow-pipelines
  name: metadata-writer
  namespace: kubeflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: metadata-writer
      application-crd-id: kubeflow-pipelines
  template:
    metadata:
      labels:
        app: metadata-writer
        application-crd-id: kubeflow-pipelines
    spec:
      containers:
      - env:
        - name: NAMESPACE_TO_WATCH
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: gcr.io/ml-pipeline/metadata-writer:2.0.5
        name: main
      serviceAccountName: kubeflow-pipelines-metadata-writer

sa - role manifest

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    app: kubeflow-pipelines-metadata-writer-role
    application-crd-id: kubeflow-pipelines
  name: kubeflow-pipelines-metadata-writer-role
  namespace: kubeflow
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - watch
  - update
  - patch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- apiGroups:
  - argoproj.io
  resources:
  - workflows
  verbs:
  - get
  - list
  - watch
  - update
  - patch

pod - log

Connected to the metadata store
Start watching Kubernetes Pods created by Argo
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 761, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 444, in _error_catcher
    yield
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 828, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 765, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/kfp/metadata_writer/metadata_writer.py", line 163, in <module>
    for event in pod_stream:
  File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 857, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 461, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

Feb 04 '24 12:02 Rohithzr

I am getting the same error with Kubeflow 1.18 and K8s 1.27.6. Will this be fixed in the next Kubeflow release?

Feb 19 '24 11:02 akash-gautam

We have a possible solution described in a previous comment. Other than that, we need more information about when this happens, along with the KFP backend, KFP SDK, and k8s versions.

Mar 24 '24 22:03 rimolive

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

May 24 '24 07:05 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

Jun 14 '24 07:06 github-actions[bot]