[backend] Metadata writer pod always restarting
Environment
- How did you deploy Kubeflow Pipelines (KFP)? Manifests in k8s
- K8S Version: 1.21
- KFP version: 1.8.1/1.8.2/1.8.3/1.8.4
Steps to reproduce
Hi.
Since release 1.8.1 (we can't be sure about older versions), our metadata-writer pod has been restarting continuously with the following error message:
metadata-writer-78fc7d5bb8-ph9kj 2/2 Running 299 78d
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 697, in _update_chunk_length
self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 438, in _error_catcher
yield
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 764, in read_chunked
self._update_chunk_length()
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 701, in _update_chunk_length
raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/kfp/metadata_writer/metadata_writer.py", line 157, in <module>
for event in pod_stream:
File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
for line in iter_resp_lines(resp):
File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 793, in read_chunked
self._original_response.close()
File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 455, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
We have already tried the most recent 1.8 releases (we did not try version 2.0.0). The pipelines work very well and this has not caused us any problems so far; only this pod is affected.
This happens in multiple clusters with multiple installations, so it doesn't look like an issue with a specific cluster.
Expected result
The pod should stop restarting.
Impacted by this bug? Give it a 👍.
Hello @andre-lx , does this issue happen during start-up time, or does it happen when you are running specific pipeline? If there is more information about how to reproduce this issue, it will help us to investigate the problem.
/assign @chensun
Hey @zijianjoy.
I completely missed this message, sorry.
This happens in all our clusters: as soon as we start Kubeflow Pipelines, the metadata-writer starts restarting with this error.
It still happens today, with k8s 1.24.
I'm not sure I can give you more information, but I have some more logs:
bash-5.1# kubectl logs metadata-writer-76675f9f9-tjr7j -n kubeflow
Connected to the metadata store
Start watching Kubernetes Pods created by Argo
bash-5.1# kubectl logs metadata-writer-76675f9f9-tjr7j -n kubeflow --previous
Connected to the metadata store
Start watching Kubernetes Pods created by Argo
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 697, in _update_chunk_length
self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 438, in _error_catcher
yield
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 764, in read_chunked
self._update_chunk_length()
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 701, in _update_chunk_length
raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/kfp/metadata_writer/metadata_writer.py", line 157, in <module>
for event in pod_stream:
File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
for line in iter_resp_lines(resp):
File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 793, in read_chunked
self._original_response.close()
File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 455, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
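For context on the first exception in the chain: HTTP chunked transfer encoding prefixes each chunk with its size in hexadecimal, and the traceback shows urllib3 calling int(line, 16) on an empty line. Below is a minimal sketch of that parsing step, assuming an empty read means the connection ended before the next chunk size arrived. The names parse_chunk_length and describe are hypothetical and only for illustration; they are not urllib3 functions.

```python
def parse_chunk_length(line: bytes) -> int:
    """Parse the hexadecimal size line that precedes each HTTP chunk."""
    # Anything after ';' is a chunk extension and is not part of the size.
    size_field = line.split(b";", 1)[0]
    return int(size_field, 16)


def describe(line: bytes) -> str:
    """Turn a size line into a readable description (illustrative helper)."""
    try:
        return f"chunk of {parse_chunk_length(line)} bytes"
    except ValueError:
        # int(b"", 16) raises ValueError, matching the first error in the log.
        return "connection closed before a chunk size arrived"
```

Under that assumption, the ValueError itself is only a symptom: the watch connection was closed while the client was still waiting for more data.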
I'm not sure of the reason or the details, but the issue disappeared after I rebooted one of the control-plane nodes.
I am getting the same error:
> kubectl version
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.3
manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: metadata-writer
    application-crd-id: kubeflow-pipelines
  name: metadata-writer
  namespace: kubeflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: metadata-writer
      application-crd-id: kubeflow-pipelines
  template:
    metadata:
      labels:
        app: metadata-writer
        application-crd-id: kubeflow-pipelines
    spec:
      containers:
      - env:
        - name: NAMESPACE_TO_WATCH
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: gcr.io/ml-pipeline/metadata-writer:2.0.5
        name: main
      serviceAccountName: kubeflow-pipelines-metadata-writer
sa - role manifest
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    app: kubeflow-pipelines-metadata-writer-role
    application-crd-id: kubeflow-pipelines
  name: kubeflow-pipelines-metadata-writer-role
  namespace: kubeflow
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - watch
  - update
  - patch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- apiGroups:
  - argoproj.io
  resources:
  - workflows
  verbs:
  - get
  - list
  - watch
  - update
  - patch
pod - log
Connected to the metadata store
Start watching Kubernetes Pods created by Argo
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 761, in _update_chunk_length
self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 444, in _error_catcher
yield
File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 828, in read_chunked
self._update_chunk_length()
File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 765, in _update_chunk_length
raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/kfp/metadata_writer/metadata_writer.py", line 163, in <module>
for event in pod_stream:
File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 144, in stream
for line in iter_resp_lines(resp):
File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 857, in read_chunked
self._original_response.close()
File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 461, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
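In all of the tracebacks above, the ProtocolError escapes the module-level "for event in pod_stream:" loop in metadata_writer.py, so the process exits and the pod restarts. One possible mitigation is to reopen the watch when the stream breaks instead of letting the exception propagate. The following is only a sketch under that assumption and not the project's actual fix; resilient_events, open_stream, retry_on, delay_seconds and max_restarts are hypothetical names.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple, Type


def resilient_events(
    open_stream: Callable[[], Iterable],
    retry_on: Tuple[Type[BaseException], ...],
    delay_seconds: float = 1.0,
    max_restarts: Optional[int] = None,
) -> Iterator:
    """Yield events from open_stream(), reopening it when it breaks.

    Hypothetical helper: open_stream must return a fresh event iterable on
    every call, and retry_on lists the exceptions treated as a broken stream.
    """
    restarts = 0
    while True:
        try:
            for event in open_stream():
                yield event
            # The stream ended without an error; leave reopening to the caller.
            return
        except retry_on as exc:
            if max_restarts is not None and restarts >= max_restarts:
                raise
            restarts += 1
            print(f"Watch stream broke ({exc!r}); reopening (restart {restarts})")
            time.sleep(delay_seconds)


# Possible wiring with the kubernetes client (untested assumption):
#   from kubernetes import client, watch
#   from urllib3.exceptions import ProtocolError
#   v1 = client.CoreV1Api()
#   open_stream = lambda: watch.Watch().stream(
#       v1.list_namespaced_pod, namespace="kubeflow")
#   for event in resilient_events(open_stream, (ProtocolError,)):
#       ...  # handle the event as before
```

Note that a reopened watch which does not pass a resource version may replay events for existing pods, so the event handling would need to tolerate duplicates.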
I am getting the same error with kubeflow 1.18 & K8s 1.27.6. Will this be fixed in the next kubeflow release?
We have a possible solution described in a previous comment. Other than that, we need more information about when it happens, along with the KFP backend, KFP SDK, and k8s versions.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.