argo-workflows
Controller crashes when there are a lot of completed workflows
The argo controller seems to be crashing when there are a lot of completed workflows (~6000). I know I could use a workflowTTL and archiving to not keep around a lot of completed workflows, but 6k doesn't seem like a crazy number. Is this a known issue?
Argo v3.2.4
Here are the controller logs:
time="2022-03-29T22:13:33.173Z" level=info msg="List workflows 200"
Trace[518776447]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167 (29-Mar-2022 22:13:01.766) (total time: 60001ms):
Trace[518776447]: [1m0.00142874s] [1m0.00142874s] END
pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 33; INTERNAL_ERROR
time="2022-03-29T22:14:28.974Z" level=info msg="List workflows 200"
time="2022-03-29T22:14:29.045Z" level=info msg=healthz age=5m0s err="workflow never reconciled: cluster-alignment-template-2chwh" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=
time="2022-03-29T22:14:35.537Z" level=info msg="List workflows 200"
Trace[1414262326]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167 (29-Mar-2022 22:14:03.963) (total time: 60001ms):
Trace[1414262326]: [1m0.001385813s] [1m0.001385813s] END
pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed
We have many clusters with > 5k workflows that run happily. You probably need to grow your cluster.
I'm not sure this has to do with the cluster size. We run a pretty big GKE cluster with around 2k nodes. Each job spawns 100+ pods. Could this be related to https://github.com/kubernetes/client-go/issues/871?
Can you think of any experiments I could run (e.g. running the controller in debug mode) or logs I could provide to get more clarity into the root cause?
Can you please attach the output of kubectl describe pod -l app=workflow-controller so we can see what the state was for the crash? Your logs only show the health check failure.
/area controller
Name: argo-operator-argo-workflows-workflow-controller-56f8496bbrrstr
Namespace: default
Priority: 0
Node: gke-argo-dev-1-facility-pool-f1b846cb-gn8c/10.16.0.10
Start Time: Wed, 16 Mar 2022 13:41:02 -0700
Labels: app.kubernetes.io/component=workflow-controller
app.kubernetes.io/instance=argo-operator
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=argo-workflows-workflow-controller
app.kubernetes.io/part-of=argo-workflows
app.kubernetes.io/version=v3.2.4
helm.sh/chart=argo-workflows-0.9.2
pod-template-hash=56f8496bb8
Annotations: <none>
Status: Running
IP: 10.32.2.3
IPs:
IP: 10.32.2.3
Controlled By: ReplicaSet/argo-operator-argo-workflows-workflow-controller-56f8496bb8
Containers:
controller:
Container ID: docker://fa09738aaedb866145acd9bcb4b5748a0a98f22a3538e1148ffacf6f32fd8ea0
Image: quay.io/argoproj/workflow-controller:v3.2.4
Image ID: docker-pullable://quay.io/argoproj/workflow-controller@sha256:7d72a6b04f484c7de1fc4a63a418bec2efda822832016439ddf91cb5f8cb3734
Ports: 9090/TCP, 6060/TCP
Host Ports: 0/TCP, 0/TCP
Command:
workflow-controller
Args:
--configmap
argo-operator-argo-workflows-workflow-controller-configmap
--executor-image
quay.io/argoproj/argoexec:v3.2.4
--loglevel
info
--gloglevel
0
--workflow-workers
128
--pod-workers
128
State: Running
Started: Tue, 29 Mar 2022 15:21:36 -0700
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Tue, 29 Mar 2022 15:11:31 -0700
Finished: Tue, 29 Mar 2022 15:16:27 -0700
Ready: True
Restart Count: 112
Limits:
cpu: 30
memory: 96Gi
Requests:
cpu: 30
memory: 96Gi
Liveness: http-get http://:6060/healthz delay=90s timeout=30s period=60s #success=1 #failure=3
Environment:
ARGO_NAMESPACE: default (v1:metadata.namespace)
LEADER_ELECTION_IDENTITY: argo-operator-argo-workflows-workflow-controller-56f8496bbrrstr (v1:metadata.name)
DEFAULT_REQUEUE_TIME: 1m
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-w6vhv (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-w6vhv:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: cloud.google.com/gke-nodepool=facility-pool
kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Name: argo-operator-argo-workflows-workflow-controller-56f8496bbxl6hp
Namespace: default
Priority: 0
Node: gke-argo-dev-1-facility-pool-f1b846cb-jvo0/10.16.0.9
Start Time: Wed, 16 Mar 2022 13:37:56 -0700
Labels: app.kubernetes.io/component=workflow-controller
app.kubernetes.io/instance=argo-operator
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=argo-workflows-workflow-controller
app.kubernetes.io/part-of=argo-workflows
app.kubernetes.io/version=v3.2.4
helm.sh/chart=argo-workflows-0.9.2
pod-template-hash=56f8496bb8
Annotations: <none>
Status: Running
IP: 10.32.0.2
IPs:
IP: 10.32.0.2
Controlled By: ReplicaSet/argo-operator-argo-workflows-workflow-controller-56f8496bb8
Containers:
controller:
Container ID: docker://ff5bc3213ae916da9ec33e64e43efb3e000865d06b389224a13ab5ed4f1188eb
Image: quay.io/argoproj/workflow-controller:v3.2.4
Image ID: docker-pullable://quay.io/argoproj/workflow-controller@sha256:7d72a6b04f484c7de1fc4a63a418bec2efda822832016439ddf91cb5f8cb3734
Ports: 9090/TCP, 6060/TCP
Host Ports: 0/TCP, 0/TCP
Command:
workflow-controller
Args:
--configmap
argo-operator-argo-workflows-workflow-controller-configmap
--executor-image
quay.io/argoproj/argoexec:v3.2.4
--loglevel
info
--gloglevel
0
--workflow-workers
128
--pod-workers
128
State: Running
Started: Tue, 29 Mar 2022 15:25:11 -0700
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Tue, 29 Mar 2022 15:15:57 -0700
Finished: Tue, 29 Mar 2022 15:19:57 -0700
Ready: True
Restart Count: 113
Limits:
cpu: 30
memory: 96Gi
Requests:
cpu: 30
memory: 96Gi
Liveness: http-get http://:6060/healthz delay=90s timeout=30s period=60s #success=1 #failure=3
Environment:
ARGO_NAMESPACE: default (v1:metadata.namespace)
LEADER_ELECTION_IDENTITY: argo-operator-argo-workflows-workflow-controller-56f8496bbxl6hp (v1:metadata.name)
DEFAULT_REQUEUE_TIME: 1m
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4sd56 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-4sd56:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: cloud.google.com/gke-nodepool=facility-pool
kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
What about kubectl get events?
Unfortunately, the relevant controller events are no longer available, since the default TTL for events on GKE is 1h (https://stackoverflow.com/questions/63035494/why-does-my-gke-cluster-not-show-any-events). I can reproduce the crash to get you the kubectl get events output, but that'll probably take me a day or two since I need to spawn a lot of workflows to hit the issue.
Does the controller use pagination to fetch the workflow objects? I did some digging, and it looks like "stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 695; INTERNAL_ERROR" could be caused by fetching a lot of objects in one go (https://stackoverflow.com/questions/61674050/golang-get-kubernetes-resources30000-configmaps-failed).
This looks very suspicious, since the issue goes away when I delete the completed workflows using kubectl delete wf.
For what it's worth, argo delete --older=2d also fails with a similar error when there are a lot of workflows in the cluster.
argo delete --older 2d --request-timeout 3h
FATA[2022-03-29T15:37:52.729Z] stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 1; INTERNAL_ERROR
Looks like argo list supports a --chunk-size flag for pagination, while argo delete doesn't. If my hypothesis is right, argo list should throw the same error when used without the --chunk-size flag and there are a lot of workflows (and I guess the same unpaginated listing is causing the controller crashes as well). I can spawn a lot of workflows and test this out.
@gandeevan I would recommend looking at the load on the k8s API server (k8s master). It looks like the API request is timing out or returning a stream error. Argo talks directly to the k8s API server through client-go.
@sarabala1979 Got it. But wouldn't it be more appropriate to retrieve large result sets in chunks instead of scaling up the API server? I feel that would make for a more robust controller design.
https://kubernetes.io/docs/reference/using-api/api-concepts/#retrieving-large-results-sets-in-chunks
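For illustration, chunked retrieval with client-go's dynamic client looks roughly like the sketch below. This is not Argo's actual code; the page size, namespace, and in-cluster config are assumptions:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes the client runs inside the cluster
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	wfGVR := schema.GroupVersionResource{Group: "argoproj.io", Version: "v1alpha1", Resource: "workflows"}

	// Page through workflows 500 at a time instead of listing everything in one request.
	opts := metav1.ListOptions{Limit: 500}
	total := 0
	for {
		list, err := client.Resource(wfGVR).Namespace("default").List(context.TODO(), opts)
		if err != nil {
			panic(err)
		}
		total += len(list.Items)
		// The API server returns a continue token while more results remain.
		if list.GetContinue() == "" {
			break
		}
		opts.Continue = list.GetContinue()
	}
	fmt.Printf("listed %d workflows in chunks\n", total)
}
```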
@gandeevan yes, you're right. Can you create an issue on client-go? The error is happening in client-go; the Argo controller uses the client-go library for its informers.
pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 33; INTERNAL_ERROR
@gandeevan you can scale up the API server as a workaround.
@sarabala1979 I guess the change should actually go here in the Argo controller, by setting a Limit in tweakListOptions? tweakListOptions is currently set to nil, so the controller retrieves all the objects at once.
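Something along these lines is what I have in mind. This is only a sketch of the generic client-go pattern (a dynamic shared informer factory with a non-nil tweakListOptions), not a patch against the controller; the factory, namespace, resync period, and page size are assumptions:

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes in-cluster credentials
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Instead of passing nil, cap each LIST request issued on behalf of the
	// informer so workflows are fetched in pages rather than all at once.
	tweakListOptions := func(opts *metav1.ListOptions) {
		opts.Limit = 500 // illustrative page size
	}

	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(
		client, 20*time.Minute, "default", tweakListOptions)

	wfGVR := schema.GroupVersionResource{Group: "argoproj.io", Version: "v1alpha1", Resource: "workflows"}
	informer := factory.ForResource(wfGVR).Informer()
	_ = informer // event handlers would be registered here in real code

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
}
```

Whether this alone is enough to avoid the stream errors in the controller would still need to be verified.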
@gandeevan great. Would you like to fix this issue and submit the PR?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/unstale
Please ignore this, I forgot to set timeoutSeconds in the liveness probe.
~~Argo v3.3.5 is stuck in CrashLoopBackOff in our busiest namespace, and there are lots of context canceled logs about healthz. (We used /metrics for the liveness probe before this upgrade.)~~
~~Is this the same issue?~~
time="2022-06-08T12:45:36.600Z" level=info msg=healthz age=5m0s err="Get \"https://10.96.0.1:443/apis/argoproj.io/v1alpha1/namespaces/***/workflows?labelSelector=%21workflows.argoproj.io%2Fphase%2C%21workflows.argoproj.io%2Fcontroller-instanceid\": context canceled" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=***
Normal Created 19m (x4 over 28m) kubelet Created container workflow-controller
Normal Started 19m (x4 over 28m) kubelet Started container workflow-controller
Warning Unhealthy 7m33s (x21 over 27m) kubelet Liveness probe failed: Get "http://10.244.118.204:6060/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning BackOff 3m15s (x8 over 4m32s) kubelet Back-off restarting failed container
We faced a similar problem. The workflow controller in our cluster kept restarting because it was failing the liveness probe. There are over 6000 Workflow objects in our cluster. Argo Workflows version: v3.3.1.
We decided to turn off the liveness probe. The healthz endpoint used by the liveness probe frequently sends very heavy requests to the k8s API server: it lists all Workflows without using informers.
https://github.com/argoproj/argo-workflows/blob/61211f9db1568190dd46b7469fa79eb6530bba73/workflow/controller/healthz.go#L33
This implementation causes not only liveness probe failures but also an unstable k8s API server because of those heavy requests.
I would like an option for a lighter-weight healthz endpoint, e.g. one that skips listing workflows or uses an informer to list them.
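For example, a lightweight probe handler could answer from the informer's local cache instead of issuing a full LIST against the API server. This is purely a sketch, not the actual healthz.go code; the package, handler name, and wiring are made up:

```go
package healthz

import (
	"net/http"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/tools/cache"
)

// NewHandler builds a hypothetical healthz handler that reads from the shared
// informer's cache (kept up to date by watches) rather than listing every
// Workflow from the API server on each probe.
func NewHandler(lister cache.GenericLister) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Listing from the local cache is cheap and puts no load on the API server.
		if _, err := lister.List(labels.Everything()); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	}
}
```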
It could use the "limit" list option to reduce the number of results returned. That would address the problem.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
/unstale