
Controller crashes when there are a lot of completed workflows

Open gandeevan opened this issue 2 years ago • 22 comments

The argo controller seems to be crashing when there are a lot of completed workflows (~6000). I know I could use a workflowTTL and archiving to not keep around a lot of completed workflows, but 6k doesn't seem like a crazy number. Is this a known issue?

Argo v3.2.4

Here are the controller logs:

2022-03-29 15:13:33.173 PDT
controller
time="2022-03-29T22:13:33.173Z" level=info msg="List workflows 200"
Info
2022-03-29 15:14:01.767 PDT
controller
"Trace[518776447]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167 (29-Mar-2022 22:13:01.766) (total time: 60001ms):"
Error
2022-03-29 15:14:01.767 PDT
controller
Trace[518776447]: [1m0.00142874s] [1m0.00142874s] END
Error
2022-03-29 15:14:01.767 PDT
controller
"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 33; INTERNAL_ERROR"
Error
2022-03-29 15:14:28.974 PDT
controller
time="2022-03-29T22:14:28.974Z" level=info msg="List workflows 200"
Error
2022-03-29 15:14:29.045 PDT
controller
time="2022-03-29T22:14:29.045Z" level=info msg=healthz age=5m0s err="workflow never reconciled: cluster-alignment-template-2chwh" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=
Error
2022-03-29 15:14:35.538 PDT
controller
time="2022-03-29T22:14:35.537Z" level=info msg="List workflows 200"
Info
2022-03-29 15:15:03.964 PDT
controller
"Trace[1414262326]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167 (29-Mar-2022 22:14:03.963) (total time: 60001ms):"
Error
2022-03-29 15:15:03.964 PDT
controller
Trace[1414262326]: [1m0.001385813s] [1m0.001385813s] END
Error
2022-03-29 15:15:03.964 PDT
controller
"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed 

gandeevan avatar Mar 29 '22 22:03 gandeevan

We have many clusters with > 5k workflows that run happily. You probably need to grow your cluster.

alexec avatar Mar 29 '22 22:03 alexec

I'm not sure this has to do with the cluster size. We run a pretty big GKE cluster with around 2k nodes. Each job spawns 100+ pods. Could this be related to https://github.com/kubernetes/client-go/issues/871?

gandeevan avatar Mar 30 '22 00:03 gandeevan

Can you maybe think of any experiments I could run (e.g. running the controller in debug mode) or logs that could be provided to get more clarity into the root cause?

gandeevan avatar Mar 30 '22 00:03 gandeevan

Can you please attach kubectl describe pod -l app=workflow-controller so we can see what the state was at the time of the crash? Your logs only show a health check failure.

alexec avatar Mar 30 '22 01:03 alexec

/area controller

alexec avatar Mar 30 '22 01:03 alexec

Name:         argo-operator-argo-workflows-workflow-controller-56f8496bbrrstr
Namespace:    default
Priority:     0
Node:         gke-argo-dev-1-facility-pool-f1b846cb-gn8c/10.16.0.10
Start Time:   Wed, 16 Mar 2022 13:41:02 -0700
Labels:       app.kubernetes.io/component=workflow-controller
              app.kubernetes.io/instance=argo-operator
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=argo-workflows-workflow-controller
              app.kubernetes.io/part-of=argo-workflows
              app.kubernetes.io/version=v3.2.4
              helm.sh/chart=argo-workflows-0.9.2
              pod-template-hash=56f8496bb8
Annotations:  <none>
Status:       Running
IP:           10.32.2.3
IPs:
  IP:           10.32.2.3
Controlled By:  ReplicaSet/argo-operator-argo-workflows-workflow-controller-56f8496bb8
Containers:
  controller:
    Container ID:  docker://fa09738aaedb866145acd9bcb4b5748a0a98f22a3538e1148ffacf6f32fd8ea0
    Image:         quay.io/argoproj/workflow-controller:v3.2.4
    Image ID:      docker-pullable://quay.io/argoproj/workflow-controller@sha256:7d72a6b04f484c7de1fc4a63a418bec2efda822832016439ddf91cb5f8cb3734
    Ports:         9090/TCP, 6060/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      workflow-controller
    Args:
      --configmap
      argo-operator-argo-workflows-workflow-controller-configmap
      --executor-image
      quay.io/argoproj/argoexec:v3.2.4
      --loglevel
      info
      --gloglevel
      0
      --workflow-workers
      128
      --pod-workers
      128
    State:          Running
      Started:      Tue, 29 Mar 2022 15:21:36 -0700
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 29 Mar 2022 15:11:31 -0700
      Finished:     Tue, 29 Mar 2022 15:16:27 -0700
    Ready:          True
    Restart Count:  112
    Limits:
      cpu:     30
      memory:  96Gi
    Requests:
      cpu:     30
      memory:  96Gi
    Liveness:  http-get http://:6060/healthz delay=90s timeout=30s period=60s #success=1 #failure=3
    Environment:
      ARGO_NAMESPACE:            default (v1:metadata.namespace)
      LEADER_ELECTION_IDENTITY:  argo-operator-argo-workflows-workflow-controller-56f8496bbrrstr (v1:metadata.name)
      DEFAULT_REQUEUE_TIME:      1m
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-w6vhv (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kube-api-access-w6vhv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              cloud.google.com/gke-nodepool=facility-pool
                             kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>


Name:         argo-operator-argo-workflows-workflow-controller-56f8496bbxl6hp
Namespace:    default
Priority:     0
Node:         gke-argo-dev-1-facility-pool-f1b846cb-jvo0/10.16.0.9
Start Time:   Wed, 16 Mar 2022 13:37:56 -0700
Labels:       app.kubernetes.io/component=workflow-controller
              app.kubernetes.io/instance=argo-operator
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=argo-workflows-workflow-controller
              app.kubernetes.io/part-of=argo-workflows
              app.kubernetes.io/version=v3.2.4
              helm.sh/chart=argo-workflows-0.9.2
              pod-template-hash=56f8496bb8
Annotations:  <none>
Status:       Running
IP:           10.32.0.2
IPs:
  IP:           10.32.0.2
Controlled By:  ReplicaSet/argo-operator-argo-workflows-workflow-controller-56f8496bb8
Containers:
  controller:
    Container ID:  docker://ff5bc3213ae916da9ec33e64e43efb3e000865d06b389224a13ab5ed4f1188eb
    Image:         quay.io/argoproj/workflow-controller:v3.2.4
    Image ID:      docker-pullable://quay.io/argoproj/workflow-controller@sha256:7d72a6b04f484c7de1fc4a63a418bec2efda822832016439ddf91cb5f8cb3734
    Ports:         9090/TCP, 6060/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      workflow-controller
    Args:
      --configmap
      argo-operator-argo-workflows-workflow-controller-configmap
      --executor-image
      quay.io/argoproj/argoexec:v3.2.4
      --loglevel
      info
      --gloglevel
      0
      --workflow-workers
      128
      --pod-workers
      128
    State:          Running
      Started:      Tue, 29 Mar 2022 15:25:11 -0700
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 29 Mar 2022 15:15:57 -0700
      Finished:     Tue, 29 Mar 2022 15:19:57 -0700
    Ready:          True
    Restart Count:  113
    Limits:
      cpu:     30
      memory:  96Gi
    Requests:
      cpu:     30
      memory:  96Gi
    Liveness:  http-get http://:6060/healthz delay=90s timeout=30s period=60s #success=1 #failure=3
    Environment:
      ARGO_NAMESPACE:            default (v1:metadata.namespace)
      LEADER_ELECTION_IDENTITY:  argo-operator-argo-workflows-workflow-controller-56f8496bbxl6hp (v1:metadata.name)
      DEFAULT_REQUEUE_TIME:      1m
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4sd56 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kube-api-access-4sd56:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              cloud.google.com/gke-nodepool=facility-pool
                             kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

gandeevan avatar Mar 30 '22 04:03 gandeevan

what about kubectl get events?

alexec avatar Mar 30 '22 04:03 alexec

Unfortunately the relevant controller events are not being returned by GKE, since the default TTL for events on GKE is 1h (https://stackoverflow.com/questions/63035494/why-does-my-gke-cluster-not-show-any-events). I can reproduce the crash to get you the kubectl events, but that'll probably take me a day or two since I need to spawn a lot of workflows to hit the issue.

Does the controller use pagination to fetch the workflow objects? I did some digging and it looks like "stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 695; INTERNAL_ERROR" could be caused by fetching a lot of objects in one go (https://stackoverflow.com/questions/61674050/golang-get-kubernetes-resources30000-configmaps-failed).

This looks very suspicious since the issue goes away when I delete the completed workflows using kubectl delete wf.

For what it's worth, argo delete --older=2d also fails with a similar error when there are a lot of workflows in the cluster.

argo delete --older 2d --request-timeout 3h
FATA[2022-03-29T15:37:52.729Z] stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 1; INTERNAL_ERROR 

gandeevan avatar Mar 30 '22 04:03 gandeevan

Looks like argo list supports a --chunk-size flag for pagination and argo delete doesn't. If my hypothesis is right, argo list should throw the same error when used without the --chunk-size flag while there are a lot of workflows (and I guess the same thing is causing the controller crashes as well). I can spawn a lot of workflows and test this out.

gandeevan avatar Mar 30 '22 05:03 gandeevan

@gandeevan I would recommend looking at the load on the k8s API server (k8s master). It looks like the API request is timing out or returning a stream error. Argo talks directly to the k8s API server through client-go.

sarabala1979 avatar Mar 30 '22 17:03 sarabala1979

@sarabala1979 Got it. But wouldn't it be more appropriate to retrieve large result sets in chunks instead of scaling up the API server? I feel that would make for a more robust controller design.

https://kubernetes.io/docs/reference/using-api/api-concepts/#retrieving-large-results-sets-in-chunks
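For reference, here is a minimal sketch of the limit/continue pattern from that page, using client-go's dynamic client against the Workflow resource. The function name and page size are just illustrative, not what the controller actually does:

```go
package listing

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// listWorkflowsInChunks pages through Workflow objects 500 at a time instead
// of asking the API server for everything in a single response.
func listWorkflowsInChunks(ctx context.Context, client dynamic.Interface, namespace string) error {
	gvr := schema.GroupVersionResource{Group: "argoproj.io", Version: "v1alpha1", Resource: "workflows"}
	opts := metav1.ListOptions{Limit: 500} // page size is illustrative
	for {
		page, err := client.Resource(gvr).Namespace(namespace).List(ctx, opts)
		if err != nil {
			return err
		}
		fmt.Printf("got %d workflows in this page\n", len(page.Items))
		// The continue token is empty once the last page has been returned.
		if page.GetContinue() == "" {
			return nil
		}
		opts.Continue = page.GetContinue()
	}
}
```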

gandeevan avatar Mar 30 '22 17:03 gandeevan

@gandeevan yes, you're right. Can you create an issue on client-go? The error is happening in client-go. The Argo controller uses the client-go library for its informers.

"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 33; INTERNAL_ERROR"
Error

sarabala1979 avatar Mar 30 '22 18:03 sarabala1979

@gandeevan you can scale up the API server as a workaround.

sarabala1979 avatar Mar 30 '22 18:03 sarabala1979

@sarabala1979 I guess the change should actually go here in the Argo controller, by setting a Limit in tweakListOptions? tweakListOptions is currently set to nil, so the informer retrieves all the objects in one go.
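For illustration, a rough sketch of that idea, assuming the controller builds a dynamic shared informer factory from client-go. The helper name, resync period, and page size below are hypothetical:

```go
package informers

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
)

// newWorkflowInformerFactory is a hypothetical helper showing the idea:
// instead of passing nil for tweakListOptions, set a Limit so the reflector's
// LIST calls are served in pages rather than one huge response.
func newWorkflowInformerFactory(client dynamic.Interface, namespace string) dynamicinformer.DynamicSharedInformerFactory {
	tweakListOptions := func(options *metav1.ListOptions) {
		options.Limit = 500 // page size is illustrative
	}
	return dynamicinformer.NewFilteredDynamicSharedInformerFactory(
		client,
		20*time.Minute, // resync period, also illustrative
		namespace,
		tweakListOptions,
	)
}
```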

gandeevan avatar Mar 30 '22 19:03 gandeevan

@gandeevan great. Would you like to fix this issue and submit the PR?

sarabala1979 avatar Mar 30 '22 23:03 sarabala1979

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '22 02:04 stale[bot]

/unstale

k1rk avatar Apr 25 '22 14:04 k1rk

Please ignore this, I forgot to set timeoutSeconds in the liveness probe. ~~Argo v3.3.5 is stuck in CrashLoopBackOff in our busiest namespace, and there are lots of context canceled logs about healthz. (We used /metrics for the liveness probe before this upgrade.)~~

~~Is it the same as this issue?~~

time="2022-06-08T12:45:36.600Z" level=info msg=healthz age=5m0s err="Get \"https://10.96.0.1:443/apis/argoproj.io/v1alpha1/namespaces/***/workflows?labelSelector=%21workflows.argoproj.io%2Fphase%2C%21workflows.argoproj.io%2Fcontroller-instanceid\": context canceled" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=***
  Normal   Created    19m (x4 over 28m)      kubelet            Created container workflow-controller
  Normal   Started    19m (x4 over 28m)      kubelet            Started container workflow-controller
  Warning  Unhealthy  7m33s (x21 over 27m)   kubelet            Liveness probe failed: Get "http://10.244.118.204:6060/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  BackOff    3m15s (x8 over 4m32s)  kubelet            Back-off restarting failed container

minglei-pony avatar Jun 09 '22 08:06 minglei-pony

We faced a similar problem. The workflow controller in our cluster kept restarting because it kept failing the liveness probe. There are over 6000 Workflow objects in our cluster. Argo Workflows version: v3.3.1

We decided to turn off the liveness probe. The healthz endpoint used by the liveness probe frequently sends very heavy requests to the k8s API server: it lists all Workflows without using an informer.

https://github.com/argoproj/argo-workflows/blob/61211f9db1568190dd46b7469fa79eb6530bba73/workflow/controller/healthz.go#L33

This implementation causes not only liveness probe failures but also an unstable k8s API server, due to the high-load requests.

I would like an option for a lighter-weight healthz endpoint implementation, such as skipping the workflow list request entirely or listing workflows through an informer.

shioshiota avatar Jun 30 '22 11:06 shioshiota

It could use the “limit” list option to reduce the number of results returned. That would address the problem.
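As a rough sketch of that suggestion (not the actual healthz implementation), a single list call with Limit set, using the dynamic client and the label selector seen in the controller logs above:

```go
package healthz

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// listUnreconciled fetches at most `limit` not-yet-reconciled workflows for a
// health check, rather than pulling every matching object from the API server.
func listUnreconciled(ctx context.Context, client dynamic.Interface, namespace string, limit int64) (int, error) {
	gvr := schema.GroupVersionResource{Group: "argoproj.io", Version: "v1alpha1", Resource: "workflows"}
	list, err := client.Resource(gvr).Namespace(namespace).List(ctx, metav1.ListOptions{
		// Same selector the controller logs for healthz: workflows with no
		// phase label and no controller instance ID.
		LabelSelector: "!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid",
		Limit:         limit,
	})
	if err != nil {
		return 0, err
	}
	return len(list.Items), nil
}
```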

alexec avatar Jun 30 '22 15:06 alexec

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Jul 31 '22 04:07 stale[bot]

/unstale

roofurmston avatar Jul 31 '22 09:07 roofurmston