
Controller: `invalid config map object received in config watcher. Ignored processing` with CPU spike

Open juranir opened this issue 1 year ago • 72 comments



Pre-requisites

  • [X] I have double-checked my configuration
  • [X] I can confirm the issue exists when I tested with `:latest`
  • [x] I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

We're migrating our Argo Workflows installation from v3.0.2 hosted in an old Rancher cluster to the latest version in an AWS EKS cluster.

Randomly, the workflow controller's CPU usage increases and it starts to show this error: `level=error msg="invalid config map object received in config watcher. Ignored processing"`

If we delete the pod, the new one starts fine, but after 1 or 2 days (randomly), the error reappears.

The latest occurrence started yesterday, at a moment with no workflows running and no scheduled tasks on the infra side.

We don't see any errors in the k8s API logs that would indicate an EKS outage.

The Argo server component doesn't have errors.

We don't have this error in the old version.

Env:
Argo: v3.4.10
Installation method: Official Helm Chart (0.33.1)
K8s Cluster: AWS EKS 1.27


Workflow-controller k8s config map:

Name:         argo-workflow-controller-configmap
Namespace:    argo
Labels:       app.kubernetes.io/component=workflow-controller
              app.kubernetes.io/instance=argo
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=argo-workflows-cm
              app.kubernetes.io/part-of=argo-workflows
              helm.sh/chart=argo-workflows-0.33.1
Annotations:  meta.helm.sh/release-name: argo
              meta.helm.sh/release-namespace: argo
data:
  metricsConfig:
    enabled: true
    path: /metrics
    port: 9090
    ignoreErrors: false
    secure: false
  persistence:
    archive: true
    archiveTTL: 60d
    postgresql:
      database: argo
      host: <REDACTED>
      passwordSecret:
        key: password
        name: argo-postgres-config
      port: 5432
      ssl: false
      sslMode: disabled
      tableName: argo_workflows
      userNameSecret:
        key: username
        name: argo-postgres-config
  workflowDefaults:
    spec:
      ttlStrategy:
        secondsAfterFailure: 604800
  sso:
    issuer: https://<REDACTED>
    clientId:
      name: argo-server-clientid
      key: clientid
    clientSecret:
      name: argo-server-clientsecret
      key: clientsecret
    redirectUrl: https://<REDACTED>/oauth2/callback
    rbac:
      enabled: true
    scopes: 
      - email
      - profile
      - openid
    issuerAlias: https://<REDACTED>
    sessionExpiry: 240h
  nodeEvents:
    enabled: false

Resource Usage: [screenshot of controller CPU/memory usage]

Version

v3.4.10

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

N/A. Since it starts randomly, we usually don't have any workflows running.

Logs from the workflow controller

Before (nothing):

After (lines like this appear several times; I only have them in the container output):

time="2023-08-23T09:41:01.370Z" level=error msg="invalid config map object received in config watcher. Ignored processing"

Logs from in your workflow's wait container

N/A

juranir avatar Aug 23 '23 10:08 juranir

We don't have this error in the old version.

What was your old version? 3.4.9? If it is a regression, knowing between which two versions would be very helpful

agilgur5 avatar Aug 23 '23 17:08 agilgur5

What was your old version? 3.4.9?

No, this old version is v3.0.2.

This same behavior also happened with the last 2 versions (we downgraded a few times to evaluate).

We just installed the latest version to validate that everything would be smooth; all workflows were fine, and this is the only issue we had.

juranir avatar Aug 23 '23 18:08 juranir

Gotcha. Been a lot of changes since 3.0.2. Though I took a look and this specific code hasn't been changed in years since it was written.

Only thing I can potentially guess at is maybe your ConfigMap has some old config from 3.0.2 that is no longer valid in 3.4.10? Not sure why it wouldn't error earlier though

agilgur5 avatar Aug 24 '23 02:08 agilgur5

Thanks a lot for your help :D

I just compared my values with chart values and I don't have any unknown parameters.

Do you have any idea how I could debug this error further? Like figuring out what kind of object this code is receiving, or any other useful data.

juranir avatar Aug 24 '23 08:08 juranir

You could try to run a debugger and inspect the variables. Or you could try to modify the code to log it out, but the logging itself may fail.
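
On the "logging itself may fail" point: a nil object is only dangerous if the log call dereferences it. As a purely illustrative sketch (the `cmLike` type below is made up; this is not the controller's code), fmt's `%#v` verb formats the raw value and is nil-safe:

```go
package main

import "fmt"

// cmLike is a made-up stand-in for whatever object the watcher hands back.
type cmLike struct{ Name string }

func main() {
	var obj *cmLike // nil, like the "invalid object" reported in the next comment

	// %#v prints the Go-syntax representation and never dereferences the
	// value, so the log line itself cannot panic even when obj is nil.
	fmt.Printf("received object: %#v\n", obj) // received object: (*main.cmLike)(nil)

	// By contrast, dereferencing the nil pointer inside the log call would panic:
	// fmt.Println(obj.Name) // runtime error: invalid memory address or nil pointer dereference
}
```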

agilgur5 avatar Aug 25 '23 18:08 agilgur5

I updated the code to try to log the "invalid object" that is causing the error. What is happening is that the event variable is nil. I don't have any other error logs. I'm using AWS EKS, so I also checked its logs and there are no errors there.

So I have no idea about this :(

juranir avatar Sep 04 '23 09:09 juranir

Maybe this issue helps explain this behavior (I also found similar complaints for projects other than Argo). If so, I think we could add a nil check before casting the object:

```go
// [...]
select {
case event := <-retryWatcher.ResultChan():
	if event.Object == nil {
		return
	}
	cm, ok := event.Object.(*apiv1.ConfigMap)
// [...]
```

I'm assuming that with this change the deferred cleanup will run and a new instance of the watcher will be created.
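
One mechanism that would tie the nil event to the CPU spike (a guess, illustrated with stdlib-only stand-ins rather than the actual controller code): if the RetryWatcher's result channel is closed, every receive returns immediately with a zero-value event, so a loop that only logs and continues will spin at full speed while spamming the error line. A minimal demonstration of that closed-channel behavior:

```go
package main

import "fmt"

// event mimics the shape of k8s.io/apimachinery/pkg/watch.Event for illustration.
type event struct {
	Type   string
	Object interface{}
}

func main() {
	ch := make(chan event)
	close(ch) // simulate the watcher's result channel shutting down

	for i := 0; i < 3; i++ {
		// A receive from a closed channel never blocks and yields the zero
		// value, so Object is nil on every iteration; a loop that merely
		// logs and continues would therefore busy-spin.
		e, ok := <-ch
		fmt.Printf("ok=%v object=%v\n", ok, e.Object) // ok=false object=<nil>
	}
}
```

If that is what happens, returning on a nil Object, as in the snippet above, would break out of the loop and let the deferred cleanup and watcher restart described here kick in.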

Does it make sense to you @agilgur5 ?

juranir avatar Sep 04 '23 13:09 juranir

Hmmm interesting find!

if event.Object == nil {

Didn't you say that event itself was nil?

Maybe this issue helps explain this behavior (I also found similar complaints for projects other than Argo). If so, I think we could add a nil check before casting the object:

So in that issue they suggest using a Reflector or Informer instead to handle several kinds of edge cases including this one. The Argo codebase does use Informers very frequently already, so maybe we should just convert this one. The RetryWatcher source code does mention using Informers instead as well for certain edge cases.

agilgur5 avatar Sep 04 '23 17:09 agilgur5

We're running 3.4.9 and have seen this happen twice in the last month: the controller starts logging the line level=error msg="invalid config map object received in config watcher. Ignored processing" multiple times per second and consuming a lot more resources, and the problem does not go away until we delete the pod. The config map itself is properly configured, and after a few days of running OK, it suddenly begins to spam that error log.

For context, the watcher that logs that error is actually watching the wrong namespace when both --namespaced and --managed-namespace=<other-namespace> are used (see https://github.com/argoproj/argo-workflows/issues/11463). I have no idea whether that is related to the nil event or not, but it leads to the watcher actually watching a non-existent config map.

toblich avatar Sep 11 '23 14:09 toblich

So in that issue they suggest (https://github.com/kubernetes/client-go/issues/334#issuecomment-370561966) using a Reflector or Informer instead to handle several kinds of edge cases including this one. The Argo codebase does use Informers very frequently already, so maybe we should just convert this one. The RetryWatcher source code does mention using Informers instead as well for certain edge cases.

@agilgur5 I'm testing the code below, so far everything is fine, but I'm not sure if it's the best approach:

```go
func (wfc *WorkflowController) runConfigMapWatcher() {
	defer runtimeutil.HandleCrash(runtimeutil.PanicHandlers...)
	ctx := context.Background()

	factory := informers.NewSharedInformerFactoryWithOptions(
		wfc.kubeclientset,
		0,
		informers.WithNamespace(wfc.managedNamespace),
	)

	informer := factory.Core().V1().ConfigMaps().Informer()

	stopper := make(chan struct{})
	defer close(stopper)

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(obj interface{}, newObj interface{}) {
			cm := newObj.(*apiv1.ConfigMap)

			log.Infof("CM Name: %s", cm.GetName())

			if cm.GetName() == wfc.configController.GetName() && wfc.namespace == cm.GetNamespace() {
				log.Infof("Received Workflow Controller config map %s/%s update", cm.Namespace, cm.Name)
				wfc.UpdateConfig(ctx)
			}
			wfc.notifySemaphoreConfigUpdate(cm)
		},
	})
	go informer.Run(stopper)
	if !cache.WaitForCacheSync(stopper, informer.HasSynced) {
		log.Error(fmt.Errorf("Timed out waiting for caches to sync"))
		return
	}
	<-stopper
}
```

juranir avatar Sep 12 '23 09:09 juranir

@agilgur5 I'm testing the code below, so far everything is fine, but I'm not sure if it's the best approach:

That roughly looks correct to me. It also looks similar to other parts of the codebase.

The only thing I can think of is that AddFunc and DeleteFunc might be necessary in case the ConfigMap gets entirely deleted or added at runtime. But I'm not sure the old code even handled that, and it's an edge case of an edge case, so not super critical.
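
For illustration, a handler set covering add, update, and delete could look roughly like the sketch below. It is not a drop-in patch for the controller: it runs against a fake clientset so it stays self-contained, the namespace and resync period are placeholders, and the log calls stand in for whatever reload/notify logic the controller would actually do.

```go
package main

import (
	"log"
	"time"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Fake clientset so the sketch runs without a cluster.
	client := fake.NewSimpleClientset()

	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		10*time.Minute,                  // placeholder resync period
		informers.WithNamespace("argo"), // placeholder namespace
	)
	informer := factory.Core().V1().ConfigMaps().Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if cm, ok := obj.(*apiv1.ConfigMap); ok {
				log.Printf("ConfigMap added: %s/%s", cm.Namespace, cm.Name)
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			if cm, ok := newObj.(*apiv1.ConfigMap); ok {
				log.Printf("ConfigMap updated: %s/%s", cm.Namespace, cm.Name)
			}
		},
		DeleteFunc: func(obj interface{}) {
			// Deletes can arrive wrapped in a tombstone if a watch event was missed.
			if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
				obj = tombstone.Obj
			}
			if cm, ok := obj.(*apiv1.ConfigMap); ok {
				log.Printf("ConfigMap deleted: %s/%s", cm.Namespace, cm.Name)
			}
		},
	})

	stopper := make(chan struct{})
	defer close(stopper)
	go informer.Run(stopper)
	cache.WaitForCacheSync(stopper, informer.HasSynced)
	// A real controller would block here (e.g. <-stopper); the sketch just exits.
}
```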

Admittedly, I am not the most familiar with the internal workings here. If you want to submit a PR, we can certainly get others to give potentially more nuanced feedback (or not if it all checks out)

but I'm not sure if it's the best approach

Is there anything in particular that you were concerned about implementation-wise?

agilgur5 avatar Sep 12 '23 20:09 agilgur5

I waited a bit to see if the error would appear again. So far this code has solved the initial problem, but a new message has started to appear in the log (every minute): `listers.go:79] can not retrieve list of objects using index : Index with name namespace does not exist`

Even with this message everything still works well, but I believe I made some mistake that generated this error.

@agilgur5 or does anyone have an idea about that?

PS: Without my change, this message does not appear.

juranir avatar Sep 18 '23 15:09 juranir

Not sure about the exact error message, but maybe the resync period duration of 0 from your previous code comment is too short? I think most of the existing informers in the codebase are longer (and also it seems like the resync period parameter is very confusing to informer users from a quick search -- it is not the same as a re-list). Also there is a separate issue mentioned above re: managed namespace that potentially impacts it. If you change it to use regular NS and that fixes it, then the error would be due to #11463

agilgur5 avatar Sep 19 '23 01:09 agilgur5

We also just updated to 3.4.10 and see this problem as well. any workarounds available? Should we downgrade to 3.4.8?

roitalpaz avatar Sep 19 '23 06:09 roitalpaz

I tried to increase the resync period and I also tried to define the namespace manually, but the message still appeared every minute. I think this log message comes from this code.

I'm not sure if it's a real error, but I think so because I didn't have this log before.

I also tested with the latest version and the error still remains.

juranir avatar Sep 19 '23 10:09 juranir

I'm starting to think I'm doing something wrong in the image build process.

I tried cloning the repo and running the command `make workflow-controller-image` (without file changes), and the error message `listers.go:79] can not retrieve list of objects using index : Index with name namespace does not exist` keeps appearing, but if I use the official image it doesn't.

I tried using Windows and a Mac M1 (with the --platform linux/amd64 parameter) as the host.

PS. The build task ends without error.

juranir avatar Sep 20 '23 10:09 juranir

There is some problem with my local environment (I don't know why). I downloaded the latest version of the code and made the same change; now the code is working fine without the weird message.

I'll wait some time and create a PR.

juranir avatar Sep 21 '23 15:09 juranir

We also just updated to 3.4.10 and see this problem as well. Should we downgrade to 3.4.8?

@roitalpaz this seems to have existed well before 3.4.10. As I mentioned above, this code path hasn't changed in years.

As such, I'd guess the occurrence was incidental, especially as there are no new features in 3.4.10 (if there were a ConfigMap feature, that might add up). Perhaps this issue just happens more frequently after Controller restarts.

any workarounds available?

As mentioned in the opening comment, restarting your Controller Pod will resolve this.

agilgur5 avatar Sep 30 '23 05:09 agilgur5

We are currently running into the same issue and it breaks our whole logging system. Maybe it's in general a good idea to change the log level to Debug in places that could generate such a high load on the logging system: https://github.com/argoproj/argo-workflows/blob/206c901c6ff122ad06a32820763e504cc5caf190/workflow/controller/controller.go#L414?

zzikkzzakk avatar Nov 17 '23 10:11 zzikkzzakk

Maybe it's in general a good idea to change the log level to Debug

Log levels should be set based on their context. This is an error, and so an error level is correct. Errors are meant to be highly visible; you wouldn't want them to be silent or easily missed. Debug logs are mainly supposed to be helpful when debugging and not normally visible.

in places that could generate such a high load on the logging system

This is not necessarily something you can tell ahead of time. In this case, this is an edge case with RetryWatchers that is apparently unhandled. The code did not expect that and so it logs an error. Because it logged this error a bug report was filed and it is now being fixed.

agilgur5 avatar Nov 17 '23 15:11 agilgur5

I've started observing the same behaviour two days after I updated EKS 1.24 -> 1.28. Argo Workflows version: v3.3.5

andrii-shevchyk avatar Nov 25 '23 14:11 andrii-shevchyk

We are also experiencing this issue; it is overloading our logging system after upgrading.

Are there any other implications to this issue? We can drop the log, or hold off on upgrading our production environments until this is resolved, which is not ideal of course.

Thank you

yaaraze avatar Nov 27 '23 08:11 yaaraze

Running 3.5.1 on EKS 1.28.3, same issue.

razielgn avatar Nov 27 '23 11:11 razielgn

Seeing this with 3.5.1 too; we had to add a log index exclusion filter in Datadog for the workflow-controller as it would consume a LOT of log volume.

stefansedich avatar Nov 30 '23 19:11 stefansedich

Seeing the same issue in EKS. Argo-controller is producing ~1.8 Mb/s of logs.

EKS: v1.28.3-eks-4f4795d
Argo controller: v3.4.4

nickyfoster avatar Dec 01 '23 09:12 nickyfoster

It seems this is only happening with kubernetes version 1.27 or later. Does anyone have evidence of it happening on earlier versions?

We are still using client-go from 0.24, which should not be compatible with 1.27, so perhaps that's the cause here. https://github.com/argoproj/argo-workflows/blob/6c0697928f92160f84e2fc6581551d26e13c0dc5/go.mod#L70

Joibel avatar Dec 01 '23 11:12 Joibel

It seems this is only happening with kubernetes version 1.27 or later. Does anyone have evidence of it happening on earlier versions?

👀 I was wondering why we suddenly got more reports of this.

We are still using client-go from 0.24, which should not be compatible with 1.27, so perhaps that's the cause here.

The informer change also appears to fix it and we should still make that change, but if there's a simpler interim fix, that'd be great.

k8s is backward compatible with all clients (see also the client-go compat docs), so I'm not so sure that upgrading client-go would fix it. It's possible they handled this RetryWatcher edge case better in newer versions if they've seen this more frequently in newer k8s versions, but one would think that would get backported as a fix per the client-go policy.

agilgur5 avatar Dec 01 '23 16:12 agilgur5

maybe this k8s 1.27 change is related? https://github.com/kubernetes/kubernetes/pull/115620

tooptoop4 avatar Dec 01 '23 20:12 tooptoop4

maybe this k8s 1.27 change is related? kubernetes/kubernetes#115620

that seems to be an Informer change, whereas the current code uses a RetryWatcher (it is being updated to an Informer in #11865)

agilgur5 avatar Dec 02 '23 02:12 agilgur5

Is there a workaround for this issue? Our logs are being flooded with this particular error.

pcperera avatar Jan 01 '24 03:01 pcperera