Controller: `invalid config map object received in config watcher. Ignored processing` with CPU spike
edits below by agilgur5 for clarity as this issue is very popular
Updates summary
- this seems to be caused by upgrading your cluster to k8s 1.27
- an upstream k8s fix was applied in 1.27.13+, 1.28.9+, 1.29.4+, and 1.30+ and confirmed working
- there is a workaround available in Argo 3.5.5: setting `WATCH_CONTROLLER_SEMAPHORE_CONFIGMAPS=false`. See #12622
- you can also work around the log spam with log filters, such as this one for DataDog
- a larger fix is under way in #11855
- you can help test this!
edits above by agilgur5 for clarity as this issue is very popular
Pre-requisites
- [X] I have double-checked my configuration
- [X] I can confirm the issue exists when I tested with `:latest`
- [x] I'd like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
We're migrating our Argo Workflows installation from v3.0.2, hosted in an old Rancher cluster, to the latest version in an AWS EKS cluster.
Randomly, the workflow controller increases its CPU usage and starts to show this error:
level=error msg="invalid config map object received in config watcher. Ignored processing"
If we remove the pod, the new one starts well, but after 1 or 2 days (random), the error reappears.
The most recent occurrence started yesterday, at a moment with no workflows running and no scheduled tasks on the infra side.
We don't have an error in the k8s API log to indicate some EKS outage.
The Argo server component doesn't have errors.
We don't have this error in the old version.
Env:
- Argo: v3.4.10
- Installation method: Official Helm Chart (0.33.1)
- K8s Cluster: AWS EKS 1.27
Workflow-controller k8s config map:
```
Name:         argo-workflow-controller-configmap
Namespace:    argo
Labels:       app.kubernetes.io/component=workflow-controller
              app.kubernetes.io/instance=argo
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=argo-workflows-cm
              app.kubernetes.io/part-of=argo-workflows
              helm.sh/chart=argo-workflows-0.33.1
Annotations:  meta.helm.sh/release-name: argo
              meta.helm.sh/release-namespace: argo
data:
  metricsConfig:
    enabled: true
    path: /metrics
    port: 9090
    ignoreErrors: false
    secure: false
  persistence:
    archive: true
    archiveTTL: 60d
    postgresql:
      database: argo
      host: <REDACTED>
      passwordSecret:
        key: password
        name: argo-postgres-config
      port: 5432
      ssl: false
      sslMode: disabled
      tableName: argo_workflows
      userNameSecret:
        key: username
        name: argo-postgres-config
  workflowDefaults:
    spec:
      ttlStrategy:
        secondsAfterFailure: 604800
  sso:
    issuer: https://<REDACTED>
    clientId:
      name: argo-server-clientid
      key: clientid
    clientSecret:
      name: argo-server-clientsecret
      key: clientsecret
    redirectUrl: https://<REDACTED>/oauth2/callback
    rbac:
      enabled: true
    scopes:
      - email
      - profile
      - openid
    issuerAlias: https://<REDACTED>
    sessionExpiry: 240h
  nodeEvents:
    enabled: false
```
Resource Usage
Version
v3.4.10
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
N/A. Since it starts randomly, we usually don't have any workflows running.
Logs from the workflow controller
Before (nothing):
After (lines like this appear several times; I only have them in the container output):
time="2023-08-23T09:41:01.370Z" level=error msg="invalid config map object received in config watcher. Ignored processing"
Logs from in your workflow's wait container
N/A
We don't have this error in the old version.
What was your old version? 3.4.9? If it is a regression, knowing between which two versions would be very helpful
What was your old version? 3.4.9?
No, the old version is v3.0.2.
This same behavior also happened with the last 2 versions (we made some downgrades to try to evaluate).
We just installed the latest version to validate that everything would be smooth; all workflows were fine, this is the only issue we had.
Gotcha. Been a lot of changes since 3.0.2. Though I took a look and this specific code hasn't been changed in years since it was written.
Only thing I can potentially guess at is maybe your ConfigMap has some old config from 3.0.2 that is no longer valid in 3.4.10? Not sure why it wouldn't error earlier though
Thanks a lot for your help :D
I just compared my values with chart values and I don't have any unknown parameters.
Do you have any idea how I could debug this error further? Like figuring out what kind of object this code is receiving, or any other data.
Could try to run a debugger and inspect the variables then. Or could try to modify the code to log it out, but trying to log may itself fail.
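For example, something roughly like this (an untested sketch; the helper and package names are approximations, not the actual controller code):

```go
package controller

import (
	log "github.com/sirupsen/logrus"
	"k8s.io/apimachinery/pkg/watch"
)

// logConfigMapEvent is a hypothetical helper: it logs the raw watch event before
// the type assertion, so we can see what the watcher actually received when it errors.
func logConfigMapEvent(event watch.Event) {
	if event.Object == nil {
		log.Warnf("config watcher delivered an event with a nil Object: type=%v", event.Type)
		return
	}
	log.Infof("config watcher event: type=%v, objectType=%T", event.Type, event.Object)
}
```

Calling something like that at the top of the watcher's result-channel case should show whether the whole event or only its Object is the problem.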
I updated the code to try to log the "invalid object" that is causing the error. What is happening is that the event variable is nil. I don't have any other error logs. I'm using AWS EKS, so I also checked its logs and there are no errors.
So I have no idea about this :(
Maybe this issue helps explain this behavior (I also found other complaints similar to this for projects other than Argo), if so I think we would add a conditional check before casting the object:
```go
// [...]
select {
case event := <-retryWatcher.ResultChan():
	if event.Object == nil {
		return
	}
	cm, ok := event.Object.(*apiv1.ConfigMap)
// [...]
```
I'm assuming that with this change the defer will be executed and a new instance of the watcher will be started.
Does it make sense to you @agilgur5 ?
Hmmm interesting find!
if event.Object == nil {
Didn't you say that `event` itself was nil?
Maybe this issue helps explain this behavior (I also found other complaints similar to this for projects other than Argo), if so I think we would add a conditional check before casting the object:
So in that issue they suggest using a Reflector or Informer instead to handle several kinds of edge cases including this one. The Argo codebase does use Informers very frequently already, so maybe we should just convert this one. The `RetryWatcher` source code does mention using Informers instead as well for certain edge cases.
We're running 3.4.9 and have seen this happen twice in the last month, where the controller starts logging the line `level=error msg="invalid config map object received in config watcher. Ignored processing"` multiple times per second and starts consuming a lot more resources, and the problems do not go away until we delete the pod. The config map itself is properly configured, and after a few days of running ok, it suddenly begins to spam that error log.
For context, the watcher that logs that error is actually watching the wrong namespace when both `--namespaced` and `--managed-namespace=<other-namespace>` are used (see https://github.com/argoproj/argo-workflows/issues/11463). I have no idea whether that is related to the nil event or not, but it leads to the watcher actually watching a non-existent config map.
So in that issue they suggest (https://github.com/kubernetes/client-go/issues/334#issuecomment-370561966) using a Reflector or Informer instead to handle several kinds of edge cases including this one. The Argo codebase does use Informers very frequently already, so maybe we should just convert this one. The `RetryWatcher` source code does mention using Informers instead as well for certain edge cases.
@agilgur5 I'm testing the code below, so far everything is fine, but I'm not sure if it's the best approach:
```go
func (wfc *WorkflowController) runConfigMapWatcher() {
	defer runtimeutil.HandleCrash(runtimeutil.PanicHandlers...)
	ctx := context.Background()
	factory := informers.NewSharedInformerFactoryWithOptions(
		wfc.kubeclientset,
		0,
		informers.WithNamespace(wfc.managedNamespace),
	)
	informer := factory.Core().V1().ConfigMaps().Informer()
	stopper := make(chan struct{})
	defer close(stopper)
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(obj interface{}, newObj interface{}) {
			cm := newObj.(*apiv1.ConfigMap)
			log.Infof("CM Name: %s", cm.GetName())
			if cm.GetName() == wfc.configController.GetName() && wfc.namespace == cm.GetNamespace() {
				log.Infof("Received Workflow Controller config map %s/%s update", cm.Namespace, cm.Name)
				wfc.UpdateConfig(ctx)
			}
			wfc.notifySemaphoreConfigUpdate(cm)
		},
	})
	go informer.Run(stopper)
	if !cache.WaitForCacheSync(stopper, informer.HasSynced) {
		log.Error(fmt.Errorf("Timed out waiting for caches to sync"))
		return
	}
	<-stopper
}
```
@agilgur5 I'm testing the code below, so far everything is fine, but I'm not sure if it's the best approach:
A bit hard to read as there isn't syntax highlighting and the indentation got screwed up a little, but that roughly looks correct to me. It also looks similar to other parts of the codebase.
Only thing I can think of is that `AddFunc` and `DeleteFunc` might be necessary for the case where the ConfigMap gets entirely deleted or added during runtime. But I'm not sure if the old code even handled that, and it is an edge of an edge case, so not super critical.
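Roughly, that could look like this (an untested sketch assuming the same informer setup as your snippet above; the helper name is hypothetical):

```go
package controller

import (
	log "github.com/sirupsen/logrus"
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// addConfigMapHandlers sketches how AddFunc and DeleteFunc could be registered
// alongside UpdateFunc, so the watcher also reacts when the ConfigMap is created
// or deleted at runtime. onChange stands in for the update logic above.
func addConfigMapHandlers(informer cache.SharedIndexInformer, onChange func(cm *apiv1.ConfigMap)) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if cm, ok := obj.(*apiv1.ConfigMap); ok {
				onChange(cm)
			}
		},
		UpdateFunc: func(_ interface{}, newObj interface{}) {
			if cm, ok := newObj.(*apiv1.ConfigMap); ok {
				onChange(cm)
			}
		},
		DeleteFunc: func(obj interface{}) {
			// deletes can be delivered as DeletedFinalStateUnknown tombstones
			cm, ok := obj.(*apiv1.ConfigMap)
			if !ok {
				if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
					cm, _ = tombstone.Obj.(*apiv1.ConfigMap)
				}
			}
			if cm != nil {
				log.Infof("Workflow Controller config map %s/%s deleted", cm.Namespace, cm.Name)
			}
		},
	})
}
```

The tombstone handling in `DeleteFunc` is the usual client-go pattern for deletes that are only observed after a relist.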
Admittedly, I am not the most familiar with the internal workings here. If you want to submit a PR, we can certainly get others to give potentially more nuanced feedback (or not if it all checks out)
but I'm not sure if it's the best approach
Is there anything in particular that you were concerned about implementation-wise?
I waited a bit to see if the error would appear again. So far this code has solved the initial problem, but a new message has started to appear in the log (every 1min):
listers.go:79] can not retrieve list of objects using index : Index with name namespace does not exist
Even with this message, everything still works well, but I believe I made some mistakes that generated this error.
@agilgur5 or does anyone have an idea about that?
PS: Without my change, this message does not appear.
Not sure about the exact error message, but maybe the resync period duration of `0` from your previous code comment is too short? I think most of the existing informers in the codebase use a longer one (and also, from a quick search, it seems like the resync period parameter is very confusing to informer users -- it is not the same as a re-list).
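For example (a sketch only -- 20 minutes is an assumed value, not necessarily what the other informers use, and the helper is hypothetical):

```go
package controller

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newConfigMapInformerFactory shows a non-zero resync period instead of 0.
// The resync period only controls how often cached objects are re-delivered to
// the UpdateFunc handler; it is not a re-list against the API server.
func newConfigMapInformerFactory(client kubernetes.Interface, namespace string) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactoryWithOptions(
		client,
		20*time.Minute, // assumed value; pick something in line with the other informers
		informers.WithNamespace(namespace),
	)
}
```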
Also there is a separate issue mentioned above re: managed namespace that potentially impacts it. If you change it to use regular NS and that fixes it, then the error would be due to #11463
We also just updated to 3.4.10 and see this problem as well. any workarounds available? Should we downgrade to 3.4.8?
I tried to increase the resync period and I also tried to define the namespace manually, but the message still appeared every minute. I think this log message comes from this code.
I'm not sure if it's a real error, but I think so because I didn't have this log before.
I also tested with the latest version and the error still remains.
I'm starting to think I'm doing something wrong in the image build process.
I tried cloning the repo and running the command `make workflow-controller-image` (without any file changes), and the error message `listers.go:79] can not retrieve list of objects using index : Index with name namespace does not exist` keeps appearing, but if I use the official image it doesn't.
I tried using Windows and a Mac M1 (with the --platform linux/amd64 parameter) as the host.
PS: The build task finishes without errors.
There was some problem with my local environment (I don't know why). I downloaded the latest version of the code and made the same change; now the code is working fine without the weird message.
I'll wait some time and create a PR.
We also just updated to 3.4.10 and see this problem as well. Should we downgrade to 3.4.8?
@roitalpaz this seems to have existed well before 3.4.10. As I mentioned above, this code path hasn't changed in years.
As such, I'd guess that occurrence was incidental, especially as there are no new features in 3.4.10 (if there were a ConfigMap feature, that might add up). Perhaps this issue just happens more frequently after Controller restarts.
any workarounds available?
as mentioned in the opening comment, restarting your Controller Pod will resolve this
We are currently running into the same issue and it breaks our whole logging system. Maybe it would in general be a good idea to change the log level to Debug in places which could generate such a high load on the logging system: https://github.com/argoproj/argo-workflows/blob/206c901c6ff122ad06a32820763e504cc5caf190/workflow/controller/controller.go#L414 ?
Maybe it's in general a good idea to change the log level to Debug
Log levels should be set based on their context. This is an error and so an error level is correct. Errors are meant to be highly visible, you wouldn't want them to be silent or easily missed. Debug logs are mainly supposed to be helpful when debugging and not normally visible.
in places which could generate such high load to the logging system
This is not necessarily something you can tell ahead of time. In this case, this is an edge case with `RetryWatcher`s that is apparently unhandled. The code did not expect that and so it logs an error. Because it logged this error, a bug report was filed and it is now being fixed.
I've started observing the same behaviour two days after I updated EKS 1.24 -> 1.28. Argo Workflows version: v3.3.5
We are also experiencing this issue; it is overloading our logging system after upgrading.
Are there any other implications to this issue? We can drop the log, or hold off on version upgrades to our production environments until this is resolved, which is not ideal of course.
Thank you
Running 3.5.1 on EKS 1.28.3, same issue.
Seeing this with 3.5.1 too; we had to add a log index exclusion filter in DataDog for the workflow-controller as it would consume a LOT of log volume.
Seeing the same issue in EKS. Argo-controller is producing ~1.8 Mb/s of logs.
EKS: v1.28.3-eks-4f4795d Argo controller: v3.4.4
It seems this is only happening with kubernetes version 1.27 or later. Does anyone have evidence of it happening on earlier versions?
We are still using client-go from 0.24, which should not be compatible with 1.27, so perhaps that's the cause here. https://github.com/argoproj/argo-workflows/blob/6c0697928f92160f84e2fc6581551d26e13c0dc5/go.mod#L70
It seems this is only happening with kubernetes version 1.27 or later. Does anyone have evidence of it happening on earlier versions?
👀 I was wondering why we suddenly got more reports of this.
We are still using client-go from 0.24, which should not be compatible with 1.27, so perhaps that's the cause here.
The informer change also appears to fix it and we should still make that change, but if there's a simpler interim fix if that'd be great.
k8s is backward compatible with all clients (see also the `client-go` compat docs), so I'm not so sure that upgrading `client-go` would fix it. It's possible they handled this `RetryWatcher` edge case better in newer versions if they've seen this more frequently in newer k8s versions, but one would think that would get backported as a fix per the `client-go` policy.
maybe this k8s 1.27 change is related? https://github.com/kubernetes/kubernetes/pull/115620
maybe this k8s 1.27 change is related? kubernetes/kubernetes#115620
that seems to be an Informer change, whereas the current code uses a `RetryWatcher` (it is being updated to an Informer in #11865)
Is there a workaround for this issue? Our logs are being flooded with this particular error.