kube-mgmt
replicate cache hanging on configmaps
I am using a fairly large cluster with some configmaps that are constantly updating. When I add the --replicate=v1/configmaps
flag to kube-mgmt, the watch seems to hang for hours (I am seeing around 3+ hours).
The only conclusion I can come to is that, because of the constantly updating configmaps, ResultChan never returns anything?
https://github.com/open-policy-agent/kube-mgmt/blob/6da0e7ea2465b5cc2e3bdf1791186f7637a710b7/pkg/data/generic.go#L185-L223
The only other thing I can think of is that the client-side throttling we have on the cluster is causing the channel to fail.
I0426 16:36:41.173364 49052 request.go:665] Waited for 1.189022768s due to client-side throttling, not priority and fairness, request: GET:
Any help here would be appreciated, thanks.
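For context, the loop in the linked generic.go is roughly this shape (a simplified client-go sketch, not the actual kube-mgmt code): it blocks on ResultChan and only returns once the API server closes the channel or it is told to stop.
// Simplified sketch (not the kube-mgmt implementation) of a client-go watch
// loop over v1/configmaps. The select blocks on ResultChan; it only returns
// when the API server closes the watch or ctx is cancelled.
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func watchConfigMaps(ctx context.Context, client kubernetes.Interface, resourceVersion string) error {
	w, err := client.CoreV1().ConfigMaps(metav1.NamespaceAll).Watch(ctx, metav1.ListOptions{
		ResourceVersion: resourceVersion,
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for {
		select {
		case evt, ok := <-w.ResultChan():
			if !ok {
				// The API server closed the watch; the caller is expected
				// to re-list and restart from the new resourceVersion.
				return fmt.Errorf("watch channel for v1/configmaps closed")
			}
			fmt.Printf("event: %s\n", evt.Type)
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}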
Throttling is not the issue; I was able to reproduce this on a local kind cluster without any throttling. I can see the sync channel keeps closing:
time="2022-04-27T17:22:49Z" level=info msg="Syncing v1/configmaps."
time="2022-04-27T17:22:53Z" level=info msg="Listed v1/configmaps and got 65 resources with resourceVersion 68957. Took 4.4922964s."
time="2022-04-27T17:22:53Z" level=info msg="Loaded v1/configmaps resources into OPA. Took 304.392ms. Starting watch at resourceVersion 68957."
time="2022-04-27T18:18:33Z" level=info msg="Sync channel for v1/configmaps closed. Restarting immediately."
time="2022-04-27T18:18:33Z" level=info msg="Syncing v1/configmaps."
time="2022-04-27T18:18:33Z" level=info msg="Listed v1/configmaps and got 65 resources with resourceVersion 85312. Took 17.362ms."
time="2022-04-27T18:18:33Z" level=info msg="Loaded v1/configmaps resources into OPA. Took 62.9825ms. Starting watch at resourceVersion 85312."
time="2022-04-27T19:03:56Z" level=info msg="Sync channel for v1/configmaps closed. Restarting immediately."
time="2022-04-27T19:03:56Z" level=info msg="Syncing v1/configmaps."
time="2022-04-27T19:03:56Z" level=info msg="Listed v1/configmaps and got 65 resources with resourceVersion 98147. Took 17.3734ms."
time="2022-04-27T19:03:56Z" level=info msg="Loaded v1/configmaps resources into OPA. Took 56.3345ms. Starting watch at resourceVersion 98147."
time="2022-04-27T19:50:35Z" level=info msg="Sync channel for v1/configmaps closed. Restarting immediately."
time="2022-04-27T19:50:35Z" level=info msg="Syncing v1/configmaps."
time="2022-04-27T19:50:35Z" level=info msg="Listed v1/configmaps and got 65 resources with resourceVersion 111433. Took 42.1965ms."
time="2022-04-27T19:50:35Z" level=info msg="Loaded v1/configmaps resources into OPA. Took 182.5547ms. Starting watch at resourceVersion 111433."
time="2022-04-27T20:31:46Z" level=info msg="Sync channel for v1/configmaps closed. Restarting immediately."
time="2022-04-27T20:31:46Z" level=info msg="Syncing v1/configmaps."
time="2022-04-27T20:31:46Z" level=info msg="Listed v1/configmaps and got 65 resources with resourceVersion 123096. Took 31.6638ms."
time="2022-04-27T20:31:46Z" level=info msg="Loaded v1/configmaps resources into OPA. Took 144.7732ms. Starting watch at resourceVersion 123096."
@mvaalexp is this still happening in the latest version?
We also have the same issue; the container just hung for a few days on
time="2022-11-03T17:23:19Z" level=info msg="Loaded 701 resources of kind v1/configmaps into OPA. Took 265.538471ms"
And we are on the latest version, 7.1.1.
Is this using the wrong informer?
https://github.com/open-policy-agent/kube-mgmt/blob/7.1.1/pkg/data/generic.go#L126
// NewInformer returns a Store and a controller for populating the store
// while also providing event notifications. You should only used the returned
// Store for Get/List operations; Add/Modify/Deletes will cause the event
// notifications to be faulty.
It specifically says to use the returned Store only for Get/List operations.
NewSharedIndexInformer, by contrast, says it is created for the listwatcher:
// NewSharedIndexInformer creates a new instance for the listwatcher.
// The created informer will not do resyncs if the given
// defaultEventHandlerResyncPeriod is zero.
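For comparison, here is a minimal sketch of replicating configmaps through a SharedIndexInformer via the standard informer factory (plain client-go, not kube-mgmt's code; the handler bodies are placeholders for whatever would push data into OPA):
// Minimal sketch of consuming configmaps via a SharedIndexInformer instead of
// cache.NewInformer. Assumes standard client-go; handlers are placeholders,
// not kube-mgmt's actual OPA loader.
package main

import (
	"log"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// 0 disables periodic resyncs; a non-zero period replays the cache
	// contents to the handlers at that interval.
	factory := informers.NewSharedInformerFactory(client, 0)
	informer := factory.Core().V1().ConfigMaps().Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			cm := obj.(*corev1.ConfigMap)
			log.Printf("add %s/%s", cm.Namespace, cm.Name) // would PUT to OPA here
		},
		UpdateFunc: func(_, newObj interface{}) {
			cm := newObj.(*corev1.ConfigMap)
			log.Printf("update %s/%s", cm.Namespace, cm.Name)
		},
		DeleteFunc: func(obj interface{}) {
			log.Printf("delete") // would DELETE from OPA here
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)

	// Block until the initial LIST has populated the cache, then keep running;
	// the informer handles watch expiry and re-listing internally.
	if !cache.WaitForCacheSync(stop, informer.HasSynced) {
		log.Fatal("cache never synced")
	}
	select {}
}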
Here is a cluster with 2000 configmaps, and it only ever makes it through 27 of them to my raw HTTP receiver.
grep 'kubernetes/configmaps' /tmp/config-replicate |grep -v HTTP |cut -d \: -f2 |sort |uniq -c |sort -nr |head -5
157 /v1/data/kubernetes/configmaps/test/uno
157 /v1/data/kubernetes/configmaps/test/dos
156 /v1/data/kubernetes/configmaps/test/tres
155 /v1/data/kubernetes/configmaps/test/cuatro
155 /v1/data/kubernetes/configmaps/test/cinco
grep 'kubernetes/configmaps' /tmp/config-replicate |grep -v HTTP |cut -d \: -f2 |sort |uniq -c |wc -l
27
I think this is what the upstream authors meant by "faulty". The cache state is changing too rapidly and resetting the list?
I made a PR based on the client-go comment, but I think the whole diagnosis may have been wrong. It looks like perhaps we were just attacking ourselves with leader-election configmaps, which may have in some way overloaded OPA itself.
As to why we don't ever see all configmaps load... I suspect it's because the LIST function is never initialized, like this: https://github.com/open-policy-agent/kube-mgmt/pull/183/files#diff-10432976b832c2bf005c1a2b317bff34f985023e0e2e43650c2cfd1fbcb1f7faR151-R155
I'm leaving the PR in draft because I think this may possibly be a non-issue with kube-mgmt, given the very high volume of uploads to OPA that should be eliminated first.
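For illustration, the kind of filter I mean looks roughly like this (hypothetical, not the actual PR diff; it assumes leader-election ConfigMaps carry the standard control-plane.alpha.kubernetes.io/leader annotation used by client-go's configmap lock):
// Hypothetical filter: skip ConfigMaps that only exist to hold a
// leader-election record, so their constant renewals are not queued as
// uploads to OPA.
package sketch

import corev1 "k8s.io/api/core/v1"

// Annotation written by client-go's configmap resource lock (assumption: the
// leader-election ConfigMaps in question use this standard key).
const leaderAnnotation = "control-plane.alpha.kubernetes.io/leader"

// isLeaderElectionConfigMap reports whether cm looks like a leader-election
// lock; callers could drop such objects before enqueueing them.
func isLeaderElectionConfigMap(cm *corev1.ConfigMap) bool {
	if cm == nil {
		return false
	}
	_, ok := cm.Annotations[leaderAnnotation]
	return ok
}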
I have debug logging on my branch ignoreNamespaces, which may explain why we never see loads:
time="2023-02-23T01:37:08Z" level=debug msg="queue length: 873"
time="2023-02-23T01:37:11Z" level=debug msg="queue length: 872"
time="2023-02-23T01:37:13Z" level=debug msg="queue length: 871"
time="2023-02-23T01:37:16Z" level=debug msg="queue length: 870"
time="2023-02-23T01:37:19Z" level=debug msg="queue length: 869"
time="2023-02-23T01:37:22Z" level=debug msg="queue length: 868"
time="2023-02-23T01:37:24Z" level=debug msg="queue length: 867"
time="2023-02-23T01:37:27Z" level=debug msg="queue length: 866"
time="2023-02-23T01:37:29Z" level=debug msg="queue length: 865"
time="2023-02-23T01:37:32Z" level=debug msg="queue length: 864"
time="2023-02-23T01:37:34Z" level=debug msg="queue length: 863"
time="2023-02-23T01:37:36Z" level=debug msg="queue length: 862"
This is with the loader fixed to ignore leader elections, plus different namespaces that may not have leader elections labeled correctly. You can see here it's taking ~2 seconds per request, so if the reloads are NOT bypassed, the queue never empties and kube-mgmt falls so far behind it can never finish the queue.
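As a rough back-of-envelope (with an assumed arrival rate of new updates, just to show why the backlog can't clear):
// Back-of-envelope sketch with assumed numbers: if items are uploaded to OPA
// at ~0.4/s (one every ~2.5s, roughly what the debug log above shows) while
// leader-election renewals enqueue updates at a similar or faster rate, the
// net drain rate is zero or negative and the 800+ item backlog never clears.
package main

import "fmt"

func main() {
	const (
		backlog        = 873.0 // items in the queue (from the debug log)
		secondsPerItem = 2.5   // observed time per upload to OPA
		arrivalsPerSec = 0.5   // assumed rate of new/updated configmaps (e.g. leader renewals)
	)
	drainPerSec := 1/secondsPerItem - arrivalsPerSec
	if drainPerSec <= 0 {
		fmt.Println("queue never drains: work arrives at least as fast as it is processed")
		return
	}
	fmt.Printf("queue drains in ~%.0f minutes\n", backlog/drainPerSec/60)
}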
I'm going to link this to https://github.com/open-policy-agent/kube-mgmt/pull/186 as I'm pretty certain this was of our own doing, and I will work to tidy up whatever I can to make this mergeable.