Scaling the ansible operator for six custom resources with 1000s of objects

Open stiller-leser opened this issue 2 years ago • 21 comments

Type of question

Best practices

Question

I am currently looking into the performance of my operator and how it could be improved, as for one of my CRs I am seeing up to 25 minutes until it is first reconciled.

Looking into the /metrics endpoint of my operator I found:

# HELP ansible_operator_reconciles How long in seconds a reconcile takes.
# TYPE ansible_operator_reconciles histogram
ansible_operator_reconciles_sum{GVK="ops.com/v1alpha1, Kind=Artifact"} 4.702623642999996
ansible_operator_reconciles_sum{GVK="ops.com/v1alpha1, Kind=Operation"} 13.63416719100003
ansible_operator_reconciles_sum{GVK="ops.com/v1alpha1, Kind=Project"} 9.061215722000007
ansible_operator_reconciles_sum{GVK="ops.com/v1alpha1, Kind=Snapshot"} 8.568096267000008
ansible_operator_reconciles_sum{GVK="ops.com/v1alpha1, Kind=System"} 3.2389071760000006

Do I interpret this correctly, that one reconciliation of my "Operation" resource takes almost 14 seconds?

In general, the operator takes a lot of memory from my node (8 cpus):

$ kubectl top nodes
NAME                                                        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
...
my-operator-node-some-id                       8004m        101%            7018Mi          22% 
...

In general, my operator is configured with --max-concurrent-reconciles=8.
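
For reference, the flag is set on the manager container of the operator Deployment, roughly like this (container and image names below are placeholders, not our real ones):

spec:
  template:
    spec:
      containers:
        - name: manager                          # placeholder name
          image: example.com/my-operator:latest  # placeholder image
          args:
            - --max-concurrent-reconciles=8      # parallel reconciles allowed per controller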

The operator itself mostly relies on the kubernetes.core modules. Activating turbo mode led to my operator being OOM killed every few minutes. "All" the operator does is CRUD operations on some Kubernetes-native resources (i.e. no CRs in that sense).

Any advice on how to debug my operator performance is highly appreciated.

Environment

Operator type:

/language ansible

Kubernetes cluster type:

vanilla

$ operator-sdk version

ansible-operator version: "v1.12.0", commit: "d3b2761afdb78f629a7eaf4461b0fb8ae3b02860", kubernetes version: "v1.21", go version: "go1.16.8", GOOS: "linux", GOARCH: "amd64"

$ go version (if language is Go)

not present

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:38:50Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T20:59:07Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}

stiller-leser avatar Sep 27 '21 16:09 stiller-leser

I am currently looking into the performance of my operator and how it could be improved, as for one of my CRs I am seeing up to 25 minutes until it is first reconciled.

@stiller-leser the time to first reconcile is likely due to re-reconciliation on startup, which will take a while with that number of resources in the cluster. I'm not sure off the top of my head how much work it would take, but to fix that we'd have to find a way to prioritize new, unreconciled resources over re-reconciliation to fix drift while the operator was non-functional. If that's not supported already (I don't think it is), it would likely have to be done in controller-runtime.

Do I interpret this correctly, that one reconciliation of my "Operation" resource takes almost 14 seconds?

Yes, I think you read that correctly

In general, the operator takes a lot of memory from my node (8 cpus):

That does seem quite high. Is this relatively stable, or does it rise over time? Generally the informer cache should be the largest persistent consumer of memory, but 8GB seems excessive. You only have 8 simultaneous reconciliations, so unless each Ansible run is consuming 1GB of RAM, I'd expect it to be something in the Golang ansible-operator binary. Does setting watchDependentResources to false have any impact on memory pressure?
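
If you want to try that, a minimal watches.yaml sketch for one of your GVKs (group/version/kind taken from your metrics output; the playbook path is a placeholder) would look roughly like:

- version: v1alpha1
  group: ops.com
  kind: Operation
  playbook: playbooks/operation.yml   # placeholder path
  watchDependentResources: false      # only changes to the Operation CR itself trigger reconciles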

The operator itself mostly relies on the kubernetes.core modules. Activating turbo mode led to my operator being OOM killed every few minutes.

That is interesting; if anything, I would have expected it to ease memory pressure. I've asked the Ansible folks if they have any idea why that may be; they were curious whether you could share a playbook that triggers the issue.

Any advice on how to debug my operator performance is highly appreciated.

Unfortunately, at the moment we don't have any really good utilities for this. We've been looking into ways to output performance-related metrics from controller-runtime, but there's nothing in place yet. Currently the only way to get more information would be to add some instrumentation to the ansible-operator code and compile and use that custom binary. I'm happy to work more with you on this; I don't think we've looked at an Ansible-based operator performing at this scale before, and I am definitely very interested in figuring out how to make this work better for you. I'll see what information I can pull together in the meantime.

fabianvf avatar Sep 27 '21 19:09 fabianvf

Hi,

thank you for your detailed answer. It is very much appreciated. I am not sure whether I can share any playbooks; I'll get back to you on that.

First, I misspoke: The operator is not consuming a lot of memory, but all available CPU. See output of kubectl top nodes.

I just did a quick calculation of the current number of objects. Right now we have about:

  • 414 Snapshot CRs
  • 25 Artifact CRs
  • 29 Operation CRs
  • 12 System CRs
  • 18 Project CRs
  • 18 Namespaces
  • 67 NetworkPolicies
  • 21 ResourceQuotas
  • 438 VolumeSnapshots
  • 799 VolumeSnapshotContents
  • 12 PVCs
  • 12 PVs
  • 12 Services
  • 12 Deployments
  • 12 ReplicaSets
  • 2 Jobs
  • 126 Pods

That adds up to roughly 2,000 objects, and there are probably a couple of ConfigMaps and Secrets that would need to be added to the mix as well. The huge number of VolumeSnapshots / VolumeSnapshotContents created by our Snapshot CR (which is created by the Artifact CR) is due to the fact that K8s does not allow a VolumeSnapshotContent to be used by multiple VolumeSnapshots. Since we work on a namespace basis (wrapped in the Project CR), we need to replicate the VolumeSnapshot/VolumeSnapshotContent combination in every namespace. Hence, the count easily grows with every new Project CR.

Do I understand watchDependentResources correctly, in that if it is disabled, only my CRs would be watched? Would the created K8s-native resources still be deleted if the parent CR is deleted? Since I do not rely on being able to, e.g., determine pod health (the K8s Deployment does that for me), disabling it might be an option to reduce some load. The question is how effective that would be, though.

Since all resources except the System only need to be created (or updated up to a certain point) and not continuously reconciled (we are dealing with "pets" rather than "cattle"), I am thinking about skipping any further reconciliation once that point is reached and only restarting reconciliation if the state changes to absent. Would this be a viable approach (aside from the fact that the point at which I skip reconciliation would need to be chosen carefully)?

EDIT: Is there a way to tell whether it is the first reconciliation, other than checking whether the resources to be created already exist? That way I could skip some additional tasks.

Best regards, stiller-leser

stiller-leser avatar Sep 28 '21 08:09 stiller-leser

First, I misspoke: The operator is not consuming a lot of memory, but all available CPU. See output of kubectl top nodes.

Interesting; we've definitely done less CPU profiling than memory profiling. I'd be surprised if it was Ansible eating up all that CPU (since Ansible should mostly be waiting on network calls); I wonder if it's just the quantity of events being processed by the Go binary.

Do I understand watchDependentResources correctly, in that if it is disabled, only my CRs would be watched?

Yes, that's correct

Would the created K8s-native resources still be deleted if the parent CR is deleted? Since I do not rely on being able to, e.g., determine pod health (the K8s Deployment does that for me), disabling it might be an option to reduce some load. The question is how effective that would be, though.

Yes, it's the ownerReferences we set on the dependent resources that lead to them being cleaned up on deletion by the Kubernetes garbage collector. I'm not sure what the impact would be; I was mostly asking because, if it significantly eases the problem, it at least narrows our profiling/performance work down to the dependent watches.

Since all resources except the System only need to be created (or updated up to a certain point) and not continuously reconciled (we are dealing with "pets" rather than "cattle"), I am thinking about skipping any further reconciliation once that point is reached and only restarting reconciliation if the state changes to absent. Would this be a viable approach (aside from the fact that the point at which I skip reconciliation would need to be chosen carefully)?

Yeah, that makes perfect sense to me. No need to run tasks when you don't actually want them to do any work.

EDIT: Is there a way to tell whether it is the first reconciliation, other than checking whether the resources to be created already exist? That way I could skip some additional tasks.

You could use the operator_sdk.util.k8s_status module to write to your custom resource's status once you've successfully reconciled the resource, and check the status field at the beginning of your reconciliation logic.
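
A rough sketch of that pattern, using a made-up status field named provisioned and the ansible_operator_meta variables the SDK injects:

# Early-exit guard at the start of the role/playbook
- name: Read this CR back to inspect its status
  kubernetes.core.k8s_info:
    api_version: ops.com/v1alpha1
    kind: Operation
    name: "{{ ansible_operator_meta.name }}"
    namespace: "{{ ansible_operator_meta.namespace }}"
  register: this_cr

- name: Stop early if a previous run already finished provisioning
  ansible.builtin.meta: end_play
  when: (this_cr.resources | first).get('status', {}).get('provisioned', false)

# ... tasks that create/update the dependent resources go here ...

- name: Record successful provisioning in the CR status
  operator_sdk.util.k8s_status:
    api_version: ops.com/v1alpha1
    kind: Operation
    name: "{{ ansible_operator_meta.name }}"
    namespace: "{{ ansible_operator_meta.namespace }}"
    status:
      provisioned: true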

fabianvf avatar Sep 28 '21 16:09 fabianvf

a way to prioritize new, unreconciled resources over re-reconciliation to fix drift while the operator was non-functional

Interesting idea. We are facing a similar issue with performance, but with fewer CRs. I think something like this could help mitigate part of the effects we see, where the operator is busy reconciling existing CRs for a considerable amount of time. I just don't know the best way to implement it yet. For the moment, we are looking for ways to optimize reconciliation runs by modifying Ansible tasks.

jobcespedes avatar Oct 02 '21 05:10 jobcespedes

We have tried to implement some improvements, mainly by reducing the number of updates. So far we have not seen any significant improvement, with the CPU still pinned at 100%. We tried to reduce max-concurrent-reconciles to 6, but this did not help. One question here: is the number of reconciles per CR or in total, i.e. 6 * 5 CRs or 6 in parallel?

It would really help if there were a way to analyze the amount of resources, or at least the time, that one task takes up.

stiller-leser avatar Oct 04 '21 17:10 stiller-leser

One question here: is the number of reconciles per CR or in total, i.e. 6 * 5 CRs or 6 in parallel?

Checking this line, I believe it is per controller (there is one controller per CRD). Hopefully someone else can confirm.

jobcespedes avatar Oct 04 '21 18:10 jobcespedes

And each CR kind is handled by its own controller? If so, is there a way to see how often each controller reconciles? I am very hesitant to split the operator just to improve performance.

stiller-leser avatar Oct 04 '21 18:10 stiller-leser

The total number of concurrent reconciles is per controller, not per custom resource (CR), in my understanding. In fact, there cannot be more than one reconcile at a time per CR. This is by design, to avoid conflicting changes: a CR only gets one reconcile loop at a time (check this line).

jobcespedes avatar Oct 04 '21 18:10 jobcespedes

It would really help if there were a way to analyze the amount of resources, or at least the time, that one task takes up.

There is one way, using Ansible's profile_tasks callback, which you can enable with the environment variable ANSIBLE_CALLBACK_WHITELIST=profile_tasks (Ansible 2.9) or in ansible.cfg. Then some stats will appear in each CR's artifact directory inside the operator container, e.g. cat /tmp/ansible-operator/runner/m4e.krestomat.io/v1alpha1/M4e/site-sample-us-east1-0-ns/site-sample-us-eas/artifacts/latest/stdout.
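
In the operator Deployment this can be wired up with an environment variable, for example (container name "manager" assumes the default SDK scaffolding):

env:
  - name: ANSIBLE_CALLBACK_WHITELIST   # Ansible 2.9 name; newer ansible-core uses ANSIBLE_CALLBACKS_ENABLED
    value: profile_tasks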

On the topic of Ansible operator performance, there is an interesting related blog post with some tips. A promising improvement could be the upcoming Ansible turbo mode.

jobcespedes avatar Oct 04 '21 18:10 jobcespedes

Just a short update from today: we reduced the --max-concurrent-reconciles flag to 1 (and later made it configurable on a per-kind basis). CPU usage dropped as expected. Next we will run some profiling to see what impact this has on the reconciliation times of our CRs. Since we are no longer pinning the CPU, it looks like we have already reduced the runtime drastically.
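
If it helps others: per-kind worker counts can apparently be set with WORKER_<KIND>_<GROUP> environment variables on the manager container (naming below follows our reading of the SDK docs, so please double-check), e.g.:

env:
  - name: WORKER_SNAPSHOT_OPS_COM     # ops.com/v1alpha1, Kind=Snapshot
    value: "1"
  - name: WORKER_OPERATION_OPS_COM    # ops.com/v1alpha1, Kind=Operation
    value: "2"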

Unfortunately, TURBO_MODE currently only leads to our operator being OOM killed. Not sure why this is the case, but we will pick up the investigation later.

The profile_tasks output was really interesting to see. I need to dig deeper tomorrow though; maybe I'll try to surface this info via an nginx or similar. It would actually be really cool to have the info available at the /metrics endpoint when profile_tasks is activated. I guess the output might be a little hard to parse though...

Thank you all for your tips!

stiller-leser avatar Oct 05 '21 16:10 stiller-leser

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Jan 03 '22 16:01 openshift-bot

/remove-lifecycle stale

jacksgt avatar Jan 06 '22 13:01 jacksgt

@stiller-leser now that turbo mode is released, could you give it another try? OOM errors there would be worth filing a separate issue.
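
(For reference, the kubernetes.core turbo mode should be controllable via the ENABLE_TURBO_MODE environment variable on the operator pod; something like the sketch below, but please double-check against the collection docs.)

env:
  - name: ENABLE_TURBO_MODE
    value: "true"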

asmacdo avatar Jan 17 '22 17:01 asmacdo

Hi @asmacdo, great news. We are testing it as we speak. I will follow up with results as soon as I have some.

stiller-leser avatar Jan 18 '22 13:01 stiller-leser

Hi @asmacdo, returning with this picture:

[screenshot: container memory usage climbing repeatedly until each OOM restart]

That's memory being eaten up by turbo mode (at least we start to see this behaviour as soon as we activate turbo mode). Each drop in the line is an OOM restart of the container.

The last log I see in a dying container is

E0119 11:20:45.945509       8 leaderelection.go:367] Failed to update lock: context deadline exceeded
I0119 11:20:46.422633       8 leaderelection.go:283] failed to renew lease operator/my-operator-leader-elect: timed out waiting for the condition
{"level":"error","ts":1642591246.8350968,"logger":"cmd","msg":"Proxy or operator exited with error.","Namespace":"","error":"leader election lost","stacktrace":"github.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:974\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:902\nmain.main\n\t/workspace/cmd/ansible-operator/main.go:40\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}

Here are the current versions of everything:

ansible-operator version
ansible-operator version: "v1.16.0", commit: "560044140c4f3d88677e4ef2872931f5bb97f255", kubernetes version: "v1.21", go version: "go1.16.13", GOOS: "linux", GOARCH: "amd64"

ansible --version
ansible [core 2.12.1]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/usr/share/ansible/openshift']
  ansible python module location = /usr/local/lib/python3.8/site-packages/ansible
  ansible collection location = /opt/ansible/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.8.8 (default, Aug 11 2021, 06:52:42) [GCC 8.5.0 20210514 (Red Hat 8.5.0-3)]
  jinja version = 3.0.1
  libyaml = True

ansible-galaxy [core 2.12.1]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/usr/share/ansible/openshift']
  ansible python module location = /usr/local/lib/python3.8/site-packages/ansible
  ansible collection location = /opt/ansible/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible-galaxy
  python version = 3.8.8 (default, Aug 11 2021, 06:52:42) [GCC 8.5.0 20210514 (Red Hat 8.5.0-3)]
  jinja version = 3.0.1
  libyaml = True

ansible-galaxy collection list
[DEPRECATION WARNING]: ANSIBLE_CALLBACK_WHITELIST option, normalizing names to new standard, use ANSIBLE_CALLBACKS_ENABLED instead. This feature will be removed from ansible-core in version 2.15. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.

# /opt/ansible/.ansible/collections/ansible_collections
Collection        Version
----------------- -------
cloud.common      2.1.0  
kubernetes.core   2.2.2  
operator_sdk.util 0.3.1  

If I can provide more debug output, I am happy to do so. Just let me know which flags need to be set.

stiller-leser avatar Jan 19 '22 11:01 stiller-leser

Hello Folks,

I was just chatting with @fabianvf the other day and he mentioned this issue.

Is there something we could do with metadata-only caching of the objects? This would probably require some extra work to get right, because of the GETs that the Ansible side does and the subsequent compare (the spec and status will always be different, because the cache will only have an empty object). Given some degree of buy-in, I could write up a proposal?

shawn-hurley avatar Feb 20 '22 02:02 shawn-hurley

@shawn-hurley by all means, we'd love to see a proposal for this. I think it would be valuable. We've been using the operator-framework enhancements repo for most things, but in reality that's probably better suited for things that affect the whole of operator-framework, so a new issue with the proposal might be okay as well. Whichever has the lower barrier to entry.

jmrodri avatar Feb 22 '22 17:02 jmrodri

Hey,

just checking in on the issue to see if there is any plan to look into this. We just tried to enable turbo mode again with the latest Ansible and operator-sdk and ran into an OOM within seven minutes.

If not, we might need to consider re-implementing our operator in Go.

stiller-leser avatar Aug 31 '22 10:08 stiller-leser

So we continued to look into it, and it seems the OOM comes from new /usr/bin/python3.8 -m ansible_collections.cloud.common.plugins.module_utils.turbo.server --fork --socket-path /opt/ansible/.ansible/tmp/turbo_mode.kubernetes.core.socket processes being created until the pod OOMs. It seems their termination isn't fast enough.

stiller-leser avatar Sep 21 '22 09:09 stiller-leser

An strace of such a process (one which does not get killed properly) shows:

strace -p 25927
strace: Process 25927 attached
epoll_wait(3, [], 2, 4698)              = 0
epoll_wait(3, [], 2, 0)                 = 0
unlink("/opt/ansible/.ansible/tmp/turbo_mode.kubernetes.core.socket") = -1 ENOENT (No such file or directory)

The process then continues to run. It seems that the socket appears and disappears quite frequently.

Seems like this is coming from https://github.com/ansible-collections/cloud.common/blob/ebedf03db5c8b42c97cabba130a5fd477767af87/plugins/module_utils/turbo/server.py#L375

stiller-leser avatar Sep 21 '22 11:09 stiller-leser

Adding an if-statement here https://github.com/stiller-leser/cloud.common/blob/main/plugins/module_utils/turbo/server.py#L375 seems to solve the OOM issue. Not sure if that's the proper handling though.

stiller-leser avatar Sep 21 '22 14:09 stiller-leser

Unfortunately, another issue arises: during socket startup, the condition at https://github.com/stiller-leser/cloud.common/blob/main/plugins/module_utils/turbo/common.py#L60 can occur. We've tried increasing the sleep to reduce the chance of it occurring, but unfortunately had no luck.

Would there be a way of falling back to the standard way of calling the module in case the socket can't be created?

stiller-leser avatar Oct 19 '22 08:10 stiller-leser

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Mar 21 '23 01:03 openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot avatar Apr 20 '23 08:04 openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-bot avatar May 21 '23 00:05 openshift-bot

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci[bot] avatar May 21 '23 00:05 openshift-ci[bot]