awx-resource-operator

Set max concurrent reconciles flag in Dockerfile

Open · rooftopcellist opened this issue 2 years ago · 3 comments

Set the max concurrent reconciles flag in the Dockerfile, because setting it in the args section of config/manager/manager.yaml didn't work. I set the default low, to 2, but set ansiblejob and ansibleworkflow higher, at 3 each. This throttling should help us avoid overwhelming the operator container. A rough sketch of the change is included after the list below.

  • Set max concurrent ansiblejob to 3
  • Set max concurrent ansibleworkflow to 3
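
For reference, roughly what the Dockerfile change could look like. This is a sketch, not the actual diff: the per-kind env var names assume the ansible-operator WORKER_<KIND>_<GROUP> convention and the tower.ansible.com API group, and the entrypoint shown is the stock ansible-operator one with the flag appended.

    # Sketch only. Per-kind worker overrides via the assumed WORKER_<KIND>_<GROUP> convention.
    ENV WORKER_ANSIBLEJOB_TOWER_ANSIBLE_COM=3
    ENV WORKER_ANSIBLEWORKFLOW_TOWER_ANSIBLE_COM=3
    # Default for all other watched kinds via the flag.
    ENTRYPOINT ["/usr/local/bin/ansible-operator", "run", "--watches-file=./watches.yaml", "--max-concurrent-reconciles=2"]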

Follow up for https://github.com/ansible/awx-resource-operator/pull/150

Users can increase these values by setting new env vars on the Subscription or Deployment for the operator.
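
As a sketch of that override path, OLM's Subscription spec.config.env injects env vars into the operator Deployment. The Subscription name, namespace, and variable names below are assumptions; other Subscription fields are omitted.

    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: awx-resource-operator   # name assumed
      namespace: awx                # namespace assumed
    spec:
      # channel/source fields omitted
      config:
        env:
          - name: WORKER_ANSIBLEJOB_TOWER_ANSIBLE_COM        # names assumed
            value: "6"
          - name: WORKER_ANSIBLEWORKFLOW_TOWER_ANSIBLE_COM
            value: "6"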

rooftopcellist · Nov 11 '23, 03:11

I want to do some testing with and without this change at scale before merging. I think our original assumption that the number of concurrent jobs was not being throttled may have been wrong based on this:

The --max-concurrent-reconciles flag can be used to override the default max concurrent reconciles, which by default is the number of CPUs on the node on which the operator is running.

  • https://sdk.operatorframework.io/docs/building-operators/helm/reference/advanced_features/max_concurrent_reconciles/
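
For context, the args-based approach in config/manager/manager.yaml that did not seem to take effect would look roughly like this (container name and surrounding spec are assumptions, other fields omitted):

    # Sketch of the manager Deployment container spec
    containers:
      - name: manager            # container name assumed
        args:
          # ...existing args...
          - "--max-concurrent-reconciles=3"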

rooftopcellist · Nov 14 '23, 21:11

@rooftopcellist we can sync up on this another time, but this line makes me concerned: "default is the number of CPUs on the node on which the operator is running." I don't think we were seeing that behavior when we tested without max_reconciles set.

rebeccahhh · Nov 15 '23, 19:11

@rebeccahhh afaik, it is not basing that off of the requests/limits for the resource operator pod; from what I can tell, it is instead based on the number of CPUs on the node the resource operator pod is scheduled on.

And for that, it seems to be correct:

$ oc get node aap-dev-8scgk-worker-a-9v8vw -o yaml | grep cpu:
    cpu: 3500m
    cpu: "4"

We were seeing 4 workers get set for each resource by default. I think it does this because the number of parallel workers you can have depends more on the number of CPU cores than on the capacity of each core (or, in our case, the capacity allocated to the pod).
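
A quick way to sanity-check that, assuming nproc is available in the operator image and the deployment name below is right (I have not verified either):

    # Should report the node's 4 CPUs, not the pod's CPU limit.
    $ oc exec deploy/resource-operator-controller-manager -- nproc
    4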

rooftopcellist · Nov 18 '23, 06:11