terrajet icon indicating copy to clipboard operation
terrajet copied to clipboard

Terrajet providers take a long time on cluster with many resources

Open Kasama opened this issue 1 year ago • 2 comments

Hello team!

Me and my team are developing a solution for developers to be able to manage infrastructure resources through an API. This solution relies on crossplane to manage cloud resources as kubernetes CRs.

We've started using terrajet for AWS, but changed to the native provider because of this issue. New we also need to manage other resources for which there are no native providers (namely vault, boundary and datadog) and it would be great to use terrajet.

Existing Context

initial slack thread with problem description

What happened?

The expected behaviour is that terrajet resources time-to-readiness wouldn't depend on the amount of resources that already exist in the cluster.

In reality, the terrajet controller takes roughly 3 * (number of existing resources) * (time it takes for one resource to be reconciled) seconds to consider a new resource ready.

The examples in this issue are using provider-jet-aws, but all terrajet providers have the same behaviour, from what I can see.

Below are the debug logs related to one recently created policy (terrajetloadtestslow). It was created on a provider that is already managing 200 other aws policies.

crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630252994777613e+09       DEBUG   provider-jet-aws        Reconciling     {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630253028353422e+09       DEBUG   provider-jet-aws        refresh ended   {"workspace": "/tmp/636bbd3e-94bb-40ae-acde-f866e095b0e6", "out": "{\"@level\":\"info\",\"@message\":\"Terraform 1.0.5\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:20.558331Z\",\"terraform\":\"1.0.5\",\"type\":\"version\",\"ui\":\"0.1.0\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refreshing state...\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.822179Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"id_key\":\"id\"},\"type\":\"refresh_start\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refresh complete\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.822992Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null}},\"type\":\"refresh_complete\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Drift detected (delete)\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.823374Z\",\"change\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"action\":\"delete\"},\"type\":\"resource_drift\"}\n{\"@level\":\"info\",\"@message\":\"Plan: 0 to add, 0 to change, 0 to destroy.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.823487Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"plan\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Apply complete! Resources: 0 added, 0 changed, 0 destroyed.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.829034Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"apply\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Outputs: 0\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.829081Z\",\"outputs\":{},\"type\":\"outputs\"}\n"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630253059385006e+09       DEBUG   provider-jet-aws        apply ended     {"workspace": "/tmp/636bbd3e-94bb-40ae-acde-f866e095b0e6", "out": "{\"@level\":\"info\",\"@message\":\"Terraform 1.0.5\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.897931Z\",\"terraform\":\"1.0.5\",\"type\":\"version\",\"ui\":\"0.1.0\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Plan to create\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.127798Z\",\"change\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"action\":\"create\"},\"type\":\"planned_change\"}\n{\"@level\":\"info\",\"@message\":\"Plan: 1 to add, 0 to change, 0 to destroy.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.127975Z\",\"changes\":{\"add\":1,\"change\":0,\"remove\":0,\"operation\":\"plan\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Creating...\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.630013Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"action\":\"create\"},\"type\":\"apply_start\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Creation complete after 0s [id=arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow]\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.924980Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"action\":\"create\",\"id_key\":\"id\",\"id_value\":\"arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow\",\"elapsed_seconds\":0},\"type\":\"apply_complete\"}\n{\"@level\":\"info\",\"@message\":\"Apply complete! Resources: 1 added, 0 changed, 0 destroyed.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.932091Z\",\"changes\":{\"add\":1,\"change\":0,\"remove\":0,\"operation\":\"apply\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Outputs: 0\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.932195Z\",\"outputs\":{},\"type\":\"outputs\"}\n"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.663025305947192e+09        DEBUG   provider-jet-aws        Successfully requested creation of external resource {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow", "uid": "636bbd3e-94bb-40ae-acde-f866e095b0e6", "version": "573953565", "external-name": "", "external-name": "arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630253059475644e+09       DEBUG   events  Normal  {"object": {"kind":"Policy","name":"terrajetloadtestslow","uid":"636bbd3e-94bb-40ae-acde-f866e095b0e6","apiVersion":"iam.aws.jet.crossplane.io/v1alpha2","resourceVersion":"573965831"}, "reason": "CreatedExternalResource", "message": "Successfully requested creation of external resource"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.663025826864625e+09        DEBUG   provider-jet-aws        Reconciling     {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630258282078466e+09       DEBUG   provider-jet-aws        refresh ended   {"workspace": "/tmp/636bbd3e-94bb-40ae-acde-f866e095b0e6", "out": "{\"@level\":\"info\",\"@message\":\"Terraform 1.0.5\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:06.895864Z\",\"terraform\":\"1.0.5\",\"type\":\"version\",\"ui\":\"0.1.0\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refreshing state... [id=arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow]\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:08.112383Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"id_key\":\"id\",\"id_value\":\"arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow\"},\"type\":\"refresh_start\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refresh complete [id=arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow]\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:08.197666Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"id_key\":\"id\",\"id_value\":\"arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow\"},\"type\":\"refresh_complete\"}\n{\"@level\":\"info\",\"@message\":\"Plan: 0 to add, 0 to change, 0 to destroy.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:08.198356Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"plan\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Apply complete! Resources: 0 added, 0 changed, 0 destroyed.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:08.203956Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"apply\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Outputs: 0\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:08.203989Z\",\"outputs\":{},\"type\":\"outputs\"}\n"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630258282080562e+09       DEBUG   provider-jet-aws        External resource is up to date {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow", "uid": "636bbd3e-94bb-40ae-acde-f866e095b0e6", "version": "573965832", "external-name": "arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow", "requeue-after": 1663025888.2080543}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630263118990378e+09       DEBUG   provider-jet-aws        Reconciling     {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630263132372632e+09       DEBUG   provider-jet-aws        refresh ended   {"workspace": "/tmp/636bbd3e-94bb-40ae-acde-f866e095b0e6", "out": "{\"@level\":\"info\",\"@message\":\"Terraform 1.0.5\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:11.952075Z\",\"terraform\":\"1.0.5\",\"type\":\"version\",\"ui\":\"0.1.0\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refreshing state... [id=arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow]\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:13.135934Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"id_key\":\"id\",\"id_value\":\"arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow\"},\"type\":\"refresh_start\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refresh complete [id=arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow]\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:13.226661Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"id_key\":\"id\",\"id_value\":\"arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow\"},\"type\":\"refresh_complete\"}\n{\"@level\":\"info\",\"@message\":\"Plan: 0 to add, 0 to change, 0 to destroy.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:13.227599Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"plan\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Apply complete! Resources: 0 added, 0 changed, 0 destroyed.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:13.232695Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"apply\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Outputs: 0\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:13.232734Z\",\"outputs\":{},\"type\":\"outputs\"}\n"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630263132466455e+09       DEBUG   provider-jet-aws        External resource is up to date {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow", "uid": "636bbd3e-94bb-40ae-acde-f866e095b0e6", "version": "573976982", "external-name": "arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow", "requeue-after": 1663026373.2466433}

From these logs we can also see the time it took for the resource to be Ready. From 23:28:19 to 23:45:13, ~17 minutes. (using GMT time to make it easier)

It should also be noted that the resource is created in the cloud provider at 23:28:25 (12 seconds), when the logs say Successfully requested creation of external resource, but it takes a long time until the resource is marked as Ready in kubernetes

The reason why it takes so long seems to be that resource change events are not given any priority over the normal reconciliation loop. The controller stays in an endless loop reconciling every resource it knows about, which is necessary to override external changes in the providers, but any resource that is added or changed in kubernetes doesn't get immediately picked up. Rather it seems to be added to the queue and only seen after all existing reconcile loops are done, then the resource is picked up, and a creation flow starts, but the result will only be seen on the next reconciliation loop. Once that comes, terrajet will pick up the resource's status from the terraform state and requeue once more, only after this last requeue the status will be updated to Ready.

It's also worth noting that changing the MaxConcurrentReconciles configuration to something bigger, like 5 or 10 reduces the time by that factor, but the underlying behavior persists and all resources are reconciled before considering new resource or updates Ready.

What I'm having trouble understanding is how this terrajet workflow differs from the native provider-aws one, as both seem to use the same Reconcile function from crossplane-runtime.

Also, the UseAsync configuration for a resource doesn't seem to change anything. I've tried both for policies.iam and for repositories.ecr with the same behavior.


How can we reproduce it?

  • Create a new kubernetes cluster (with kind or in the cloud).
  • Install crossplane
  • Install a terrajet provider (I'll use AWS because it can be easily compared with the native provider-aws)
  • Create a handful of resources to be managed by terrajet (I'll use ECR because it is quickly created and dont incur costs)
  • Wait until all resources are created and ready it will take some minutes, but a burst of resources is expected to take a bit. Althought it does take much longer than provider-aws for the same resource.
  • Create one more resource to be managed by terrajet

The last step will take a long time, which is the problem this bug report is about.

Open collapsible for reproducible commands
# Create kind cluster
kind create cluster --name terrajet-load --image kindest/node:v1.23.10

# Install crossplane with helm
kubectl create namespace crossplane-system
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update
helm upgrade --install crossplane --namespace crossplane-system crossplane-stable/crossplane

# Install AWS terrajet provider
kubectl apply -f - <<YAML
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: crossplane-provider-jet-aws
spec:
  controllerConfigRef:
    name: config
  package: crossplane/provider-jet-aws:v0.5.0
---
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: config
spec:
  args:
  - --debug
YAML
## use your aws credentials in the secret, if you have a custom way to interact with AWS, change the credentials key of this secret.
kubectl create secret generic -n crossplane-system --from-file=credentials=$HOME/.aws/credentials aws-credentials
kubectl apply -f - <<YAML
apiVersion: aws.jet.crossplane.io/v1alpha1
kind: ProviderConfig
metadata:
  finalizers:
  - in-use.crossplane.io
  name: default
spec:
  credentials:
    secretRef:
      key: credentials
      name: aws-credentials
      namespace: crossplane-system
    source: Secret
YAML

# Create a handful of resources managed by terrajet. I chose policies because they are created near-instantly in the cloud and don't incur costs
## Note: seq on MacOS doesn't seem to support the -w flag, it can be removed safely below
for n in $(seq -w 200); do
  echo "---"
  sed "s/NUMBER/$n/" <<YAML
apiVersion: iam.aws.jet.crossplane.io/v1alpha2
kind: Policy
metadata:
  name: terrajetloadtestNUMBER
spec:
  forProvider:
    name: terrajetloadtestNUMBER
    policy: |
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Deny",
            "Action": "*",
            "Resource": ["*"]
          }
        ]
      }
  providerConfigRef:
    name: default
YAML
done | kubectl apply -f -

# Wait until all are ready, it took about 15 minutes for me
kubectl get policies.iam.aws.jet.crossplane.io

# Create one more resource
kubectl apply -f - <<YAML
apiVersion: iam.aws.jet.crossplane.io/v1alpha2
kind: Policy
metadata:
  name: terrajetloadtestslow
spec:
  forProvider:
    name: terrajetloadtestslow
    policy: |
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Deny",
            "Action": "*",
            "Resource": ["*"]
          }
        ]
      }
  providerConfigRef:
    name: default
YAML

# It takes a long time for the resource to become ready.
kubectl get policies.iam.aws.jet.crossplane.io terrajetloadtestslow

########

# Cleanup cloud resources
kubectl delete policy.iam.aws.jet $(kubectl get policy.iam.aws.jet | grep terrajetloadtest | awk '{print $1}')

# Delete kind cluster
kind destroy cluster --name terrajet-load

Possibly related discussions


Please let me know if I can help figure out what's happening.

Kasama avatar Sep 13 '22 01:09 Kasama

Hello team. I want to update this issue with my latest findings.

After a while debugging and understanding a bit better how everything works, I figured the main issue seems to be that terrajet providers, in general, don't work fast enough to be able to consume the controller's workqueue faster than it's produced. Which causes the delays we're seeing as, for any change, the controller needs to go through all the workqueue before it can get to the recent change.

I've setup two side-by-side AWS providers, one using terrajet and another using provider-aws. Then I've setup a prometheus and grafana to collect the metrics exposed by the controller manager.

Open for more details on the experiment setup

Experiment environment

  • I ran the experiments in my local machine with a 12 threads CPU (Ryzen 5600X) and 16GB of ram. None were topped during the tests.
  • I created a kind cluster to run the experiments on (similar to the main message setup).
  • I ran each provider directly on my machine (from the ./_output/bin/linux_amd64/provider build artifact)
    • to allow for both providers to run at the same time, I've added the MetricsBindAddress option to the controller manager so both could expose metrics. I've used port 8080 (default) for the jet provider and 8081 for the native one
  • Unless specified, the providers were executed without additional flags or code changes
  • Relevant environment variables for jet provider:
    • TF_PLUGIN_CACHE_DIR=/tmp/terraform/.terraform to improve provider download performance
    • TERRAFORM_VERSION=1.0.5
    • TERRAFORM_PROVIDER_SOURCE=hashicorp/aws
    • TERRAFORM_PROVIDER_VERSION=3.56.0

Metrics Setup

The grafana dashboard was created pointing to the prometheus instance

Configuration files

Prometheus.yaml

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
 
  external_labels:
      monitor: 'monitor'
 
rule_files: []
 
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
 
  - job_name: 'jet-provider'
    static_configs:
      - targets: ['localhost:8080']
 
  - job_name: 'native-provider'
    static_configs:
      - targets: ['localhost:8081']

Grafana dashboard:

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "target": {
          "limit": 100,
          "matchAny": false,
          "tags": [],
          "type": "dashboard"
        },
        "type": "dashboard"
      },
      {
        "datasource": {
          "type": "datasource",
          "uid": "grafana"
        },
        "enable": true,
        "iconColor": "blue",
        "name": "Annotations",
        "target": {
          "limit": 1000,
          "refId": "Anno",
          "tags": [
            "annotation"
          ],
          "type": "tags"
        }
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": 2,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "collapsed": false,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 10,
      "panels": [],
      "title": "Controller",
      "type": "row"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "cdbCIg4Vk"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "s"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 1
      },
      "id": 6,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "cdbCIg4Vk"
          },
          "editorMode": "builder",
          "expr": "rate(controller_runtime_reconcile_time_seconds_sum{controller=~\"$controllers\"}[$__rate_interval])",
          "hide": true,
          "legendFormat": "__auto",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "cdbCIg4Vk"
          },
          "editorMode": "builder",
          "expr": "controller_runtime_max_concurrent_reconciles{controller=~\"$controllers\"}",
          "hide": true,
          "legendFormat": "__auto",
          "range": true,
          "refId": "B"
        },
        {
          "datasource": {
            "name": "Expression",
            "type": "__expr__",
            "uid": "__expr__"
          },
          "expression": "$A/$B",
          "hide": false,
          "refId": "C",
          "type": "math"
        }
      ],
      "title": "Controller Reconcile Time",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "cdbCIg4Vk"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "max": 10,
          "min": -1,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 1
      },
      "id": 12,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "cdbCIg4Vk"
          },
          "editorMode": "builder",
          "expr": "controller_runtime_reconcile_errors_total{controller=~\"$controllers\"}",
          "legendFormat": "__auto",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Reconcile Errors",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "cdbCIg4Vk"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "s"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 9
      },
      "id": 16,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "cdbCIg4Vk"
          },
          "editorMode": "builder",
          "expr": "histogram_quantile(0.95, sum by(le, controller) (rate(controller_runtime_reconcile_time_seconds_bucket{controller=~\"$controllers\"}[$__rate_interval])))",
          "hide": true,
          "legendFormat": "__auto",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "cdbCIg4Vk"
          },
          "editorMode": "builder",
          "expr": "controller_runtime_max_concurrent_reconciles{controller=~\"$controllers\"}",
          "hide": true,
          "legendFormat": "__auto",
          "range": true,
          "refId": "B"
        },
        {
          "datasource": {
            "name": "Expression",
            "type": "__expr__",
            "uid": "__expr__"
          },
          "expression": "$A/$B",
          "hide": false,
          "refId": "C",
          "type": "math"
        }
      ],
      "title": "95th percentile reconcile time",
      "type": "timeseries"
    },
    {
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 17
      },
      "id": 8,
      "title": "Queue",
      "type": "row"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "cdbCIg4Vk"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "ms"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 18
      },
      "id": 4,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "cdbCIg4Vk"
          },
          "editorMode": "builder",
          "expr": "rate(workqueue_queue_duration_seconds_sum{name=~\"$controllers\"}[$__rate_interval])",
          "legendFormat": "__auto",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Workqueue queue duration",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "cdbCIg4Vk"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 18
      },
      "id": 2,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "cdbCIg4Vk"
          },
          "editorMode": "builder",
          "expr": "workqueue_depth{name=~\"$controllers\"}",
          "legendFormat": "__auto",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Workqueue depth",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "cdbCIg4Vk"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 26
      },
      "id": 14,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "cdbCIg4Vk"
          },
          "editorMode": "builder",
          "expr": "rate(workqueue_adds_total{name=~\"$controllers\"}[$__rate_interval])",
          "legendFormat": "__auto",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Rate of Queue Adds",
      "type": "timeseries"
    }
  ],
  "refresh": false,
  "schemaVersion": 37,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": [
      {
        "allValue": "managed/.*",
        "current": {
          "selected": false,
          "text": [
            "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy",
            "managed/policy.iam.aws.crossplane.io"
          ],
          "value": [
            "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy",
            "managed/policy.iam.aws.crossplane.io"
          ]
        },
        "datasource": {
          "type": "prometheus",
          "uid": "cdbCIg4Vk"
        },
        "definition": "controller_runtime_active_workers",
        "description": "Controllers to watch. Regex",
        "hide": 0,
        "includeAll": true,
        "label": "controller",
        "multi": true,
        "name": "controllers",
        "options": [],
        "query": {
          "query": "controller_runtime_active_workers",
          "refId": "StandardVariableQuery"
        },
        "refresh": 1,
        "regex": ".*controller=\"(managed/.*?)\".*",
        "skipUrlSync": false,
        "sort": 0,
        "type": "query"
      }
    ]
  },
  "time": {
    "from": "2022-10-03T04:52:47.659Z",
    "to": "2022-10-03T05:09:38.612Z"
  },
  "timepicker": {},
  "timezone": "",
  "title": "Crossplane Controllers AWS",
  "uid": "-WIVl0VVk",
  "version": 2,
  "weekStart": ""
}

In the following graphs, the green lines represent the jet provider, while the yellow line represents the native one.

Initially I've created 100 resources managed by each provider (first vertical blue line). There, we can see that the reconciliation time for the terrajet provider is much higher than the native provider. After some time I've changed the jet provider to have a MaxConcurrentReconciles: 10 in the configs (second blue line).

reconciliation time on the first section (after the first vertical blue line), the reconciliation time is about ~14.8s for the jet provider and ~84ms for the native provider.

on the second section (after the second vertical blue line), the reconcilicaiton time is about ~997ms for the jet provider and ~84ms for the native provider. workqueue depth on the first section (after the first vertical blue line), the workqueue depth is about 100 items for the jet provider and 0 items for the native provider.

on the second section (after the second vertical blue line), the workqueue depth stabilizes at about 25 items for the jet provider and 0 for the native provider.


This left the question of "What if we bump the number of MaxConcurrentReconciles further". Bumping it to 50 we see that the reconcile time didn't change, but the number of items in the queue were reduced to 0 on the terrajet provider as well. reconcile time with 50 workers depth with 50 reconcilers


This, however, is not maintained when we create even more resources. When jumping to 500 policies on each provider, we see that the reconcile time is, again, not affected. But the workqueue size is significantly affected and cannot go below 500 for the terrajet provider, while the native provider is momentarily affected by the change, but is able to catch up and clear the queue. reconcile time 500 resources workqueue


To make sure that the reconcile time was the main culprit here, I've also locally changed the Observe method of the native provider to include a time.Sleep(1 * time.Second) and it had the same queue behavior as the terrajet provider. Which reflected in the same long time to reconcile new resources we saw in terrajet-based providers.


I'm curious to see which solutions we can brew up to this.

For now it seems that terrajet-based providers are not suitable for environments that need to manage many resources of the same type.

Kasama avatar Oct 06 '22 13:10 Kasama

I cross-posted this issue in upjet as it will (as far as I understand) replace terrajet in the future.

Kasama avatar Oct 21 '22 00:10 Kasama