
Delayed freight discovery

Open · wmiller112 opened this issue 1 year ago

Checklist

  • [x] I've searched the issue queue to verify this is not a duplicate bug report.
  • [x] I've included steps to reproduce the bug.
  • [x] I've pasted the output of kargo version.
  • [x] I've pasted logs, if applicable.

Description

Frequently I've found that new image tags matching the specified warehouse configuration are available, but they are not discovered in a timely manner. The warehouse is configured with a spec.interval of 2m0s, but with debug logs enabled I see the warehouse reconcile loop running far less frequently, and the timing is often erratic. Clicking the refresh button also does not reliably trigger the check for new artifacts.

Steps to Reproduce

  • Create a warehouse with a spec similar to:
spec:
  freightCreationPolicy: Automatic
  interval: 2m0s
  subscriptions:
    - git:
        branch: main
        commitSelectionStrategy: NewestFromBranch
        discoveryLimit: 20
        includePaths:
          - regex:path/to/manifests/.*
        repoURL: https://github.com/<org>/<repo>
    - image:
        allowTags: ^production-[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}-[0-9]{2}-[0-9]{2}Z-[0-9a-f]{40}$
        discoveryLimit: 20
        gitRepoURL: https://github.com/<org>/<repo>
        imageSelectionStrategy: Lexical
        repoURL: xxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/<repo>
  • Push a tag matching the allowTags pattern
  • Monitor the controller's debug logs to observe that the refresh does not happen at the specified interval
  • Click refresh on the warehouse and confirm the annotation is added with an updated timestamp
  • Continue to monitor the debug logs to see that, frequently, a refresh does not happen within the specified interval of clicking refresh
  • Eventually a reconcile loop runs, the new tag is discovered, and the rest of the process works as expected

Version

Client Version: v0.8.4
Server Version: v0.8.4

Logs

time="2024-09-18T15:48:08Z" level=debug msg="reconciling Warehouse" namespace=<project> warehouse=sources
time="2024-09-18T15:48:08Z" level=debug msg="obtained credentials for git repo" namespace=<project>repo="https://github.com/<org>/<repo>" warehouse=sources
time="2024-09-18T15:48:20Z" level=debug msg="discovered commits" count=20 namespace=<project> repo="https://github.com/<org>/<repo>" warehouse=sources
time="2024-09-18T15:48:20Z" level=debug msg="Controller IAM role is not authorized to assume project-specific role or project-specific role is not authorized to obtain an ECR auth token. Falling back to using controller's IAM role directly." awsAccountID=xxxxxxxxxx awsRegion=us-west-2 namespace=l<project> project=l<project>warehouse=sources
time="2024-09-18T15:48:20Z" level=debug msg="got ECR authorization token" awsAccountID=xxxxxxxxxx awsRegion=us-west-2 namespace=<project> project=<project> warehouse=sources
time="2024-09-18T15:48:20Z" level=debug msg="obtained credentials for image repo" namespace=<project>repo=xxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/<repo> warehouse=sources
time="2024-09-18T15:48:23Z" level=debug msg="discovered images" count=20 namespace=<project> repo=xxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/<repo> warehouse=sources
time="2024-09-18T15:48:23Z" level=debug msg="discovered latest artifacts" namespace=<project> warehouse=sources
time="2024-09-18T15:48:23Z" level=debug msg="done reconciling Warehouse" namespace=<project>warehouse=sources
...
time="2024-09-18T16:04:26Z" level=debug msg="reconciling Warehouse" namespace=<project> warehouse=sources
time="2024-09-18T16:04:26Z" level=debug msg="obtained credentials for git repo" namespace=<project>repo="https://github.com/<org>/<repo>" warehouse=sources
time="2024-09-18T16:04:39Z" level=debug msg="discovered commits" count=20 namespace=<project> repo="https://github.com/<org>/<repo>" warehouse=sources
time="2024-09-18T16:04:39Z" level=debug msg="obtained credentials for image repo" namespace=<project> repo=717232957798.dkr.ecr.us-west-2.amazonaws.com/<repo> warehouse=sources
time="2024-09-18T16:04:43Z" level=debug msg="discovered images" count=20 namespace=<project> repo=717232957798.dkr.ecr.us-west-2.amazonaws.com/<repo> warehouse=sources
time="2024-09-18T16:04:43Z" level=debug msg="discovered latest artifacts" namespace=<project> warehouse=sources
time="2024-09-18T16:04:43Z" level=debug msg="created Freight" freight=7a4a82965db54bf6d231bf6bfb68f13f39dcd9ac namespace=<project> warehouse=sources
time="2024-09-18T16:04:43Z" level=debug msg="done reconciling Warehouse" namespace=<project> warehouse=sources

wmiller112 avatar Sep 18 '24 16:09 wmiller112

Looking at this a bit more, I do see the following in the api pod logs, specifically when a manual refresh is triggered.

time="2024-10-01T18:35:17Z" level=error msg="finished streaming call" connect.code=internal connect.duration=3.79452559s connect.method=WatchWarehouses connect.service=akuity.io.kargo.service.v1alpha1.KargoService connect.start_time="2024-10-01T18:35:13Z" error="internal: context canceled"
time="2024-10-01T18:35:17Z" level=error msg="finished streaming call" connect.code=internal connect.duration=3.791075582s connect.method=WatchStages connect.service=akuity.io.kargo.service.v1alpha1.KargoService connect.start_time="2024-10-01T18:35:13Z" error="internal: context canceled"

I figured it was a resource limitation, but the controller seems to use as much memory as I give it: it was at 2Gi when I opened this and is currently at 10Gi. The interval refreshes seem a lot more consistent with more memory, but the above error appears any time a manual refresh is triggered. After a manual refresh, I sometimes observe that interval refreshes stop working as well until I manually remove the refresh annotation.

wmiller112 avatar Oct 01 '24 18:10 wmiller112

How many tags in total are there in the targeted image repository?

hiddeco avatar Oct 01 '24 20:10 hiddeco

The largest repository has about 1,100 total tags, ~60 of which match the filter, and ~1,500 images total (some untagged); policies keep it around this number pretty consistently. I find that it happens for projects whose warehouses have a very small number of tags as well, though.

wmiller112 avatar Oct 01 '24 20:10 wmiller112

Those API server logs do not appear related to this in any way. They look to me like the UI's client disconnecting abruptly.

krancour avatar Oct 02 '24 04:10 krancour

I've updated image retention. The max tag count in any repo Kargo uses is now 500, with around 30 matching in any repo, but the issue persists.

wmiller112 avatar Oct 09 '24 17:10 wmiller112

I would be very surprised if this weren't simply discovery taking a very long time, with rate limiting possibly being a factor. Out of curiosity, are you using one of the public registries?

krancour avatar Oct 10 '24 02:10 krancour

Can you share an (anonymized) Warehouse that presents this issue? Trying to see if we can narrow this down, as the heaviness of the indexing depends partly on the selectors you use.

hiddeco avatar Oct 10 '24 13:10 hiddeco

All images come from ECR. I do agree, it feels like a rate limit: it will sometimes discover newly tagged images immediately, while other times it takes 40 minutes. But I've reviewed the ECR rate limits in Service Quotas, and nothing stands out (the highest is ~10 req/s). I also checked the GitHub App rate limits, thinking maybe it was discovering the image but not the manifests, but those are all very low as well.

Here is the warehouse definition:

apiVersion: kargo.akuity.io/v1alpha1
kind: Warehouse
metadata:
  name: sources
  namespace: <project>
spec:
  freightCreationPolicy: Automatic
  interval: 5m0s
  subscriptions:
    - git:
        branch: main
        commitSelectionStrategy: NewestFromBranch
        discoveryLimit: 20
        includePaths:
          - regex:<app-manifest-path>/resources/.*
          - regex:<app-manifest-path>/services/.*
          - regex:<app-manifest-path>/charts/.*
        repoURL: https://github.com/<org>/<repo>
    - image:
        allowTags: >-
          ^production-[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}-[0-9]{2}-[0-9]{2}Z-[0-9a-f]{40}$
        discoveryLimit: 20
        gitRepoURL: https://github.com/<org>/<repo>        
        imageSelectionStrategy: Lexical
        repoURL: <account-id>.dkr.ecr.us-west-2.amazonaws.com/<repo>

The includePaths for the git subscription are all scoped to only the manifests related to the applications managed by the stages that consume freight from this warehouse.

wmiller112 avatar Oct 10 '24 16:10 wmiller112

If it were the controller taking a long time to discover matching image tags, or being rate limited, I'd expect to see the "obtained credentials for image repo" log for these warehouses when a new controller comes up, followed by a hang. When I delete the controller pod and a new one comes up, that log often does not appear, and the affected/delayed warehouse continues to not get the latest freight. Whenever I do finally see that log, the freight becomes available almost immediately.

wmiller112 avatar Oct 15 '24 16:10 wmiller112

> When I delete the controller pod and a new one comes up, that log often does not appear

Warehouses are listed and enqueued for reconciliation on startup, of course, but the logic that checks whether spec.interval has elapsed since the last discovery still applies. So most Warehouse reconciliations are going to short-circuit on startup.
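
For illustration, the short-circuit being described might look roughly like this (a minimal sketch with made-up names, not Kargo's actual reconciler code):

```go
// Minimal, illustrative sketch of the startup short-circuit described above.
// Names and structure are invented for illustration; this is not Kargo's
// actual code.
package main

import (
	"fmt"
	"time"
)

// reconcileWarehouse decides whether artifact discovery should run for a
// Warehouse, given when discovery last completed and the configured
// spec.interval. On controller startup every Warehouse is enqueued, but most
// of those reconciliations return early here because the interval has not
// yet elapsed since the last discovery.
func reconcileWarehouse(lastDiscovery time.Time, interval time.Duration) string {
	if time.Since(lastDiscovery) < interval {
		// Short-circuit: nothing to do yet; the Warehouse is simply
		// requeued to be reconciled again later.
		return "short-circuited; requeued"
	}
	// ...discover commits and images, create Freight if anything new is found...
	return "ran discovery"
}

func main() {
	// e.g. the controller pod restarts one minute after the last discovery,
	// with interval: 5m0s
	fmt.Println(reconcileWarehouse(time.Now().Add(-time.Minute), 5*time.Minute))
}
```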

krancour avatar Oct 15 '24 21:10 krancour

Looking more at the logic, I see this possibility:

  1. It's almost time to discover new Freight based on time elapsed since last discovery, but not quite.

  2. The Warehouse gets requeued for reconciliation in another spec.interval.

The end result could be up to 2 * spec.interval elapsing between discoveries.
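
Concretely, with interval: 5m0s, if a reconciliation fires 4m59s after the last discovery it short-circuits and, if it is then requeued for another full 5m, the next discovery lands roughly 9m59s after the previous one. A hedged sketch of a fix along those lines, extending the example above by requeueing for only the remaining time rather than a full interval (illustrative only, not the actual patch):

```go
// requeueAfter returns how long to wait before reconciling the Warehouse
// again: if discovery is not yet due, wait only the remaining time, so the
// gap between discoveries cannot stretch toward 2 * spec.interval.
func requeueAfter(lastDiscovery time.Time, interval time.Duration) time.Duration {
	if remaining := interval - time.Since(lastDiscovery); remaining > 0 {
		return remaining // almost due: come back exactly when the interval elapses
	}
	return interval // discovery just ran: wait a full interval before the next one
}
```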

Fix incoming...

krancour avatar Oct 15 '24 21:10 krancour

Unfortunately I seem to be experiencing the same behavior in 1.0.1. Automatic freight discovery is frequently taking 4x or more of the spec.interval, and manually triggered refreshes are unresponsive.

wmiller112 avatar Oct 23 '24 04:10 wmiller112