Delayed freight discovery
Checklist
- [x] I've searched the issue queue to verify this is not a duplicate bug report.
- [x] I've included steps to reproduce the bug.
- [x] I've pasted the output of `kargo version`.
- [x] I've pasted logs, if applicable.
Description
I frequently find that new image tags matching the warehouse configuration are available but are not discovered in a timely manner. The warehouse is configured with a `spec.interval` of 2m0s, but with debug logs enabled I can see that the warehouse reconcile loop runs far less often than that, and seemingly at random. Clicking the refresh button also does not reliably trigger a check for new artifacts.
Steps to Reproduce
- Create a warehouse with a spec similar to:

  ```yaml
  spec:
    freightCreationPolicy: Automatic
    interval: 2m0s
    subscriptions:
    - git:
        branch: main
        commitSelectionStrategy: NewestFromBranch
        discoveryLimit: 20
        includePaths:
        - regex:path/to/manifests/.*
        repoURL: https://github.com/<org>/<repo>
    - image:
        allowTags: ^production-[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}-[0-9]{2}-[0-9]{2}Z-[0-9a-f]{40}$
        discoveryLimit: 20
        gitRepoURL: https://github.com/<org>/<repo>
        imageSelectionStrategy: Lexical
        repoURL: xxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/<repo>
  ```

- Push a tag matching the `allowTags` pattern.
- Monitor debug logs in the controller to observe that the refresh does not happen at the specified `interval`.
- Click refresh on the warehouse and confirm the annotation is added with an updated timestamp (a programmatic equivalent is sketched below).
- Continue to monitor debug logs to see that, frequently, a refresh does not happen within the specified `interval` of clicking refresh.
- Eventually a reconcile loop runs, the new tag is discovered, and the rest of the process works as expected.
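For reference, here is a minimal sketch of triggering that same manual refresh programmatically, assuming the refresh button works by setting a `kargo.akuity.io/refresh` annotation with the current timestamp. The annotation key and the `warehouses` resource name are assumptions on my part; the group/version match the Warehouse API (`kargo.akuity.io/v1alpha1`).

```go
// refresh.go - a minimal sketch of requesting a Warehouse refresh by patching
// an annotation, assuming that is what the UI's refresh button does.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	gvr := schema.GroupVersionResource{
		Group:    "kargo.akuity.io",
		Version:  "v1alpha1",
		Resource: "warehouses", // assumed plural resource name
	}
	// Assumed annotation key; verify against what the UI actually sets.
	patch := fmt.Sprintf(
		`{"metadata":{"annotations":{"kargo.akuity.io/refresh":%q}}}`,
		time.Now().UTC().Format(time.RFC3339),
	)
	// "<project>" and "sources" are placeholders matching the anonymized names above.
	_, err = dyn.Resource(gvr).Namespace("<project>").Patch(
		context.Background(), "sources", types.MergePatchType,
		[]byte(patch), metav1.PatchOptions{},
	)
	if err != nil {
		panic(err)
	}
}
```

This at least makes it easy to script the repro instead of clicking in the UI.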
Version
```
Client Version: v0.8.4
Server Version: v0.8.4
```
Logs
time="2024-09-18T15:48:08Z" level=debug msg="reconciling Warehouse" namespace=<project> warehouse=sources
time="2024-09-18T15:48:08Z" level=debug msg="obtained credentials for git repo" namespace=<project>repo="https://github.com/<org>/<repo>" warehouse=sources
time="2024-09-18T15:48:20Z" level=debug msg="discovered commits" count=20 namespace=<project> repo="https://github.com/<org>/<repo>" warehouse=sources
time="2024-09-18T15:48:20Z" level=debug msg="Controller IAM role is not authorized to assume project-specific role or project-specific role is not authorized to obtain an ECR auth token. Falling back to using controller's IAM role directly." awsAccountID=xxxxxxxxxx awsRegion=us-west-2 namespace=l<project> project=l<project>warehouse=sources
time="2024-09-18T15:48:20Z" level=debug msg="got ECR authorization token" awsAccountID=xxxxxxxxxx awsRegion=us-west-2 namespace=<project> project=<project> warehouse=sources
time="2024-09-18T15:48:20Z" level=debug msg="obtained credentials for image repo" namespace=<project>repo=xxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/<repo> warehouse=sources
time="2024-09-18T15:48:23Z" level=debug msg="discovered images" count=20 namespace=<project> repo=xxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/<repo> warehouse=sources
time="2024-09-18T15:48:23Z" level=debug msg="discovered latest artifacts" namespace=<project> warehouse=sources
time="2024-09-18T15:48:23Z" level=debug msg="done reconciling Warehouse" namespace=<project>warehouse=sources
...
time="2024-09-18T16:04:26Z" level=debug msg="reconciling Warehouse" namespace=<project> warehouse=sources
time="2024-09-18T16:04:26Z" level=debug msg="obtained credentials for git repo" namespace=<project>repo="https://github.com/<org>/<repo>" warehouse=sources
time="2024-09-18T16:04:39Z" level=debug msg="discovered commits" count=20 namespace=<project> repo="https://github.com/<org>/<repo>" warehouse=sources
time="2024-09-18T16:04:39Z" level=debug msg="obtained credentials for image repo" namespace=<project> repo=717232957798.dkr.ecr.us-west-2.amazonaws.com/<repo> warehouse=sources
time="2024-09-18T16:04:43Z" level=debug msg="discovered images" count=20 namespace=<project> repo=717232957798.dkr.ecr.us-west-2.amazonaws.com/<repo> warehouse=sources
time="2024-09-18T16:04:43Z" level=debug msg="discovered latest artifacts" namespace=<project> warehouse=sources
time="2024-09-18T16:04:43Z" level=debug msg="created Freight" freight=7a4a82965db54bf6d231bf6bfb68f13f39dcd9ac namespace=<project> warehouse=sources
time="2024-09-18T16:04:43Z" level=debug msg="done reconciling Warehouse" namespace=<project> warehouse=sources
Looking at this a bit more, I do see the following in the api pod logs, specifically when a manual refresh is triggered.
time="2024-10-01T18:35:17Z" level=error msg="finished streaming call" connect.code=internal connect.duration=3.79452559s connect.method=WatchWarehouses connect.service=akuity.io.kargo.service.v1alpha1.KargoService connect.start_time="2024-10-01T18:35:13Z" error="internal: context canceled"
time="2024-10-01T18:35:17Z" level=error msg="finished streaming call" connect.code=internal connect.duration=3.791075582s connect.method=WatchStages connect.service=akuity.io.kargo.service.v1alpha1.KargoService connect.start_time="2024-10-01T18:35:13Z" error="internal: context canceled"
I figured it was a resource limitation, but the controller seems to use as much memory as I give it: it was at 2Gi when I opened this issue and is currently at 10Gi. The interval refreshes seem a lot more consistent with more memory, but the above error happens any time a manual refresh is triggered. Once a manual refresh is triggered, I sometimes observe that interval refreshes stop working as well until I manually remove the refresh annotation.
How many tags in total are there in the targeted image repository?
The largest repository has about 1,100 tags total, with ~60 matching the filter, and ~1,500 images total (some untagged); lifecycle policies keep it around this number pretty consistently. I find that it also happens for projects whose warehouses have a very small number of tags, though.
Those API server logs do not appear related to this in any way. They look to me like the UI's client disconnecting abruptly.
I've updated image retention. The maximum number of tags in any repo Kargo uses is now 500, with around 30 matching in any repo, but the issue persists.
I would be very surprised if this weren't simply that discovery is taking a very long time, with rate limiting possibly being a factor. Out of curiosity, are you using one of the public registries?
Can you share an (anonymized) Warehouse that presents this issue? Trying to see if we can narrow this down, as the heaviness of the indexing depends partly on the selectors you use.
All images come from ECR. I do agree that it feels like a rate limit: it will sometimes discover newly tagged images immediately, while other times it takes 40 minutes. But I've reviewed the ECR rate limits in Service Quotas, and nothing stands out (the highest is ~10 req/s). I also checked GitHub App rate limits, thinking maybe it's discovering images but not manifests, but those are all very low as well.
Here is the warehouse definition:
```yaml
apiVersion: kargo.akuity.io/v1alpha1
kind: Warehouse
metadata:
  name: sources
  namespace: <project>
spec:
  freightCreationPolicy: Automatic
  interval: 5m0s
  subscriptions:
  - git:
      branch: main
      commitSelectionStrategy: NewestFromBranch
      discoveryLimit: 20
      includePaths:
      - regex:<app-manifest-path>/resources/.*
      - regex:<app-manifest-path>/services/.*
      - regex:<app-manifest-path>/charts/.*
      repoURL: https://github.com/<org>/<repo>
  - image:
      allowTags: >-
        ^production-[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}-[0-9]{2}-[0-9]{2}Z-[0-9a-f]{40}$
      discoveryLimit: 20
      gitRepoURL: https://github.com/<org>/<repo>
      imageSelectionStrategy: Lexical
      repoURL: <account-id>.dkr.ecr.us-west-2.amazonaws.com/<repo>
```
The `includePaths` for the git subscription are all scoped to only the manifests related to the applications managed by the stages that consume freight from this warehouse.
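For what it's worth, my mental model of `Lexical` selection with `allowTags` is roughly: filter the tag list by the regex, sort the survivors in descending lexical order, and keep at most `discoveryLimit` of them. A minimal sketch of that idea, not Kargo's actual code (the function name and sample tags are made up):

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// selectLexical is an illustrative sketch, not Kargo's implementation: keep
// only tags matching allowTags, sort them in descending lexical order, and
// return at most limit of them.
func selectLexical(tags []string, allowTags string, limit int) ([]string, error) {
	re, err := regexp.Compile(allowTags)
	if err != nil {
		return nil, err
	}
	var matched []string
	for _, t := range tags {
		if re.MatchString(t) {
			matched = append(matched, t)
		}
	}
	sort.Sort(sort.Reverse(sort.StringSlice(matched)))
	if len(matched) > limit {
		matched = matched[:limit]
	}
	return matched, nil
}

func main() {
	tags := []string{
		"production-2024-09-18T15-40-00Z-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
		"production-2024-09-18T16-02-00Z-bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb",
		"latest",
	}
	allow := `^production-[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}-[0-9]{2}-[0-9]{2}Z-[0-9a-f]{40}$`
	selected, _ := selectLexical(tags, allow, 20)
	fmt.Println(selected) // lexically greatest (newest) tag first
}
```

If that model is roughly right, the ordering itself should be cheap; most of the discovery cost would presumably be in listing the tags from ECR, which is where rate limiting would show up.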
If it were the controller taking a long time to discover matching image tags, or being rate limited, I'd expect to see the "obtained credentials for image repo" log line for these warehouses when a new controller comes up, and then see discovery hang. When I delete a controller pod and a new one comes up, it often does not, and the affected/delayed warehouse continues to not get the latest freight. Whenever I do finally see that log line, the freight becomes available almost immediately.
> When I delete a controller pod and a new one comes up, it often does not
Warehouses are listed and enqueued for reconciliation on startup, of course, but the logic that checks whether spec.interval has elapsed since the last discovery still applies. So most Warehouse reconciliations are going to short-circuit on startup.
Looking more at the logic, I see this possibility:

- It's almost time to discover new Freight based on the time elapsed since the last discovery, but not quite.
- The Warehouse gets requeued for reconciliation in another `spec.interval`.

The end result could be up to 2 * `spec.interval` elapsing between discoveries.
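To make that concrete, here is a rough sketch of the requeue behavior being described and one natural fix. This is illustrative only, not the actual reconciler code, and the function names are made up:

```go
// Illustrative sketch of the requeue behavior described above; not the actual
// Kargo reconciler code.
package main

import (
	"fmt"
	"time"
)

// requeueAfter models the problematic behavior: whenever discovery is not yet
// due, requeue a full interval out, regardless of how soon it would be due.
func requeueAfter(interval, sinceLastDiscovery time.Duration) time.Duration {
	if sinceLastDiscovery >= interval {
		return 0 // due now: discover immediately
	}
	return interval // not due: wait another full interval
}

// requeueAfterFixed requeues only for the time remaining until discovery is
// due, so the gap between discoveries stays close to one interval.
func requeueAfterFixed(interval, sinceLastDiscovery time.Duration) time.Duration {
	if sinceLastDiscovery >= interval {
		return 0
	}
	return interval - sinceLastDiscovery
}

func main() {
	interval := 2 * time.Minute
	// A reconcile fires just before discovery is due (1m59s after the last one).
	sinceLast := interval - time.Second

	// Problematic: waits another full 2m, so ~3m59s elapses between
	// discoveries, approaching 2 * interval in the worst case.
	fmt.Println("buggy gap:", sinceLast+requeueAfter(interval, sinceLast))

	// Fixed: waits only the remaining 1s, so the gap stays at ~2m.
	fmt.Println("fixed gap:", sinceLast+requeueAfterFixed(interval, sinceLast))
}
```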
Fix incoming...
Unfortunately, I seem to be experiencing the same behavior in 1.0.1. Automatic freight discovery frequently takes 4x or more of the `spec.interval`, and a manually triggered refresh is unresponsive.