
Auditor not querying GCR

Open · tylerferrara opened this issue 2 years ago · 10 comments

What happened:

The GCP project k8s-artifacts-prod shows no recent requests to its GCR API. This is unexpected behavior, as the auditor should be continuously verifying new GCR state changes. When it sees a new image that is not declared in k/k8s.io, it consults GCR to verify that image. Since there have not been ANY queries to GCR in 30+ days, the auditor must not be running correctly.

Related to k/k8s.io issue: #2364
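For context, the existence check described above amounts to a Docker Registry v2 query against GCR. A minimal sketch of that kind of lookup (the function name and endpoint handling here are illustrative, not the promoter's actual code):

```go
package main

import (
	"fmt"
	"net/http"
)

// imageKnownToGCR asks GCR's Docker Registry v2 endpoint whether a
// repository exists, e.g. repo = "k8s-artifacts-prod/pause".
// Illustrative only; not the auditor's real implementation.
func imageKnownToGCR(repo string) (bool, error) {
	url := fmt.Sprintf("https://gcr.io/v2/%s/tags/list", repo)
	resp, err := http.Get(url) // anonymous access works for public repos
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	// 200 means the repository exists and its tags were listed.
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	ok, err := imageKnownToGCR("k8s-artifacts-prod/pause")
	fmt.Println("image known to GCR:", ok, err)
}
```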

What you expected to happen:

The Container Registry API should show some volume of requests from the auditor.

How to reproduce it (as minimally and precisely as possible):

N/A

Anything else we need to know?:

(Screenshots: the Container Registry API request metrics for k8s-artifacts-prod over the past 30 days, and the project's quota page filtered to the Container Registry API.)

Environment:

  • Cloud provider or hardware configuration: GCP

cc: @listx @amwat @justaugustus @kubernetes-sigs/release-engineering

tylerferrara · Jul 19 '21

Thinking about this again, I think it's because the auditor queries the staging GCRs, not prod. See https://github.com/kubernetes-sigs/k8s-container-image-promoter/blob/bed8661305e1e325d0dd22b61e0cea2967b0f0e4/legacy/audit/auditor.go#L261-L266. The subproject here means the staging repo that belongs to the subproject, not production.
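In other words, a subproject resolves to its staging registry, not prod. A hedged paraphrase of that mapping (the helper name is made up; the k8s-staging-&lt;name&gt; pattern is the project's staging naming convention):

```go
package main

import "fmt"

// stagingRepoFor paraphrases the mapping the linked auditor.go lines
// perform: a subproject resolves to its *staging* GCR, never to prod.
// The helper name is illustrative, not copied from the real code.
func stagingRepoFor(subproject string) string {
	// Kubernetes staging registries follow the k8s-staging-<subproject> pattern.
	return "gcr.io/k8s-staging-" + subproject
}

func main() {
	// The auditor queries this registry, so no traffic appears on
	// gcr.io/k8s-artifacts-prod's Container Registry API metrics.
	fmt.Println(stagingRepoFor("kube-scheduler")) // gcr.io/k8s-staging-kube-scheduler
}
```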

listx · Jul 19 '21

@listx appears to be correct! Here we see two functions that make HTTP requests to GCR directly, and I don't believe those requests are monitored by the Container Registry API, which would explain why the auditor is not triggering these metrics. PR #356 aims to gain visibility here by logging the total number of GCR requests per 10-minute period. Since that is the window GCR uses to measure its quotas, the logs will show how close the auditor gets to the upper limit of 50,000 requests per 10 minutes.
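Roughly, the logging can be pictured like this (a sketch under my own naming, not the PR's actual code):

```go
package main

import (
	"log"
	"sync/atomic"
	"time"
)

// gcrRequests counts outgoing GCR requests; it would be incremented
// wherever the auditor issues one (call sites omitted in this sketch).
var gcrRequests atomic.Int64

// logQuotaUsage logs and resets the counter every 10 minutes, the same
// window GCR uses to meter its 50,000-request quota.
func logQuotaUsage() {
	for range time.Tick(10 * time.Minute) {
		n := gcrRequests.Swap(0)
		log.Printf("GCR requests in last 10m: %d / 50000", n)
	}
}

func main() {
	go logQuotaUsage()
	gcrRequests.Add(1) // one increment per outgoing GCR request
	select {}          // keep running (the auditor is a long-lived service)
}
```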

tylerferrara · Jul 21 '21

Deploying PR #383 should help us gain insight into the auditor's network usage.

tylerferrara · Jul 30 '21

According to the logs, the auditor is periodically restarting. It doesn't show any error or failing behavior, which makes a strong case that Cloud Run is throttling it. Cloud Run's auto-scaling may be giving us this unwanted behavior: looking at the metrics, the auditor uses so little CPU and receives no requests that Cloud Run scales it down to idle. Although the docs suggest specifying --min-instances to keep an instance "permanently available", setting it in our deploy script does not seem to have that effect.

tylerferrara · Aug 04 '21

We have successfully captured an instance where the Auditor has exceeded quota!

This was during the audit of mock/kube-scheduler. (Screenshot: Cloud Run logs for the cip-auditor service in us-central1, project k8s-artifacts-prod.)

tylerferrara · Aug 05 '21

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Dec 29 '21

/remove-lifecycle stale

cpanato · Dec 29 '21

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Mar 29 '22

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jul 05 '22

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Aug 04 '22

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot · Sep 03 '22

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Sep 03 '22