Set up a budget and budget alerts
ref: https://cloud.google.com/billing/docs/how-to/budgets
Currently we review our billing reports at each meeting, which means we'll notice abnormalities within a 14-day window. As our utilization increases, it would be wise for us to use a budget and alerts to catch things sooner.
I tried experimenting with my account and didn't have sufficient privileges. We should start there.
/priority important-longterm /wg k8s-infra
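In the meantime, here is a rough sketch (not an official report, just an illustration) of a query that could surface abnormalities sooner than the every-other-week review: daily spend per project over the last 14 days, against the same billing export table used in the queries further down this thread.
-- Sketch only: daily cost per project over the last 14 days,
-- to catch spend spikes sooner than a biweekly meeting review.
select
  date(usage_start_time) as usage_date,
  project.name as project_name,
  sum(cost) as daily_cost
from
  `kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
  billing_account_id = "018801-93540E-22A20E"
  and usage_start_time >= timestamp_sub(current_timestamp(), interval 14 day)
group by
  usage_date, project_name
order by
  usage_date desc, daily_cost desc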
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
/assign @thockin
I'm assigning you to get your input on whether you think this is worth investing time in.
I think it is long-term valuable but not near-term
/remove-priority important-longterm /priority critical-urgent /milestone v1.23
We discussed last meeting that our spend looks like it's going to put us very near the threshold this year.
It's time to come up with a plan for how to make sure we don't cross it, and how to detect if we are about to. Maybe it's not worth implementing technically with cloud budgets, but we should then at least know what number over what period is a flashing danger sign, and have some kind of framework / guidance for what to do next once we see it.
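For what it's worth, that "flashing danger sign" number can be eyeballed straight from the billing export. A sketch (assuming the $3M/yr budget spread evenly, i.e. $250K/mo) that flags any invoice month over that run rate:
-- Sketch: monthly spend vs. an assumed $250K/mo run rate ($3M/yr divided by 12).
select
  invoice.month,
  sum(cost) as month_cost,
  sum(cost) > 250000 as over_monthly_run_rate
from
  `kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
  billing_account_id = "018801-93540E-22A20E"
group by
  invoice.month
order by
  invoice.month desc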
https://github.com/kubernetes/k8s.io/pull/2940 adds a monthly budget for k8s-infra as a whole. We'll get e-mail alerts if we hit 90% ($225K) for the month (which we have been crossing continually since August, but with no alerts set up) and 100% (which we crossed once in August, accidentally, due to 5k-node clusters hanging around for too long).
Our billing report doesn't do a great job of rolling up similar classes of projects, so I plugged the following into BigQuery:
select
sum(cost) as total_cost,
invoice.month,
case
when regexp_contains(project.name, r'k8s-infra-e2e-boskos-[0-9]+') then 'e2e-gce'
when regexp_contains(project.name, r'k8s-infra-e2e-boskos-gpu-[0-9]+') then 'e2e-gpu'
when regexp_contains(project.name, r'k8s-infra-e2e-boskos-scale-[0-9]+') then 'e2e-scale'
when regexp_contains(project.name, r'k8s-staging-.+') then 'staging'
when project.name = 'k8s-infra-e2e-scale-5k-project' then 'e2e-5k'
else project.name
end as project_type
from
`kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
billing_account_id = "018801-93540E-22A20E"
group by
invoice.month,
project_type
order by
invoice.month desc, total_cost desc
Then "explored in Data Studio" to come up with these charts (left is most recent):
[Stacked chart (keep in mind $3M / 12mo = $250K/mo for our budget)]
[Regular chart]
It's pretty clear our artifact hosting costs have been steadily growing. The 5k scale jobs pushed us over the limit in August, but even if we dropped those, we're going to hit our budget in a month if we do nothing about artifact hosting costs.
What bin is egress bandwidth going into? Would it be possible to get the artifacts broken out in terms of size in bytes instead of $/months?
What bin is egress bandwidth going into?
Egress is charged to the project hosting the artifacts being transferred, so regardless of which SKU it's billed against, it all goes against the k8s-artifacts-prod project
From https://datastudio.google.com/c/u/0/reporting/14UWSuqD5ef9E4LnsCD9uJWTPv8MHOA3e/page/bPVn
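If it helps to see which SKUs that k8s-artifacts-prod spend actually lands on (egress vs. storage vs. operations), here is a sketch of a per-SKU cost breakdown for the project, using the same export table:
-- Sketch: cost per SKU for the artifact hosting project, by invoice month.
select
  invoice.month,
  sku.description as sku,
  sum(cost) as total_cost
from
  `kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
  billing_account_id = "018801-93540E-22A20E"
  and project.name = 'k8s-artifacts-prod'
group by
  invoice.month, sku
order by
  invoice.month desc, total_cost desc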
Would it be possible to get the artifacts broken out in terms of size in bytes instead of $/months?
select
sum(cost) as total_cost,
sku.description as sku,
sum(usage.amount_in_pricing_units) amount,
usage.pricing_unit pricing_unit,
invoice.month,
from
`kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
billing_account_id = "018801-93540E-22A20E"
and project.name = 'k8s-artifacts-prod'
and usage.pricing_unit = 'gibibyte'
group by
invoice.month,
sku,
pricing_unit
order by
invoice.month desc, total_cost desc
The units here are GiB (the query filters on usage.pricing_unit = 'gibibyte').
From https://console.cloud.google.com/monitoring/metrics-explorer?pageState=%7B%22xyChart%22:%7B%22dataSets%22:%5B%7B%22timeSeriesFilter%22:%7B%22filter%22:%22metric.type%3D%5C%22storage.googleapis.com%2Fnetwork%2Fsent_bytes_count%5C%22%20resource.type%3D%5C%22gcs_bucket%5C%22%22,%22minAlignmentPeriod%22:%2260s%22,%22aggregations%22:%5B%7B%22perSeriesAligner%22:%22ALIGN_RATE%22,%22crossSeriesReducer%22:%22REDUCE_SUM%22,%22alignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%22resource.label.%5C%22bucket_name%5C%22%22%5D%7D,%7B%22perSeriesAligner%22:%22ALIGN_NONE%22,%22crossSeriesReducer%22:%22REDUCE_NONE%22,%22alignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%5D%7D%5D,%22pickTimeSeriesFilter%22:%7B%22rankingMethod%22:%22METHOD_MAX%22,%22numTimeSeries%22:%225%22,%22direction%22:%22TOP%22%7D%7D,%22targetAxis%22:%22Y1%22,%22plotType%22:%22LINE%22,%22legendTemplate%22:%22$%7Bresource.labels.bucket_name%7D%22%7D%5D,%22options%22:%7B%22mode%22:%22COLOR%22%7D,%22constantLines%22:%5B%5D,%22timeshiftDuration%22:%220s%22,%22y1Axis%22:%7B%22label%22:%22y1Axis%22,%22scale%22:%22LINEAR%22%7D%7D,%22isAutoRefresh%22:true,%22timeSelection%22:%7B%22timeRange%22:%226w%22%7D%7D&project=kubernetes-public
Bytes sent, top 5 buckets by max value over the last 6 weeks (I don't think our Cloud Monitoring retention goes further back than that)
I would defer to @BobyMCbobs and @Riaankl to provide a report on which specific artifacts are how large, and how often they're being transferred. That said, I think this is a problem of volume and not specific artifacts.
https://github.com/kubernetes/k8s.io/issues/1834#issuecomment-943836836 is our umbrella issue for mitigating artifact hosting costs by use of mirrors, which would allow us to mitigate costs due to large consumers by having them pull from mirrors located closer to them or on their own infra. The comment I'm linking posits that if we could use something like Cloud CDN we could also lower the cost of hosting regardless of where requests are coming from.
It is unclear whether this is possible for container images hosted at k8s.gcr.io, which are the vast majority of bytes transferred, as they live in a subdomain of gcr.io that I'm not sure we can take ownership of (i.e. replace the endpoint); my understanding is it was provided to us internally.
@jhoblitt we have a report on artifact traffic. The data runs from 9 April until Sept 2021.
There are several graphs and tables. Here are tables that might answer some of your questions:
@spiffxp Thanks for doing that extra analysis. I agree that this sounds more like a pure popularity problem rather than bloated artifacts. I'm not sure what a fitted slope works out to, but I'm going to guess that transfers will grow faster than GCP bandwidth prices decrease in the near term, and will eventually exceed the total cost envelope. Has there been any discussion of moving away from gcr.io? I would easily believe it will take > 3 years to shift the majority of pulls over to a k8s project registry.
@Riaankl I was wondering if there were large artifacts that could be put on a diet but nothing is showing up in the top 10.
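One rough way to put a number on that slope from the billing export (a sketch, not a fitted model): month-over-month growth in gibibytes transferred out of k8s-artifacts-prod, using a window function over the monthly totals.
-- Sketch: month-over-month growth rate of GiB transferred from k8s-artifacts-prod.
select
  invoice.month,
  sum(usage.amount_in_pricing_units) as gib_transferred,
  sum(usage.amount_in_pricing_units)
    / nullif(lag(sum(usage.amount_in_pricing_units)) over (order by invoice.month), 0)
    - 1 as month_over_month_growth
from
  `kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
  billing_account_id = "018801-93540E-22A20E"
  and project.name = 'k8s-artifacts-prod'
  and usage.pricing_unit = 'gibibyte'
group by
  invoice.month
order by
  invoice.month desc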
With #1834 we aim to get 2-3 redirector POCs up, whereby cloud providers could host local copies of the artifacts, and routing is handled by the redirector based on the requesting IP's ASN information. Therefore the load is spread across all providers. Ideally the complete set of artifacts should be hosted by the participants.
80% of the traffic is related to <30 images.
/milestone v1.24
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/milestone clear
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale /lifecycle frozen
- as per https://github.com/kubernetes/k8s.io/issues/1375#issuecomment-943644748 yes, we have alerts
- spend breakdown: https://datastudio.google.com/c/u/0/reporting/14UWSuqD5ef9E4LnsCD9uJWTPv8MHOA3e1
What exactly do we see as outstanding here?
spend breakdown: https://datastudio.google.com/c/u/0/reporting/14UWSuqD5ef9E4LnsCD9uJWTPv8MHOA3e1
FWIW I can't access this
What exactly do we see as outstanding here?
I agree with capping this off as the first pass. I think we'll want to revisit how we track our budget in the new year, and that should probably be a separate issue.
Things you might want to consider before capping this off:
- The current budget alerts at 90% and 100% of $250K/mo ($3M/yr). Since we're running over that rate, the alerts are going to be noise for those watching "are we out of credits for the year". Disable it and set up a new budget that tracks our remaining spend for the year (a sketch of such a query is below)?
- The alerts currently get sent out to k8s-infra leads; consider adding a wider audience?
I'll leave it to @ameukam or others to close if you're fine with this as-is.
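If someone does pick up the "remaining spend for the year" idea, a sketch of what that check could look like against the billing export (this assumes the $3M/yr figure and calendar-year accounting; adjust if the budget year differs):
-- Sketch: year-to-date spend and remaining annual budget (assumes $3M/yr).
select
  substr(invoice.month, 1, 4) as invoice_year,
  sum(cost) as year_to_date_cost,
  3000000 - sum(cost) as remaining_annual_budget
from
  `kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
  billing_account_id = "018801-93540E-22A20E"
group by
  invoice_year
order by
  invoice_year desc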
ACK -- Sorry that link should've been https://datastudio.google.com/c/u/0/reporting/14UWSuqD5ef9E4LnsCD9uJWTPv8MHOA3e/page/tPVn
I think it's ok to close this. Let's revisit budget tracking for next year in a separate issue. The different attempts to move workloads to different cloud providers will hopefully impact the overall 2023 budget.
/close
@ameukam: Closing this issue.
In response to this:
I think it's ok to close this. Let's revisit budget tracking for next year in a separate issue. The different attempts to move workloads to different cloud providers will hopefully impact the overall 2023 budget.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.