Set up a budget and budget alerts
ref: https://cloud.google.com/billing/docs/how-to/budgets
Currently we review our billing reports at each meeting, which means we'll notice abnormalities within a 14-day window. As our utilization increases, it would be wise for us to use a budget and alerts to catch things sooner.
I tried experimenting with my account and didn't have sufficient privileges. We should start there.
/priority important-longterm /wg k8s-infra
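In the meantime, here is a rough sketch (not an official report, just an illustration) of a query that could surface abnormalities sooner than the every-other-week review: daily spend per project over the last 14 days, against the same billing export table used in the queries further down this thread.
-- Sketch only: daily cost per project over the last 14 days,
-- to catch spend spikes sooner than a biweekly meeting review.
select
  date(usage_start_time) as usage_date,
  project.name as project_name,
  sum(cost) as daily_cost
from
  `kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
  billing_account_id = "018801-93540E-22A20E"
  and usage_start_time >= timestamp_sub(current_timestamp(), interval 14 day)
group by
  usage_date, project_name
order by
  usage_date desc, daily_cost desc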
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
/assign @thockin
I'm assigning you to get your input on whether you think this is worth investing time in.
I think it is long-term valuable but not near-term
/remove-priority important-longterm /priority critical-urgent /milestone v1.23
We discussed last meeting that our spend looks like it's going to put us very near the threshold this year.
It's time to come up with a plan for how to make sure we don't cross it, and how to detect if we are about to. Maybe it's not worth implementing technically with cloud budgets, but we should then at least know what number over what period is a flashing danger sign, and have some kind of framework / guidance for what to do next once we see it.
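For what it's worth, that "flashing danger sign" number can be eyeballed straight from the billing export. A sketch (assuming the $3M/yr budget spread evenly, i.e. $250K/mo) that flags any invoice month over that run rate:
-- Sketch: monthly spend vs. an assumed $250K/mo run rate ($3M/yr divided by 12).
select
  invoice.month,
  sum(cost) as month_cost,
  sum(cost) > 250000 as over_monthly_run_rate
from
  `kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
  billing_account_id = "018801-93540E-22A20E"
group by
  invoice.month
order by
  invoice.month desc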
https://github.com/kubernetes/k8s.io/pull/2940 adds a monthly budget for k8s-infra as a whole. We'll get e-mail alerts if we hit 90% ($225K) for the month (which we have been crossing continually since August, but with no alerts set up) and 100% (which we crossed once in August, accidentally, due to 5k-node clusters hanging around for too long).
Our billing report doesn't do a great job of rolling up similar classes of projects, so I plugged the following into BigQuery:
select
sum(cost) as total_cost,
invoice.month,
case
when regexp_contains(project.name, r'k8s-infra-e2e-boskos-[0-9]+') then 'e2e-gce'
when regexp_contains(project.name, r'k8s-infra-e2e-boskos-gpu-[0-9]+') then 'e2e-gpu'
when regexp_contains(project.name, r'k8s-infra-e2e-boskos-scale-[0-9]+') then 'e2e-scale'
when regexp_contains(project.name, r'k8s-staging-.+') then 'staging'
when project.name = 'k8s-infra-e2e-scale-5k-project' then 'e2e-5k'
else project.name
end as project_type
from
`kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
billing_account_id = "018801-93540E-22A20E"
group by
invoice.month,
project_type
order by
invoice.month desc, total_cost desc
Then "explored in Data Studio" to come up with these charts (left is most recent):
[Stacked chart (keep in mind $3M / 12mo = $250K/mo for our budget)]
[Regular chart]
It's pretty clear our artifact hosting costs have been steadily growing. The 5k scale jobs pushed us over the limit in August, but even if we dropped those, we're going to hit our budget in a month if we do nothing about artifact hosting costs.
What bin is egress bandwidth going into? Would it be possible to get the artifacts broken out in terms of size in bytes instead of $/months?
What bin is egress bandwidth going into?
Egress is charged to the project hosting the artifacts being transferred, so regardless of which SKU it's billed against, it all goes against the k8s-artifacts-prod project
From https://datastudio.google.com/c/u/0/reporting/14UWSuqD5ef9E4LnsCD9uJWTPv8MHOA3e/page/bPVn
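If it helps to see which SKUs that k8s-artifacts-prod spend actually lands on (egress vs. storage vs. operations), here is a sketch of a per-SKU cost breakdown for the project, using the same export table:
-- Sketch: cost per SKU for the artifact hosting project, by invoice month.
select
  invoice.month,
  sku.description as sku,
  sum(cost) as total_cost
from
  `kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
  billing_account_id = "018801-93540E-22A20E"
  and project.name = 'k8s-artifacts-prod'
group by
  invoice.month, sku
order by
  invoice.month desc, total_cost desc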
Would it be possible to get the artifacts broken out in terms of size in bytes instead of $/months?
select
sum(cost) as total_cost,
sku.description as sku,
sum(usage.amount_in_pricing_units) amount,
usage.pricing_unit pricing_unit,
invoice.month,
from
`kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
billing_account_id = "018801-93540E-22A20E"
and project.name = 'k8s-artifacts-prod'
and usage.pricing_unit = 'gibibyte'
group by
invoice.month,
sku,
pricing_unit
order by
invoice.month desc, total_cost desc
The units here are GiB (the query filters on usage.pricing_unit = 'gibibyte').
From https://console.cloud.google.com/monitoring/metrics-explorer?pageState=%7B%22xyChart%22:%7B%22dataSets%22:%5B%7B%22timeSeriesFilter%22:%7B%22filter%22:%22metric.type%3D%5C%22storage.googleapis.com%2Fnetwork%2Fsent_bytes_count%5C%22%20resource.type%3D%5C%22gcs_bucket%5C%22%22,%22minAlignmentPeriod%22:%2260s%22,%22aggregations%22:%5B%7B%22perSeriesAligner%22:%22ALIGN_RATE%22,%22crossSeriesReducer%22:%22REDUCE_SUM%22,%22alignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%22resource.label.%5C%22bucket_name%5C%22%22%5D%7D,%7B%22perSeriesAligner%22:%22ALIGN_NONE%22,%22crossSeriesReducer%22:%22REDUCE_NONE%22,%22alignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%5D%7D%5D,%22pickTimeSeriesFilter%22:%7B%22rankingMethod%22:%22METHOD_MAX%22,%22numTimeSeries%22:%225%22,%22direction%22:%22TOP%22%7D%7D,%22targetAxis%22:%22Y1%22,%22plotType%22:%22LINE%22,%22legendTemplate%22:%22$%7Bresource.labels.bucket_name%7D%22%7D%5D,%22options%22:%7B%22mode%22:%22COLOR%22%7D,%22constantLines%22:%5B%5D,%22timeshiftDuration%22:%220s%22,%22y1Axis%22:%7B%22label%22:%22y1Axis%22,%22scale%22:%22LINEAR%22%7D%7D,%22isAutoRefresh%22:true,%22timeSelection%22:%7B%22timeRange%22:%226w%22%7D%7D&project=kubernetes-public
Bytes sent, top 5 buckets by max value over the last 6 weeks (I don't think our Cloud Monitoring retention goes further back than that)
I would defer to @BobyMCbobs and @Riaankl to provide a report on which specific artifacts are how large, and how often they're being transferred. That said, I think this is a problem of volume and not specific artifacts.
https://github.com/kubernetes/k8s.io/issues/1834#issuecomment-943836836 is our umbrella issue for mitigating artifact hosting costs by use of mirrors, which would allow us to mitigate costs due to large consumers by having them pull from mirrors located closer to them or on their own infra. The comment I'm linking posits that if we could use something like Cloud CDN we could also lower the cost of hosting regardless of where requests are coming from.
It is unclear whether this is possible for container images hosted at k8s.gcr.io, which are the vast majority of bytes transferred, as they live in a subdomain of gcr.io that I'm not sure we can take ownership of (i.e. replace the endpoint); my understanding is it was provided to us internally.
@jhoblitt we have a report on artifact traffic. The data runs from 9 April until Sept 2021.
There are several graphs and tables. Here are tables that might answer some of your questions:
@spiffxp Thanks for doing that extra analysis. I agree that this sounds more like a pure popularity problem rather than bloated artifacts. I'm not sure what a fitted slope works out to, but I'm going to guess that transfers will grow faster than GCP bandwidth prices decrease in the near term, and will eventually exceed the total cost envelope. Has there been any discussion of moving away from gcr.io? I would easily believe it will take > 3 years to shift the majority of pulls over to a k8s project registry.
@Riaankl I was wondering if there were large artifacts that could be put on a diet but nothing is showing up in the top 10.
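One rough way to put a number on that slope from the billing export (a sketch, not a fitted model): month-over-month growth in gibibytes transferred out of k8s-artifacts-prod, using a window function over the monthly totals.
-- Sketch: month-over-month growth rate of GiB transferred from k8s-artifacts-prod.
select
  invoice.month,
  sum(usage.amount_in_pricing_units) as gib_transferred,
  sum(usage.amount_in_pricing_units)
    / nullif(lag(sum(usage.amount_in_pricing_units)) over (order by invoice.month), 0)
    - 1 as month_over_month_growth
from
  `kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
  billing_account_id = "018801-93540E-22A20E"
  and project.name = 'k8s-artifacts-prod'
  and usage.pricing_unit = 'gibibyte'
group by
  invoice.month
order by
  invoice.month desc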
With #1834 we aim to get 2-3 redirector POCs up, whereby cloud providers could host local copies of the artifacts, and routing is handled by the redirector based on the requesting IP's ASN information. Therefore the load is spread across all providers. Ideally the complete set of artifacts should be hosted by the participants.
80% of the traffic is related to <30 images.
/milestone v1.24
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/milestone clear
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale /lifecycle frozen
- as per https://github.com/kubernetes/k8s.io/issues/1375#issuecomment-943644748 yes, we have alerts
- spend breakdown: https://datastudio.google.com/c/u/0/reporting/14UWSuqD5ef9E4LnsCD9uJWTPv8MHOA3e1
What exactly do we see as outstanding here?
spend breakdown: https://datastudio.google.com/c/u/0/reporting/14UWSuqD5ef9E4LnsCD9uJWTPv8MHOA3e1
FWIW I can't access this
What exactly do we see as outstanding here?
I agree with capping this off as the first pass. I think we'll want to revisit how we track our budget in the new year, and that should probably be a separate issue.
Things you might want to consider before capping this off:
- The current budget alerts at 90% and 100% of $250K/mo ($3M/yr). Since we're running over that rate, the alerts are going to be noise for those watching "are we out of credits for the year". Disable it and set up a new budget that tracks our remaining spend for the year (a sketch of such a query is below)?
- The alerts currently get sent out to k8s-infra leads; consider adding a wider audience?
I'll leave it to @ameukam or others to close if you're fine with this as-is.
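If someone does pick up the "remaining spend for the year" idea, a sketch of what that check could look like against the billing export (this assumes the $3M/yr figure and calendar-year accounting; adjust if the budget year differs):
-- Sketch: year-to-date spend and remaining annual budget (assumes $3M/yr).
select
  substr(invoice.month, 1, 4) as invoice_year,
  sum(cost) as year_to_date_cost,
  3000000 - sum(cost) as remaining_annual_budget
from
  `kubernetes-public.kubernetes_public_billing.gcp_billing_export_v1_018801_93540E_22A20E`
where
  billing_account_id = "018801-93540E-22A20E"
group by
  invoice_year
order by
  invoice_year desc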
ACK -- Sorry that link should've been https://datastudio.google.com/c/u/0/reporting/14UWSuqD5ef9E4LnsCD9uJWTPv8MHOA3e/page/tPVn
I think it's ok to close this. Let's revisit budget tracking for next year in a separate issue. The different attempts to move workloads to different cloud providers will hopefully impact the overall 2023 budget.
/close
@ameukam: Closing this issue.
In response to this:
I think it's ok to close this. Let's revisit budget tracking for next year in a separate issue. The different attempts to move workloads to different cloud providers will hopefully impact the overall 2023 budget.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.