monitor boskos cleanup timing
Originally filed as https://github.com/kubernetes/test-infra/issues/14715 by @BenTheElder
What would you like to be added: export and graph metrics for boskos cleanup timing
Why is this needed: so we can determine whether this is increasing and whether we need to scale up the janitor or fix boskos (xref #14697)
Possibly this should also move to the new monitoring stack? cc @cjwagner @detiber
/area boskos
/assign @krzyzacy
cc @fejta @mm4tt
/kind feature
@ixdy: The label(s) area/boskos cannot be applied, because the repository doesn't have them.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/unassign /help-wanted
/help
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
/reopen /remove-lifecycle stale
@ixdy: Reopened this issue.
/lifecycle frozen
Is the idea here to add a Prometheus metric for this operation? @detiber
@cpanato I believe that to be the case, yes. That said, I haven't dug into how the existing metrics are exposed for boskos. The dashboards sit at monitoring.prow.k8s.io, though.
Hello @ixdy, should the metric in question be added in https://github.com/kubernetes-sigs/boskos/tree/master/cmd/cleaner, or does it belong in another part of the code? Maybe the first question is: is this still needed?
Sorry for the delay in response. To clarify, this would be metrics added to the janitor(s), not the (unfortunately named) cleaner component.
The basic gist is just adding some Prometheus metrics to the janitors, yes, but the primary challenge is that in some deployments (such as k8s.io prow) Boskos + the janitors run in a completely separate build cluster from the prow monitoring stack, which makes collecting these metrics more challenging, since they aren't directly accessible.
In the case of k8s.io prow, to collect metrics from the core boskos service, we expose the boskos metrics port on an external IP and then explicitly collect from that address. Since the janitors run as separate containers, we'd need to either expose an additional IP for each janitor (non-ideal) or set up some sort of collector for all of the boskos metrics (core and janitor) and then expose that to the prow monitoring stack. Alternatively, we could collect/push these metrics to the monitoring stack. [Note: I'm probably using the wrong Prometheus terminology here.]
Figuring all of this out is the harder aspect of this issue. If this sounds interesting to you, please take it on!
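For concreteness, a janitor-side metric could look roughly like the sketch below. This is only an illustration: the metric name (janitor_cleanup_duration_seconds), the labels, and the port are assumptions, not names taken from the existing boskos code.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// cleanupDuration tracks how long a single cleanup takes.
// The metric name and labels are hypothetical, not existing boskos names.
var cleanupDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "janitor_cleanup_duration_seconds",
		Help:    "Time taken to clean up a single resource.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 12), // 1s up to ~34min
	},
	[]string{"resource_type", "status"},
)

// timeCleanup wraps an arbitrary cleanup function and records its duration.
func timeCleanup(resourceType string, clean func() error) error {
	start := time.Now()
	err := clean()
	status := "success"
	if err != nil {
		status = "failure"
	}
	cleanupDuration.WithLabelValues(resourceType, status).Observe(time.Since(start).Seconds())
	return err
}

func main() {
	prometheus.MustRegister(cleanupDuration)

	// Example invocation; a real janitor would call timeCleanup in its work loop.
	_ = timeCleanup("gcp-project", func() error {
		time.Sleep(100 * time.Millisecond) // stand-in for actual cleanup work
		return nil
	})

	// Expose /metrics so Prometheus (or a federation proxy) can scrape it.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9091", nil))
}
```

The /metrics endpoint would then be scraped either directly (if the monitoring stack can reach the build cluster) or via whatever collector/federation setup the discussion above settles on.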
@ixdy thanks and my turn to say sorry for the delay 😄
There are two different things we need to do: the first is to add the metric in the janitor, and the second is the infrastructure part.
For the second I have a couple of questions:
- Is the janitor a cron process, or is it always up and running? If it is a cron job, we will need to use the Prometheus Pushgateway to send the metrics there, and then the monitoring cluster can scrape from there (see the sketch after this list).
- Is the cluster that runs boskos and the janitor the same? If so, to avoid exposing multiple load balancers, we can deploy Prometheus in that cluster to collect the metrics and expose it to be scraped by the main monitoring system, so we have just one LB entry point (Prometheus federation).
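If the janitor does turn out to be a one-shot/cron process, a push-based version could look roughly like this sketch. The Pushgateway URL, job name, and metric name are placeholders, not values from any actual deployment:

```go
package main

import (
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Hypothetical metric name; set once per run.
	runDuration := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "janitor_last_run_duration_seconds",
		Help: "Wall-clock duration of the last janitor run.",
	})

	start := time.Now()
	// ... perform the actual cleanup work here ...
	runDuration.Set(time.Since(start).Seconds())

	// Push once at the end of the run; the monitoring cluster then scrapes the
	// Pushgateway instead of the short-lived job itself.
	if err := push.New("http://pushgateway.monitoring.svc:9091", "boskos_janitor").
		Collector(runDuration).
		Grouping("resource_type", "gcp-project").
		Push(); err != nil {
		log.Printf("could not push metrics to Pushgateway: %v", err)
	}
}
```

Long-running janitors, by contrast, could simply expose /metrics as in the earlier sketch.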
I will work on the first part (adding the metrics) while we discuss the second, if that sounds good to you.
thanks!
/assign /remove-help
- Is the janitor a cron process or is it always up and running?
It depends. There are 3 (or 4) different janitor endpoints right now:
- a. cmd/aws-janitor: one-shot command that cleans up an AWS account, optionally specifying a region.
- b. cmd/aws-janitor-boskos: long-lived process which queries Boskos (using its API) for AWS regions that are in the dirty state, cleaning up the relevant region using the same library as (a) and then returning the region in Boskos to the free state.
- c. cmd/janitor/gcp_janitor.py: one-shot python script which cleans up the provided GCP project(s). Eventually should be rewritten in Go, probably.
- d. cmd/janitor: resource-agnostic janitor that queries Boskos (using its API) for resources of a specified type that are in a dirty state, passing them to a specified janitor command to clean up, returning them to Boskos in the free state (assuming the janitor command exited successfully). Defaults to calling the gcp_janitor.py script, but can potentially call any other one-shot janitor (e.g. the AWS janitor from (a)).
The one-shot janitors could be run as CronJobs, with or without Boskos (e.g. to manage AWS environments, GCP projects, etc. that are not managed by Boskos). The Boskos-specific janitors tend to run as long-running pods.
(So one follow-up question you might have: which janitor? The ones most relevant to this issue are probably cmd/aws-janitor-boskos and cmd/janitor, though hopefully you can generalize things enough to reduce the amount of duplicated code.)
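To illustrate what such a generalization might look like, here is a rough sketch of a shared timing helper around the dirty-to-free cycle. The boskosClient interface, the function names, and the metric name are stand-ins for illustration, not the actual Boskos client API:

```go
package janitor

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// recycleDuration measures the full dirty -> free cycle for one resource,
// regardless of which underlying janitor command performs the cleanup.
// The metric name is hypothetical.
var recycleDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "boskos_janitor_recycle_duration_seconds",
		Help:    "Time from acquiring a dirty resource to releasing it as free.",
		Buckets: prometheus.ExponentialBuckets(5, 2, 10), // 5s up to ~43min
	},
	[]string{"resource_type"},
)

func init() {
	prometheus.MustRegister(recycleDuration)
}

// boskosClient is the minimal surface this sketch needs; it stands in for the
// real Boskos client rather than matching its exact method signatures.
type boskosClient interface {
	AcquireDirty(rtype string) (name string, err error)
	ReleaseFree(name string) error
}

// recycleOne acquires one dirty resource, runs the provider-specific cleanup
// command, releases the resource back as free, and records how long the whole
// cycle took. Both cmd/janitor and cmd/aws-janitor-boskos could share a helper
// shaped roughly like this.
func recycleOne(c boskosClient, rtype string, clean func(name string) error) error {
	start := time.Now()
	defer func() {
		recycleDuration.WithLabelValues(rtype).Observe(time.Since(start).Seconds())
	}()

	name, err := c.AcquireDirty(rtype)
	if err != nil {
		return err
	}
	if err := clean(name); err != nil {
		// Leave the resource dirty so a later pass retries it.
		return err
	}
	return c.ReleaseFree(name)
}
```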
- Is the cluster that runs boskos and the janitor the same?
In general, yes, the janitors run in the same cluster as Boskos. This is because the necessary credentials/service accounts needed to interact with AWS accounts/GCP projects likely already exist in those clusters, as they are used by the test jobs.
thanks for the clarification @ixdy