monitor boskos cleanup timing
Originally filed as https://github.com/kubernetes/test-infra/issues/14715 by @BenTheElder
What would you like to be added: export and graph metrics for boskos cleanup timing
Why is this needed: so we can determine whether this is increasing and whether we need to scale up the janitor or fix boskos (xref #14697)
Possibly this should also move to the new monitoring stack? cc @cjwagner @detiber
/area boskos
/assign @krzyzacy
cc @fejta @mm4tt
/kind feature
@ixdy: The label(s) area/boskos cannot be applied, because the repository doesn't have them.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/unassign /help-wanted
/help
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
/reopen /remove-lifecycle stale
@ixdy: Reopened this issue.
/lifecycle frozen
Is the idea here to add a Prometheus metric for this operation? @detiber
@cpanato I believe that to be the case, yes. That said, I haven't dug into how the existing metrics are exposed for boskos. The dashboards sit at monitoring.prow.k8s.io, though.
Hello @ixdy, should the metric in question be added in https://github.com/kubernetes-sigs/boskos/tree/master/cmd/cleaner, or does it belong in another part of the code? Maybe the first question is: is this still needed?
Sorry for the delay in response. To clarify, this would be metrics added to the janitor(s), not the (unfortunately named) cleaner component.
The basic gist is just adding some Prometheus metrics to the janitors, yes, but the primary challenge is that in some deployments (such as k8s.io prow) Boskos + the janitors run in a completely separate build cluster from the prow monitoring stack, which makes collecting these metrics more challenging, since they aren't directly accessible.
In the case of k8s.io prow, to collect metrics from the core boskos service, we expose the boskos metrics port on an external IP and then explicitly collect from that address. Since the janitors run as separate containers, we'd need to either expose an additional IP for each janitor (non-ideal) or set up some sort of collector for all of the boskos metrics (core and janitor) and then expose that to the prow monitoring stack. Alternatively, we could collect/push these metrics to the monitoring stack. [Note: I'm probably using the wrong Prometheus terminology here.]
Figuring all of this out is the harder aspect of this issue. If this sounds interesting to you, please take it on!
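For concreteness, a janitor-side metric could look roughly like the sketch below. This is only an illustration: the metric name (janitor_cleanup_duration_seconds), the labels, and the port are assumptions, not names taken from the existing boskos code.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// cleanupDuration tracks how long a single cleanup takes.
// The metric name and labels are hypothetical, not existing boskos names.
var cleanupDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "janitor_cleanup_duration_seconds",
		Help:    "Time taken to clean up a single resource.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 12), // 1s up to ~34min
	},
	[]string{"resource_type", "status"},
)

// timeCleanup wraps an arbitrary cleanup function and records its duration.
func timeCleanup(resourceType string, clean func() error) error {
	start := time.Now()
	err := clean()
	status := "success"
	if err != nil {
		status = "failure"
	}
	cleanupDuration.WithLabelValues(resourceType, status).Observe(time.Since(start).Seconds())
	return err
}

func main() {
	prometheus.MustRegister(cleanupDuration)

	// Example invocation; a real janitor would call timeCleanup in its work loop.
	_ = timeCleanup("gcp-project", func() error {
		time.Sleep(100 * time.Millisecond) // stand-in for actual cleanup work
		return nil
	})

	// Expose /metrics so Prometheus (or a federation proxy) can scrape it.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9091", nil))
}
```

The /metrics endpoint would then be scraped either directly (if the monitoring stack can reach the build cluster) or via whatever collector/federation setup the discussion above settles on.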
@ixdy thanks and my turn to say sorry for the delay 😄
There are two different things we need to do: the first is to add the metric in the janitor, and the second is the infrastructure part.
For the second I have a couple of questions:
- Is the janitor a cron process, or is it always up and running? If it is a cron job, we will need to use the Prometheus Pushgateway to send the metrics there, and then the monitoring cluster can scrape from there (see the sketch after this list).
- Is the cluster that runs boskos and the janitor the same? If so, to avoid exposing multiple load balancers, we can deploy Prometheus in that cluster to collect the metrics and expose it to be scraped by the main monitoring system, so we have just one LB entry point (Prometheus federation).
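If the janitor does turn out to be a one-shot/cron process, a push-based version could look roughly like this sketch. The Pushgateway URL, job name, and metric name are placeholders, not values from any actual deployment:

```go
package main

import (
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Hypothetical metric name; set once per run.
	runDuration := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "janitor_last_run_duration_seconds",
		Help: "Wall-clock duration of the last janitor run.",
	})

	start := time.Now()
	// ... perform the actual cleanup work here ...
	runDuration.Set(time.Since(start).Seconds())

	// Push once at the end of the run; the monitoring cluster then scrapes the
	// Pushgateway instead of the short-lived job itself.
	if err := push.New("http://pushgateway.monitoring.svc:9091", "boskos_janitor").
		Collector(runDuration).
		Grouping("resource_type", "gcp-project").
		Push(); err != nil {
		log.Printf("could not push metrics to Pushgateway: %v", err)
	}
}
```

Long-running janitors, by contrast, could simply expose /metrics as in the earlier sketch.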
I will work on the first part (adding the metrics) while we discuss the second, if that sounds good to you.
thanks!
/assign /remove-help
- Is the janitor a cron process or is it always up and running?
It depends. There are 3 (or 4) different janitor endpoints right now:
- a. cmd/aws-janitor: one-shot command that cleans up an AWS account, optionally specifying a region.
- b. cmd/aws-janitor-boskos: long-lived process which queries Boskos (using its API) for AWS regions that are in the dirty state, cleaning up the relevant region using the same library as (a) and then returning the region in Boskos to the free state.
- c. cmd/janitor/gcp_janitor.py: one-shot python script which cleans up the provided GCP project(s). Eventually should be rewritten in Go, probably.
- d. cmd/janitor: resource-agnostic janitor that queries Boskos (using its API) for resources of a specified type that are in a dirty state, passing them to a specified janitor command to clean up, returning them to Boskos in the free state (assuming the janitor command exited successfully). Defaults to calling the gcp_janitor.py script, but can potentially call any other one-shot janitor (e.g. the AWS janitor from (a)).
The one-shot janitors could be run as CronJobs, with or without Boskos (e.g. to manage AWS environments, GCP projects, etc. that are not managed by Boskos). The Boskos-specific janitors tend to run as long-running pods.
(So one follow-up question you might have: which janitor? The ones most relevant to this issue are probably cmd/aws-janitor-boskos and cmd/janitor, though hopefully you can generalize things enough to reduce the amount of duplicated code.)
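To illustrate what such a generalization might look like, here is a rough sketch of a shared timing helper around the dirty-to-free cycle. The boskosClient interface, the function names, and the metric name are stand-ins for illustration, not the actual Boskos client API:

```go
package janitor

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// recycleDuration measures the full dirty -> free cycle for one resource,
// regardless of which underlying janitor command performs the cleanup.
// The metric name is hypothetical.
var recycleDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "boskos_janitor_recycle_duration_seconds",
		Help:    "Time from acquiring a dirty resource to releasing it as free.",
		Buckets: prometheus.ExponentialBuckets(5, 2, 10), // 5s up to ~43min
	},
	[]string{"resource_type"},
)

func init() {
	prometheus.MustRegister(recycleDuration)
}

// boskosClient is the minimal surface this sketch needs; it stands in for the
// real Boskos client rather than matching its exact method signatures.
type boskosClient interface {
	AcquireDirty(rtype string) (name string, err error)
	ReleaseFree(name string) error
}

// recycleOne acquires one dirty resource, runs the provider-specific cleanup
// command, releases the resource back as free, and records how long the whole
// cycle took. Both cmd/janitor and cmd/aws-janitor-boskos could share a helper
// shaped roughly like this.
func recycleOne(c boskosClient, rtype string, clean func(name string) error) error {
	start := time.Now()
	defer func() {
		recycleDuration.WithLabelValues(rtype).Observe(time.Since(start).Seconds())
	}()

	name, err := c.AcquireDirty(rtype)
	if err != nil {
		return err
	}
	if err := clean(name); err != nil {
		// Leave the resource dirty so a later pass retries it.
		return err
	}
	return c.ReleaseFree(name)
}
```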
- Is the cluster that runs boskos and the janitor the same?
In general, yes, the janitors run in the same cluster as Boskos. This is because the necessary credentials/service accounts needed to interact with AWS accounts/GCP projects likely already exist in those clusters, as they are used by the test jobs.
thanks for the clarification @ixdy