Refactor infra/gcp/...

Open thockin opened this issue 5 years ago • 20 comments

Right now it is set up as "concept-first". For example "ensure-staging" says "all of these are staging-like" and "ensure-prod" says "all of these are prod-like". That makes it hard to get a sense of what any one project has going on.

I propose to refactor it to "project-first". One list of projects and each one says "I am prod like" or "I am staging like". Then I could simply say ensure-project k8s-foo-bar and all of the properties would be asserted.
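To make the idea concrete, here is a minimal sketch of what "project-first" could look like in bash. Everything in it is hypothetical: the project names, the `project_kind` lookup, and the two `ensure_*_properties` helpers are placeholders for illustration, not code that exists in infra/gcp.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a "project-first" layout. The project names and the
# ensure_*_properties helpers are placeholders, not code from infra/gcp.

# The single list of projects: each entry declares its own kind.
project_kind() {
  case "$1" in
    k8s-foo-bar)  echo "staging" ;;
    k8s-foo-prod) echo "prod" ;;
    *)            return 1 ;;
  esac
}

# Stand-ins for whatever properties each kind of project should have.
ensure_staging_properties() { echo "asserting staging properties on $1"; }
ensure_prod_properties()    { echo "asserting prod properties on $1"; }

# Look up the project's kind, then assert every property that kind implies.
ensure_project() {
  local project="$1"
  local kind
  if ! kind="$(project_kind "${project}")"; then
    echo "unknown project: ${project}" >&2
    return 1
  fi
  case "${kind}" in
    staging) ensure_staging_properties "${project}" ;;
    prod)    ensure_prod_properties "${project}" ;;
  esac
}
```

With this shape, everything known about one project sits on a single line of `project_kind`, and `ensure_project k8s-foo-bar` is the only entry point needed.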

This is very close to terraform territory, but I don't know TF well enough to write the "utility" functions without so much duplication. @cblecker - is this worth pursuing?

thockin avatar Dec 16 '19 22:12 thockin

Yes, it is absolutely worth pursuing. The bash ensure stuff is getting out of hand IMO.

cblecker avatar Dec 16 '19 23:12 cblecker

Shall we discuss the idea sometime and you can either volunteer yourself or write enough that we can solicit other volunteers?

thockin avatar Dec 16 '19 23:12 thockin

@thockin I was just walking through the issues; maybe this is related: https://github.com/kubernetes/k8s.io/pull/523

I also saw the discussion on the mailing list. In my opinion Terraform is the best way to do this :)

I don't think I know enough to help with the Terraform stuff, but I'm putting the PR here (again) so we can follow up.

rikatz avatar Jan 09 '20 03:01 rikatz

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot avatar Apr 08 '20 03:04 fejta-bot

/remove-lifecycle stale

bartsmykla avatar Apr 08 '20 03:04 bartsmykla

/area cluster-mgmt
/area cluster-infra
/kind cleanup

spiffxp avatar Apr 15 '20 19:04 spiffxp

Having dipped my toes into adding to this mess:

  • I hated concept-first at first, and thought project-first was a great idea
  • I'm now less sure, as some concepts span projects (image promotion needs access to all these, prow/boskos will need access to all those)
  • I suspect the parts of our shell scripts that are for loops could correspond well to terraform modules
  • I just need to say out loud that I am still a little freaked out by terraform. It's this whole other ecosystem that has churn and will need to be kept current. I wasn't confident enough to try migrating the google provider from 2.x to 3.x for aaa without maybe accidentally blowing away the cluster. OTOH as we write more bash in lib*.sh files we're also creating our own ecosystem with possibly inconsistent naming, lack of testing, etc.
  • I fell back to using a shell script for creating projects, but when I have time would be willing to see what rewriting that as terraform would be like

spiffxp avatar Apr 28 '20 17:04 spiffxp

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot avatar Jul 27 '20 18:07 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot avatar Aug 26 '20 19:08 fejta-bot

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

fejta-bot avatar Sep 25 '20 20:09 fejta-bot

@fejta-bot: Closing this issue.

k8s-ci-robot avatar Sep 25 '20 20:09 k8s-ci-robot

/remove-lifecycle rotten
/lifecycle frozen
/priority important-longterm

I don't have the bandwidth for this, and this issue is maybe too broad to stay open, but I think the point stands that:

  • continuing with our bash as-is will deepen our tech debt
  • we aren't comfortable enough with our bash to allow automation to run it

Whatever we use, even if it's bash, we need:

  • tests to enforce conventions
  • tests that build trust in our ability to refactor
  • tests that build trust in our ability to have automation run this
  • confirmation of what changes will result

In an ideal world, we would have:

  • ability to reconcile audit output with infra/gcp configs
    • if audit reveals missing resources, create them
    • if audit reveals unknown resources, suggest deletion or new configs to add
  • automated deployment on PR merge (with reliable postsubmits / discoverable postsubmit results)
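As a rough illustration of the reconcile idea, the comparison itself can be sketched in a few lines of bash. This assumes a hypothetical format where both the configs and the audit output have already been flattened to one resource name per line; producing those lists is out of scope here. The `reconcile` function name is made up, and it only prints suggestions, it does not create or delete anything.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: both input files hold one resource name per line.

# reconcile DECLARED_FILE AUDITED_FILE
# Prints "create <name>" for resources that are declared but missing from the
# audit, and "unknown <name>" for audited resources that no config declares.
reconcile() {
  local declared="$1" audited="$2"
  local tmpdir
  tmpdir="$(mktemp -d)"
  # comm needs sorted input; LC_ALL=C keeps sort and comm in agreement.
  LC_ALL=C sort -u "${declared}" > "${tmpdir}/declared"
  LC_ALL=C sort -u "${audited}"  > "${tmpdir}/audited"
  # comm -23: lines only in the first file; comm -13: lines only in the second.
  LC_ALL=C comm -23 "${tmpdir}/declared" "${tmpdir}/audited" | sed 's/^/create /'
  LC_ALL=C comm -13 "${tmpdir}/declared" "${tmpdir}/audited" | sed 's/^/unknown /'
  rm -rf "${tmpdir}"
}
```

The output is a plain list of proposed changes, which is also one cheap way to get the "confirmation of what changes will result" mentioned above before anything is applied.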

Even though I'm not a terraform native, to me this sounds really aligned with terraform:

  • modules for organization / re-use
  • plan / apply to build trust in automation
  • terratest or something similar to enforce conventions

There might also be a middle ground where we want some common patterns described in yaml instead, e.g.

  • staging-project -> results in kubernetes.io group, manifest file, gcp project, service accounts, iam permission changes, etc.
  • public-app -> results in kubernetes.io group, aaa namespace, manifests, etc.

spiffxp avatar Jan 23 '21 18:01 spiffxp

/reopen

d'oh, forgot this critical step

@hasheddan had also discussed possibly demoing crossplane for us (slack ref: https://kubernetes.slack.com/archives/CCK68P2Q2/p1611757501019900)

spiffxp avatar Feb 08 '21 17:02 spiffxp

@spiffxp: Reopened this issue.

k8s-ci-robot avatar Feb 08 '21 17:02 k8s-ci-robot

@spiffxp preparing whenisgood as we speak :)

hasheddan avatar Feb 08 '21 17:02 hasheddan

this is going to be fun, using kubernetes to manage the kubernetes infrastructure :D

rikatz avatar Feb 08 '21 17:02 rikatz

:heart: to @ameukam for cross-linking PRs I've been contributing to refactor the bash in infra/gcp

https://github.com/kubernetes/k8s.io/pull/2188 takes a tentative step toward using YAML instead of hardcoded bash variables / arrays as the source of our configuration data

spiffxp avatar Jun 11 '21 22:06 spiffxp

An update on the current state of the bash in infra/gcp.

Over the past few months, as I've worked to reconcile inconsistencies or unmanaged resources discovered via our automated audit PRs, I've been trying to nudge the bash in a consistent direction.

The principles I've tried to follow are:

  • extract lib_foo.sh files for different GCP services, e.g. lib_iam.sh for IAM, lib_gsm.sh for Google Secret Manager
  • try for some level of consistency in function naming:
    • ensure_[removed_]{resource} for creation/deletion of resources
  • refactor ensure-foo.sh files:
    • pull everything into functions such that a main entrypoint at the bottom is responsible for kicking off execution
      • makes it easier to test specific parts of a script
      • makes it easier to reuse other functions (vs. relying on order of definitions)
    • write functions such that they can operate on a list of args (e.g. enable_services foo bar baz)
    • use arrays more often, and pass those arrays as lists of args
      • less noise, and support for comments, when doing complicated multi-line things
      • ability to dynamically set flags
  • scope the set of resources a script manages such that less-privileged-than-org-admin roles could run these scripts
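A small sketch of how several of these principles might fit together in one script. It is illustrative only: the `run` helper, the `DRY_RUN` and `ASYNC` variables, and the chosen services are assumptions made for the example and are not copied from infra/gcp. `DRY_RUN` defaults to true here so the sketch prints commands instead of calling gcloud.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the conventions above; names are illustrative.
set -o errexit -o nounset -o pipefail

# Run a command, or only print it when DRY_RUN=true (the default here).
run() {
  if [[ "${DRY_RUN:-true}" == "true" ]]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Operates on a list of args: enable_services PROJECT SERVICE...
enable_services() {
  local project="$1"; shift
  # An array keeps the multi-line command readable and allows comments.
  local args=(
    services enable
    "$@"                      # every remaining argument is a service name
    --project="${project}"
  )
  if [[ "${ASYNC:-false}" == "true" ]]; then
    args+=(--async)           # flag set dynamically
  fi
  run gcloud "${args[@]}"
}

main() {
  local services=(
    secretmanager.googleapis.com   # Secret Manager
    iam.googleapis.com             # IAM
  )
  local project
  for project in "$@"; do
    enable_services "${project}" "${services[@]}"
  done
}

# Entrypoint at the bottom: every function is defined before anything runs.
main "$@"
```

Because nothing executes until `main` is called, an individual function such as `enable_services` can be exercised on its own, and helpers can be defined in any order.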

spiffxp avatar Jun 11 '21 22:06 spiffxp

/milestone v1.23

spiffxp avatar Sep 02 '21 19:09 spiffxp

/milestone v1.24

ameukam avatar Dec 14 '21 22:12 ameukam