[Umbrella Issue] Auditing improvements
EDIT: Opting to treat this as an umbrella issue instead of placeholder to noodle on ideas
An umbrella issue to capture ideas and suggestions to improve our audit process.
Currently:
- we have a hand-written bash script that dumps information about services we care about
- we have a prowjob that runs this script periodically, as a serviceaccount that we've tried to give just enough read access to via a custom IAM role
- the results of this script are submitted via PR by the prowjob
- humans manually review the PRs (usually me)
- review comments are used to confirm expected changes, ideally with links to issue comments or PRs, e.g.
  - "This thing changed because someone ran scripts when that PR merged"
  - "This was me doing what I described in link_to_issue_comment"
- review comments and followup issues are used to ask questions, or clean up things that need to be cleaned up, e.g.
  - "Looks like a new GCP feature rolled out, we'll want to disable these"
  - "Hey @foo, did you change something manually here?"
  - "This is way too much noise, let's make our audit script ignore this"
- the job runs 4 times a day
Some problems with this:
- I just rattled the above off the top of my head. This is poorly documented, a.k.a audit/README.md needs work
- It takes too long to dump information, currently about 100m for the job to run
- The review burden is high, it's 100% manual right now
- Our audit output format (status) is not easily reconciled with our (uh, lack of) input format (spec) (ref: https://github.com/kubernetes/k8s.io/issues/516#issuecomment-766157452)
- All of the above means the feedback cycle is... too long
- We're using a completely home-rolled thing that we need to maintain ourselves, but like... two people have ever touched it
- We're not getting exhaustive dumps, so for all I know there are mysterious things lurking out there
TODO: flesh these out into issues? or just track a list here
Our audit results are not easily reconciled:
- What changes are live that aren't in source?
- ... reconciling this now requires lots of focused human review of bash, and even then I'm not sure I'd trust it
- smaller updates from audit script will help this
- Can we reduce the toil involved in updating source with previously untracked changes?
- at the moment a human needs to know which script(s) to update, and how
- if we used some other tooling (e.g. terraform, crossplane) would it make sense to try dumping in that format?
- could we recognize common cases?
- What changes are in source that aren't live?
- at the moment a human needs to figure this out
- Can we reduce the toil in making them live?
- at the moment a human needs to know which script(s) to manually run
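The two reconciliation directions above could eventually be mechanized. A toy sketch with made-up resource names, assuming both the live dump and the source-derived dump can be normalized into sorted one-resource-per-line files:

```shell
# Toy sketch (resource names are made up for illustration): normalize both
# sides into sorted one-resource-per-line files, then use comm to see drift
# in each direction.
printf 'bucket/a\nbucket/b\nsa/prow\n' | sort -u > /tmp/live.txt
printf 'bucket/b\nsa/new\nsa/prow\n'   | sort -u > /tmp/source.txt

echo "live but not in source:"
comm -23 /tmp/live.txt /tmp/source.txt    # prints bucket/a

echo "in source but not live:"
comm -13 /tmp/live.txt /tmp/source.txt    # prints sa/new
```

This only answers "what drifted", not "how to fix it", but smaller and more frequent audit updates would make both lists short enough to act on.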
We can't audit or dump everything due to IAM issues:
- instead of a bunch of pre-defined roles, can we aggregate into a custom role?
- do we even need a custom role? why not simply assign `roles/viewer` at the org level?
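Both options sketched out below; `ORG_ID`, the role id, the service account email, and the permission list are all placeholders, not what we actually use:

```shell
# Option A (sketch): a custom role aggregating just the read permissions the
# audit service account needs. The permission list here is illustrative and
# incomplete.
gcloud iam roles create audit_viewer \
  --organization=ORG_ID \
  --title="Audit Viewer" \
  --permissions=storage.buckets.get,iam.serviceAccounts.list

# Option B (sketch): skip the custom role entirely and bind the predefined
# roles/viewer at the org level.
gcloud organizations add-iam-policy-binding ORG_ID \
  --member="serviceAccount:[email protected]" \
  --role="roles/viewer"
```

Option B is less to maintain, at the cost of granting broader read access than a curated custom role would.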
Auditing dumps are too slow:
- Try using Cloud Asset Inventory instead, like one or both of:
  - `gcloud asset`
  - `gcloud resource-config bulk-export` - https://github.com/kubernetes/k8s.io/issues/1981
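For the `gcloud asset` option, a rough sketch of what an export-based dump could look like; the bucket path is a placeholder:

```shell
# Sketch: export all resource metadata for one project to GCS as
# newline-delimited JSON via Cloud Asset Inventory. The gs:// path is a
# placeholder; this runs asynchronously and requires the Cloud Asset API.
gcloud asset export \
  --project=k8s-infra-prow-build-trusted \
  --content-type=resource \
  --output-path=gs://SOME_BUCKET/audit/k8s-infra-prow-build-trusted.json
```

If this turns out to be fast, the remaining work is munging its output into something reviewable.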
Bugs with our audit script right now:
- noise reduction: ensure we don't dump etags for all resources - https://github.com/kubernetes/k8s.io/issues/2062
- missing resources: BigQuery datasets - https://github.com/kubernetes/k8s.io/issues/2029
/wg k8s-infra
/area infra/auditing
/area access
/priority important-longterm
/kind cleanup
cc @dims @thockin @cblecker @hh
I'm getting close to having our existing job run hourly. I'm not sure where we would want them to run as post-submits (on any change to k8s.io?). I assume you're wanting that so we can tie changes merged via PRs to audit updates.
Glanced at Cloud Asset Inventory while looking into something else.
Two things that disqualify ~~it~~ `gcloud asset search-all-resources` as a general-purpose solution for us, I think:
- It seems to be more about listing *that* you have assets, and less about how they are configured (e.g. services enabled on a project, release channel of a cluster, etc.)
- It doesn't support Secrets as an asset type https://cloud.google.com/asset-inventory/docs/supported-asset-types#searchable_asset_types
But, `gcloud asset search-all-iam-policies` dumps IAM policy bindings really quickly, and for most of the resources we commonly use:
```console
# a prow cluster
$ gcloud asset search-all-iam-policies --scope=projects/k8s-infra-prow-build-trusted --format="value(resource)" | cut -d/ -f4-
projects/k8s-infra-prow-build-trusted/serviceAccounts/prow-deployer@k8s-infra-prow-build-trusted.iam.gserviceaccount.com
projects/k8s-infra-prow-build-trusted/serviceAccounts/[email protected]
projects/k8s-infra-prow-build-trusted/datasets/usage_metering_prow_build_trusted
projects/k8s-infra-prow-build-trusted/serviceAccounts/prow-build-trusted@k8s-infra-prow-build-trusted.iam.gserviceaccount.com
projects/k8s-infra-prow-build-trusted

# an e2e project
$ gcloud asset search-all-iam-policies --scope=projects/k8s-infra-e2e-boskos-001 --format="value(resource)" | cut -d/ -f4-
kubernetes-staging-485128143e-asia
kubernetes-staging-485128143e-eu
kubernetes-staging-485128143e
projects/k8s-infra-e2e-boskos-00

# a staging project
$ gcloud asset search-all-iam-policies --scope=projects/k8s-staging-e2e-test-images --format="value(resource)" | cut -d/ -f4-
k8s-staging-e2e-test-images-gcb
k8s-staging-e2e-test-images
artifacts.k8s-staging-e2e-test-images.appspot.com
projects/k8s-staging-e2e-test-images
```
So if nothing else, and excluding secrets, I could see this being useful to quickly audit/reconcile IAM policies across the org.
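If the per-project results above hold up, an org-wide IAM dump could be a simple loop; the `parent.id` filter and output layout here are placeholders:

```shell
# Sketch: dump IAM policy bindings for every project under the org into
# per-project files. ORG_ID is a placeholder, and the audit/projects/ layout
# is illustrative, not necessarily what our audit script uses today.
for p in $(gcloud projects list --filter="parent.id=ORG_ID" --format="value(projectId)"); do
  mkdir -p "audit/projects/${p}"
  gcloud asset search-all-iam-policies \
    --scope="projects/${p}" \
    --format="value(resource)" > "audit/projects/${p}/iam-policies.txt"
done
```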
A next step would be to look at what sort of info is available from `gcloud asset export`.
https://github.com/kubernetes/k8s.io/issues/1981 covers exploring `gcloud alpha resource-config bulk-export`
https://github.com/kubernetes/test-infra/pull/22239 should update the audit job to only bump open PRs if there are new changes in the audit directory, which will hopefully cut down on open PRs with long trails of force-pushes that don't actually change the files that have been reviewed.
It currently takes roughly 80 minutes to perform a full audit: https://testgrid.k8s.io/wg-k8s-infra-k8sio#ci-k8sio-audit&width=20&graph-metrics=test-duration-minutes
I think we can do better.
`gcloud asset list` is a thing now, if we want to try munging the yaml / json that dumps into the same format we're currently using, or make our yaml / json match its format
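One way to try the "munge into our format" idea: flatten the JSON that `gcloud asset list --format=json` emits into stable sorted lines. Everything below runs against a hand-written fake payload, not real output, and the "assetType name" line format is just an illustration:

```shell
# Fake stand-in for `gcloud asset list --project=... --format=json` output.
cat <<'EOF' > /tmp/assets.json
[
  {"assetType": "storage.googleapis.com/Bucket", "name": "//storage.googleapis.com/some-bucket"},
  {"assetType": "iam.googleapis.com/ServiceAccount", "name": "//iam.googleapis.com/projects/p/serviceAccounts/sa@p.iam"}
]
EOF

# Normalize to sorted "assetType name" lines so diffs against the audit dir
# are stable regardless of API ordering.
python3 - <<'EOF'
import json
assets = json.load(open("/tmp/assets.json"))
for a in sorted(assets, key=lambda a: (a["assetType"], a["name"])):
    print(a["assetType"], a["name"])
EOF
```

Whether we munge toward our format or change our format to match theirs, a deterministic sort like this is what keeps the review diffs small.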
/milestone v1.23
/priority backlog

That said I think speeding this up may be less important than moving things over.
/remove-priority important-longterm
/milestone v1.24
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
/milestone clear
/milestone v1.32