[Umbrella Issue] Auditing improvements
EDIT: Opting to treat this as an umbrella issue instead of placeholder to noodle on ideas
An umbrella issue to capture ideas and suggestions to improve our audit process.
Currently:
- we have a hand-written bash script that dumps information about services we care about
- we have a prowjob that runs this script periodically, as a serviceaccount that we've tried to give just enough read access to via a custom IAM role
- the results of this script are submitted via PR by the prowjob
- humans manually review the PRs (usually me)
- review comments are used to confirm expected changes, ideally with links to issue comments or PRs, e.g.
  - "This thing changed because someone ran scripts when that PR merged"
  - "This was me doing what I described in link_to_issue_comment"
- review comments and followup issues are used to ask questions, or clean up things that need to be cleaned up, e.g.
  - "Looks like a new GCP feature rolled out, we'll want to disable these"
  - "Hey @foo, did you change something manually here?"
  - "This is way too much noise, let's make our audit script ignore this"
- the job runs 4 times a day
Some problems with this:
- I just rattled the above off the top of my head. This is poorly documented, a.k.a audit/README.md needs work
- It takes too long to dump information, currently about 100m for the job to run
- The review burden is high, it's 100% manual right now
- Our audit output format (status) is not easily reconciled with our (uh, lack of) input format (spec) (ref: https://github.com/kubernetes/k8s.io/issues/516#issuecomment-766157452)
- All of the above means the feedback cycle is... too long
- We're using a completely home-rolled thing that we need to maintain ourselves, but like... two people have ever touched it
- We're not getting exhaustive dumps, so for all I know there are mysterious things lurking out there
TODO: flesh these out into issues? or just track a list here
Our audit results are not easily reconciled:
- What changes are live that aren't in source?
- ... reconciling this now requires lots of focused human review of bash, and even then I'm not sure I'd trust it
- smaller updates from audit script will help this
- Can we reduce the toil involved in updating source with previously untracked changes?
- at the moment a human needs to know which script(s) to update, and how
- if we used some other tooling (e.g. terraform, crossplane) would it make sense to try dumping in that format?
- could we recognize common cases?
- What changes are in source that aren't live?
- at the moment a human needs to figure this out
- Can we reduce the toil in making them live?
- at the moment a human needs to know which script(s) to manually run
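The two reconciliation directions above could eventually be mechanized. A toy sketch with made-up resource names, assuming both the live dump and the source-derived dump can be normalized into sorted one-resource-per-line files:

```shell
# Toy sketch (resource names are made up for illustration): normalize both
# sides into sorted one-resource-per-line files, then use comm to see drift
# in each direction.
printf 'bucket/a\nbucket/b\nsa/prow\n' | sort -u > /tmp/live.txt
printf 'bucket/b\nsa/new\nsa/prow\n'   | sort -u > /tmp/source.txt

echo "live but not in source:"
comm -23 /tmp/live.txt /tmp/source.txt    # prints bucket/a

echo "in source but not live:"
comm -13 /tmp/live.txt /tmp/source.txt    # prints sa/new
```

This only answers "what drifted", not "how to fix it", but smaller and more frequent audit updates would make both lists short enough to act on.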
We can't audit or dump everything due to IAM issues:
- instead of a bunch of pre-defined roles, can we aggregate into a custom role?
- do we even need a custom role? why not simply assign `roles/viewer` at the org level?
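Both options sketched out below; `ORG_ID`, the role id, the service account email, and the permission list are all placeholders, not what we actually use:

```shell
# Option A (sketch): a custom role aggregating just the read permissions the
# audit service account needs. The permission list here is illustrative and
# incomplete.
gcloud iam roles create audit_viewer \
  --organization=ORG_ID \
  --title="Audit Viewer" \
  --permissions=storage.buckets.get,iam.serviceAccounts.list

# Option B (sketch): skip the custom role entirely and bind the predefined
# roles/viewer at the org level.
gcloud organizations add-iam-policy-binding ORG_ID \
  --member="serviceAccount:[email protected]" \
  --role="roles/viewer"
```

Option B is less to maintain, at the cost of granting broader read access than a curated custom role would.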
Auditing dumps are too slow:
- Try using Cloud Asset Inventory instead, like one or both of:
  - `gcloud asset`
  - `gcloud resource-config bulk-export` - https://github.com/kubernetes/k8s.io/issues/1981
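For the `gcloud asset` option, a rough sketch of what an export-based dump could look like; the bucket path is a placeholder:

```shell
# Sketch: export all resource metadata for one project to GCS as
# newline-delimited JSON via Cloud Asset Inventory. The gs:// path is a
# placeholder; this runs asynchronously and requires the Cloud Asset API.
gcloud asset export \
  --project=k8s-infra-prow-build-trusted \
  --content-type=resource \
  --output-path=gs://SOME_BUCKET/audit/k8s-infra-prow-build-trusted.json
```

If this turns out to be fast, the remaining work is munging its output into something reviewable.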
Bugs with our audit script right now:
- noise reduction: ensure we don't dump etags for all resources - https://github.com/kubernetes/k8s.io/issues/2062
- missing resources: BigQuery datasets - https://github.com/kubernetes/k8s.io/issues/2029
/wg k8s-infra
/area infra/auditing
/area access
/priority important-longterm
/kind cleanup
cc @dims @thockin @cblecker @hh
I'm getting close to having our existing job run hourly. I'm not sure where we would want them to run as post-submits (on any change to k8s.io?). I assume you're wanting that so we can tie changes merged via PRs to audit updates.
Glanced at Cloud Asset Inventory while looking into something else.
Two things that disqualify ~~it~~ `gcloud asset search-all-resources` as a general-purpose solution for us, I think:
- It seems to be more about listing *that* you have assets, and less about how they are configured (e.g. services enabled on a project, release channel of a cluster, etc.)
- It doesn't support Secrets as an asset type https://cloud.google.com/asset-inventory/docs/supported-asset-types#searchable_asset_types
But, `gcloud asset search-all-iam-policies` dumps IAM policy bindings really quickly, and for most of the resources we commonly use:
```console
# a prow cluster
$ gcloud asset search-all-iam-policies --scope=projects/k8s-infra-prow-build-trusted --format="value(resource)" | cut -d/ -f4-
projects/k8s-infra-prow-build-trusted/serviceAccounts/prow-deployer@k8s-infra-prow-build-trusted.iam.gserviceaccount.com
projects/k8s-infra-prow-build-trusted/serviceAccounts/[email protected]
projects/k8s-infra-prow-build-trusted/datasets/usage_metering_prow_build_trusted
projects/k8s-infra-prow-build-trusted/serviceAccounts/prow-build-trusted@k8s-infra-prow-build-trusted.iam.gserviceaccount.com
projects/k8s-infra-prow-build-trusted

# an e2e project
$ gcloud asset search-all-iam-policies --scope=projects/k8s-infra-e2e-boskos-001 --format="value(resource)" | cut -d/ -f4-
kubernetes-staging-485128143e-asia
kubernetes-staging-485128143e-eu
kubernetes-staging-485128143e
projects/k8s-infra-e2e-boskos-00

# a staging project
$ gcloud asset search-all-iam-policies --scope=projects/k8s-staging-e2e-test-images --format="value(resource)" | cut -d/ -f4-
k8s-staging-e2e-test-images-gcb
k8s-staging-e2e-test-images
artifacts.k8s-staging-e2e-test-images.appspot.com
projects/k8s-staging-e2e-test-images
```
So if nothing else, and excluding secrets, I could see this being useful to quickly audit/reconcile IAM policies across the org.
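If the per-project results above hold up, an org-wide IAM dump could be a simple loop; the `parent.id` filter and output layout here are placeholders:

```shell
# Sketch: dump IAM policy bindings for every project under the org into
# per-project files. ORG_ID is a placeholder, and the audit/projects/ layout
# is illustrative, not necessarily what our audit script uses today.
for p in $(gcloud projects list --filter="parent.id=ORG_ID" --format="value(projectId)"); do
  mkdir -p "audit/projects/${p}"
  gcloud asset search-all-iam-policies \
    --scope="projects/${p}" \
    --format="value(resource)" > "audit/projects/${p}/iam-policies.txt"
done
```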
A next step would be to look at what sort of info is available from `gcloud asset export`.
https://github.com/kubernetes/k8s.io/issues/1981 covers exploring `gcloud alpha resource-config bulk-export`
https://github.com/kubernetes/test-infra/pull/22239 should update the audit job to only bump open PRs if there are new changes in the audit directory, which will hopefully cut down on open PRs with long trails of force-pushes that don't actually change the files that have been reviewed.
It currently takes roughly 80 minutes to perform a full audit: https://testgrid.k8s.io/wg-k8s-infra-k8sio#ci-k8sio-audit&width=20&graph-metrics=test-duration-minutes
I think we can do better.
`gcloud asset list` is a thing now, if we want to try munging the yaml / json that dumps into the same format we're currently using, or make our yaml / json match its format
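One way to try the "munge into our format" idea: flatten the JSON that `gcloud asset list --format=json` emits into stable sorted lines. Everything below runs against a hand-written fake payload, not real output, and the "assetType name" line format is just an illustration:

```shell
# Fake stand-in for `gcloud asset list --project=... --format=json` output.
cat <<'EOF' > /tmp/assets.json
[
  {"assetType": "storage.googleapis.com/Bucket", "name": "//storage.googleapis.com/some-bucket"},
  {"assetType": "iam.googleapis.com/ServiceAccount", "name": "//iam.googleapis.com/projects/p/serviceAccounts/sa@p.iam"}
]
EOF

# Normalize to sorted "assetType name" lines so diffs against the audit dir
# are stable regardless of API ordering.
python3 - <<'EOF'
import json
assets = json.load(open("/tmp/assets.json"))
for a in sorted(assets, key=lambda a: (a["assetType"], a["name"])):
    print(a["assetType"], a["name"])
EOF
```

Whether we munge toward our format or change our format to match theirs, a deterministic sort like this is what keeps the review diffs small.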
/milestone v1.23
/priority backlog

That said I think speeding this up may be less important than moving things over.
/remove-priority important-longterm
/milestone v1.24
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
/milestone clear
/milestone v1.32