
Inhibit Configuration Caching

spjmurray opened this pull request 2 years ago • 27 comments

What this PR does / why we need it:

While caching is good from a performance perspective, I believe the benefits in this instance are minuscule. Caching also doesn't handle credential rotation, which any security-conscious organisation will tell you is a good thing, because it means exfiltration doesn't have a lasting impact.

Which issue this PR fixes (if applicable): fixes #2284

Special notes for reviewers:

Release note:

Prior to this release, cloud configs were cached to improve performance; however, this had the side effect of ignoring credential rotation.
The new `--cache-cloud-config` flag allows this behavior to be tailored to your environment.
The default is `true` to maintain the current behavior for backward compatibility.
You can inhibit caching with the `--cache-cloud-config=false` CLI flag.
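
As a rough illustration only: the `GetOpenStackProvider` and `cacheCloudConfig` names come up later in this thread, but everything else below is assumed, not the PR's actual code. The flag might gate the cache like so:

```go
package provider

import (
	"flag"
	"fmt"
	"os"
)

// Client stands in for the authenticated OpenStack provider client.
type Client struct{ config []byte }

var (
	// cacheCloudConfig defaults to true to preserve the old behaviour.
	cacheCloudConfig = flag.Bool("cache-cloud-config", true,
		"If false, re-read the cloud config (and re-authenticate) on every call.")

	cachedClient *Client
)

// GetOpenStackProvider returns the cached client when caching is enabled;
// otherwise it rebuilds the client from the config file, which is where
// rotated credentials would be picked up.
func GetOpenStackProvider(configPath string) (*Client, error) {
	if *cacheCloudConfig && cachedClient != nil {
		return cachedClient, nil
	}
	data, err := os.ReadFile(configPath)
	if err != nil {
		return nil, fmt.Errorf("reading cloud config: %w", err)
	}
	c := &Client{config: data} // real code would authenticate with Keystone here
	if *cacheCloudConfig {
		cachedClient = c
	}
	return c, nil
}
```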

spjmurray avatar Jun 30 '23 14:06 spjmurray

Welcome @spjmurray!

It looks like this is your first PR to kubernetes/cloud-provider-openstack 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/cloud-provider-openstack has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot avatar Jun 30 '23 14:06 k8s-ci-robot

Hi @spjmurray. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 30 '23 14:06 k8s-ci-robot

Issue is open at https://github.com/kubernetes/cloud-provider-openstack/issues/2284 if you want to discuss this at greater length.

spjmurray avatar Jun 30 '23 14:06 spjmurray

/ok-to-test

jichenjc avatar Jul 03 '23 01:07 jichenjc

/lgtm

Incidentally, the primary cost of this is going to be additional hits on Keystone, although I think it's fair to let users make this determination themselves. The way we handle this in OpenShift is with an annotation on the deployed pod based on a hash of the secret contents. This just means we create a new pod if the secret changes.
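
For anyone unfamiliar with that pattern, a minimal sketch (the annotation key and helper names here are hypothetical, not OpenShift's actual ones): hash the secret, stamp it on the pod template, and any credential change rolls out a new pod.

```go
package credhash

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// secretHash returns a stable hex digest of a Secret's contents.
func secretHash(s *corev1.Secret) string {
	keys := make([]string, 0, len(s.Data))
	for k := range s.Data {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic ordering so the digest is stable

	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write(s.Data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// annotate stamps the digest on a pod template; changing the secret changes
// the annotation, which forces the deployment to create a new pod.
func annotate(tmpl *corev1.PodTemplateSpec, secret *corev1.Secret) {
	if tmpl.Annotations == nil {
		tmpl.Annotations = map[string]string{}
	}
	tmpl.Annotations["example.io/credentials-hash"] = secretHash(secret) // hypothetical key
}
```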

mdbooth avatar Jul 03 '23 14:07 mdbooth

Do you intend to make this Cinder-only, or cover the whole of CPO (including Manila, OCCM, etc.)? Or should we merge this and then do the other updates?

jichenjc avatar Jul 04 '23 00:07 jichenjc

From my point of view this feature has to be supported at a lower level, in pkg/client/client.go. I need some time to take a deeper look into this issue.

kayrus avatar Jul 04 '23 11:07 kayrus

From my point of view this feature has to be supported at a lower level, in pkg/client/client.go. I need some time to take a deeper look into this issue.

What are you thinking here? I don't think we'd ever want to update an existing client with new credentials: you'd always create a new one with the new credentials. I'm not convinced our client wrapper needs to support this directly.

mdbooth avatar Jul 04 '23 15:07 mdbooth

What are you thinking here?

I'm thinking about rotating credentials only when the old credentials are failing. Do you have objections?

kayrus avatar Jul 04 '23 15:07 kayrus

What are you thinking here?

I'm thinking about rotating credentials only when the old credentials are failing. Do you have objections?

That behaviour would be surprising to me, as well as complex to implement. I'm interested to know whether this would be surprising to other users, or if these are just my expectations. I'm thinking that a credential rotation workflow goes something like this:

  1. Create new credentials
  2. Update credentials in cluster
  3. Ensure that everything is working correctly. Rollback if not.
  4. Delete old credentials

I think a user expects that when they update cluster credentials they will be used immediately, which means that they have validated the new credentials in step 3. If we keep using the old credentials until they fail, the user has not validated the credentials in step 3. If there's a problem they might be surprised in step 4 when everything fails and there is no longer a rollback option.

To keep things simple to implement, test, and debug, I think it's better that when we get new credentials we use them immediately, and start failing immediately if they don't work.

mdbooth avatar Jul 04 '23 16:07 mdbooth

Oh wow, I missed all this chatter; I was commenting in the issue. I did mention just now that we could do something a bit cleverer by polling the file on disk and comparing it with the version that was used to initialize the client. If it changes, we can perform an atomic update of the client. That behaves the same, but alleviates the load on Keystone. Critiques?
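
Something like this, very roughly (all names illustrative; the build callback is where the Keystone authentication would happen):

```go
package watcher

import (
	"bytes"
	"crypto/sha256"
	"os"
	"sync/atomic"
	"time"
)

// Client wraps the authenticated OpenStack client.
type Client struct{}

var current atomic.Pointer[Client]

// watch polls the config file, compares its hash with the version the current
// client was built from, and atomically swaps in a new client on change, so
// Keystone is only hit when the credentials actually rotate.
func watch(path string, interval time.Duration, build func([]byte) (*Client, error)) {
	var lastSum []byte
	for range time.Tick(interval) {
		data, err := os.ReadFile(path)
		if err != nil {
			continue // transient read error: keep the existing client
		}
		sum := sha256.Sum256(data)
		if bytes.Equal(sum[:], lastSum) {
			continue // unchanged: no Keystone round-trip needed
		}
		c, err := build(data) // authenticate with the new credentials
		if err != nil {
			continue // config didn't build: retry on the next tick
		}
		current.Store(c) // atomic swap: readers never see a half-built client
		lastSum = sum[:]
	}
}
```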

spjmurray avatar Jul 05 '23 07:07 spjmurray

I've spent a while optimizing the auth logic from both the code and the API performance perspectives. Previously, each CPO component had its own auth logic, and the code duplication was insane. This PR doesn't look good because:

  • In most cases the OCCM and other repo components share the same credentials. Implementing a single change only in the Cinder CSI will cause inconsistencies between CPO components.
  • The GetOpenStackProvider func is called only once, in cmd/cinder-csi-plugin/main.go, and it's not looped, so the cacheCloudConfig global var doesn't make any sense.
  • Credentials rotation must be done in a different way and the logic must consider all CPO components.

I'd close this PR and move the discussion on how to achieve proper credential rotation into an issue.

kayrus avatar Jul 05 '23 08:07 kayrus

/hold

kayrus avatar Jul 05 '23 08:07 kayrus

New changes are detected. LGTM label has been removed.

k8s-ci-robot avatar Jul 13 '23 10:07 k8s-ci-robot

Yeah @mdbooth, as mentioned, it's just a PoC. I can see there are definitely efforts to coalesce the OS client code into a canonical blob, so I was hoping the problem would go away eventually. If I get some free cycles I may have a look at whether the CSI code can share what's been done with Manila et al. CSI seems to be in its own little world at present.

spjmurray avatar Jul 18 '23 10:07 spjmurray

Okay @mdbooth, down the rabbit hole we go... 😄 Here's what I came up with in an attempt to modularize the hot-reload stuff. There are two levels: the easy one, then the hard one. I hit a bit of a roadblock, as you can see in the comments, with the ingress controller and the main thing driven by k8s.io/cloud-provider. But you can get the gist.

spjmurray avatar Jul 18 '23 15:07 spjmurray

I'm really hoping @kayrus can provide some guidance here.

mdbooth avatar Jul 18 '23 16:07 mdbooth

@mdbooth I'll check the PR later. Thanks for your patience.

kayrus avatar Jul 18 '23 17:07 kayrus

Ostensibly, I believe we want to conceptualize this as how a typical controller-runtime app works:

  • The filewatcher stuff acts as an API watch call
  • You can either
    • Read the config each time from the cache and create a new provider/client. As we've discussed, that's pretty crap, because you need to do the token-exchange and authenticate/get-the-catalog dance with Keystone on every read. It does have the benefit that anything relying on the config will get updates.
    • Cache clients as we do now to avoid the authentication dance, but use a "shared informer", a.k.a. the subscription stuff, to push out whatever updates we can (see the sketch below).
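
To make the second option concrete, here's a rough sketch of that subscription pattern. `CloudConfigFactory` is the name used later in this thread, but everything else here is illustrative:

```go
package cloudconfig

import "sync"

// CloudConfigFactory fans config changes out to subscribers, informer-style:
// the file watcher calls notify, and each CPO component registers a callback
// that rebuilds or updates its own client.
type CloudConfigFactory struct {
	mu   sync.Mutex
	subs []func(newConfig []byte)
}

// Subscribe registers a callback to run whenever the cloud config changes.
func (f *CloudConfigFactory) Subscribe(fn func([]byte)) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.subs = append(f.subs, fn)
}

// notify is invoked by the file watcher with the new raw config contents.
func (f *CloudConfigFactory) notify(newConfig []byte) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for _, fn := range f.subs {
		fn(newConfig)
	}
}
```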

spjmurray avatar Jul 19 '23 08:07 spjmurray

I apologize for the delay. I'll try to take a look at this PR this week. Thanks for your patience.

kayrus avatar Aug 02 '23 09:08 kayrus

No rush! I did have a thought, though, about k8s.io/cloud-provider. Is that just legacy stuff from when it lived in-tree, and thus could be removed?

spjmurray avatar Aug 03 '23 13:08 spjmurray

k8s.io/cloud-provider is a common package for all the other cloud providers, primarily Go interfaces.

kayrus avatar Aug 05 '23 15:08 kayrus

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign kayrus for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Aug 22 '23 10:08 k8s-ci-robot

Okay @kayrus I think we're starting to zero in on something good... let me know what you think!

spjmurray avatar Aug 22 '23 10:08 spjmurray

Of note: I've not committed it yet, but I found a solution for getting the OCCM to auto-rotate credentials. It seems we register a cloud provider, then immediately call InitCloudProvider with a *File... from there, in the callback, we can type-assert the io.Reader back to a file, extract its Name(), and set up the polling file watcher. But that's a pretty huge change, as it affects a lot of the testing jazz, so it would probably be better as a follow-up. Like this a lot:

```
commit 2346c93af6e44f5de911861a527a05a08c25961c (HEAD -> occm_cloudconfig_port)
Author: Simon Murray <[email protected]>
Date:   Tue Aug 22 15:00:18 2023 +0100

    Port OCCM to use CloudConfigFactory

 pkg/openstack/openstack.go      | 152 ++++++++++++++++-----------------
 pkg/openstack/openstack_test.go | 295 ++++++++++++++++-------------------------------------------------
 pkg/openstack/routes_test.go    |  19 +++--
 3 files changed, 163 insertions(+), 303 deletions(-)
```
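
The type assertion described above might look something like this (illustrative only, assuming the cloud-provider callback hands us the --cloud-config file as an io.Reader):

```go
package occm

import (
	"io"
	"os"
)

// configPathFromReader recovers the config file's path so a polling file
// watcher can be attached. InitCloudProvider opens --cloud-config and passes
// it in as an io.Reader; in production that is really an *os.File.
func configPathFromReader(r io.Reader) (string, bool) {
	if f, ok := r.(*os.File); ok {
		return f.Name(), true
	}
	return "", false // e.g. a bytes.Reader in tests: nothing to watch
}
```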

spjmurray avatar Aug 23 '23 09:08 spjmurray

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 21 '24 04:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 20 '24 04:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Mar 21 '24 05:03 k8s-triage-robot

@k8s-triage-robot: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 21 '24 05:03 k8s-ci-robot