gloo icon indicating copy to clipboard operation
gloo copied to clipboard

Secret Deletion validation against snapshot

Open sheidkamp opened this issue 1 year ago • 6 comments

Description

Allow the deletion of secrets despite unrelated errors/warnings.

When validating a modification to a snapshot:

  • the snapshot is cloned
  • the resource is modified or deleted in the clone
  • the modified snapshot is translated and the errors and warnings are extracted from the reports
  • if there are errors, or if there warnings and warnings are not allowed, the modification is rejected.

If the warnings or errors were present in the original snapshot, no updates are possible, unless it is an update that completely removes the blocking issues.

In the case of secret deletion, an alternative form of validation will be used when the initial snapshot validation fails:

  • validation will run against a clone of the original snapshot
  • the output of that validation will be compared to the validation of the modified snapshot
  • if the output is the same (or if there are only changes to warnings and warnings are allowed), accept the deletion.

Code changes

  • https://github.com/solo-io/gloo/blob/sheidkamp/secret-delete-narrow/projects/gateway/pkg/validation/validator.go is the center of the changes.
    • The validateProxiesAndExtensions function was extracted from validateSnapshot. This function attempts the translation of the snapshot and performs gloo/extention validation, and returns proxyReports, translated proxies, errors and warnings.
      • This function uses the new opts.collectAllErrorsAndWarnings to indicate that it should not stop validating on the first error, and it should collect and return everything.
    • The compareValidationWithoutModification function was created to re-run validation and compare the output to the previous run
  • https://github.com/solo-io/gloo/blob/sheidkamp/secret-delete-narrow/projects/gloo/pkg/utils/validation/proxy_validation.go
    • GetListenerError was extracted from GetProxyError for reuse in comparing proxy reports.

Because of this extra overhead, we allow this functionality to be turned off by setting the environment variable ENABLE_VALIDATION_AGAINST_SNAPSHOT to false. This functionality will be enabled by default in main, and disabled by default in LTS branches. The long-term plan is to deprecate the environment variable: https://github.com/solo-io/gloo/issues/9146, so we will not add a setting or explicit helm value to manage this field. gloo.deployment.customenv can be used to set the environment variable with helm.

Docs changes

  • None

Context

Users ran into this issue when attempting to delete secrets that had no relation to any resource because they had/errors warning that preventing the validating webhook from ever accepting the delete.

Interesting decisions

General Approach

There were three candidates for the overall approach to determine if a secret is in use:

  1. Traversing the snapshot to look for the secret. This has the advantage of being able to directly validate whether or not a secret is in use. In terms of implementation there were 2 possible approaches:
    1. Code the snapshot traversal logic by hand - this was undesirable, as it would require maintenance every time a secret was added/moved/changed in our protobuf definitions. Overall a "brittle" approach.
    2. Use code generation to generate the code needed to traverse the snapshot - this would be ideal, but would be time consuming to setup.
  2. Compare the output of validation with the output of the previous run, and if its the same, conclude that we did not degrade the state of the system by deleting the secret. Also two possible approaches here:
    1. Save the results of the previous validation to compare against - this has a few issues. Where to store them, as the validator will be destroyed and reinstantiated if settings change, and how to handle the first run when there are no previous validations to compare against. There is also the more subtle issue of validation short-circuiting on the first error for a proxy, which is good for performance, and bad for validation, as we need the complete accounting of issues for the comparison
    2. If the secret validation fails, rerun validation against the original snapshot and compare the output - this approach has the downside of requiring extra processing time when deleting secrets. This extra processing time comes from continuing validation after the first error is found, and the revalidation after the fact.
  3. Hook into the secrets client to track all secrets that are being used and check against that list.
    1. This approach relies on the secrets client being used appropriately/consistently, and we are not confident there will not be edge cases.

We ended up taking the approach of rerunning validation against the original snapshot and comparing the results after confirming with the client that the extra overhead was acceptable.

Comparing validation artifacts

The snapshot object consists of arrays of resources, so we can be confident that the successive validation attempts will loop over the resources in the same order.

There are 3 outputs of the validation process that we can compare:

  1. Proxies - These are the proxies generated by translation We will compare the proxies by comparing lengths of the lists generated by the two validation runs, and then if they are the same length, compare their hashes. If these match, we have the same proxies
  2. ProxyReports - These are created during gloo validation:
// The Proxy Report should contain one report for each sub-resource of the Proxy
// E.g., each listener will have a corresponding report. Within each listener report is
// a route report corresponding to each route on the listener.

These resources are stored as slices, so the corresponding reports will be in a consistent order. These are easiest to compare when there allowWarnings is false. In that case, we can use reflect.DeepEqual to check if they are identical. If allowWarnings is true, we loop over the ProxyReports and their component ListenerReports. We test the equality of the ListenerReport types, and the equality of the errors they contain. If these are equal, the comparison passes

  1. Errors - errors come from two places, ProxyReports and ResourceReports. The ProxyReports errors are retrieved by a function that loops over slices of the ListenerReports and their sub-components, and so will return the same order each time. ResourceReports are stored as a map, with InputResources as the keys to the corresponding report. InputResources are only identified by their name and namespace, which are not guaranteed to be unique. As part of this feature development, the reporter code in solo-kit was updated to always return the same errors in the same order for a given set of reports. These errors are all stored as a multierror which compiles a number of errors into one error with wrapping under the hood. Because we have ensured that errors are generated in the same manner each time, we can simply compare the resulting error strings from the validation runs to look for differences. When allowWarnings=false, warnings are treated as errors, and collected/compared along with them.

There are two special errors that may be encountered that are not a result of the snapshot itself. If Sync has not yet been run, a SyncNotYetRunError error is appended, and if the wrong number of gloo reports are created (// This was likely caused by a development error.), a GlooValidationResponseLengthError is appended. These errors both short circuit the snapshot validation process and cause it to fail, as we will not have valid validation results to use for comparison.

Dependency changes

  • updated Eris to v.0.5.4 to get As functionality
  • updated solo-kit to 0.34.2 to pull in changes to guarantee consistent ordering of errors/warnings in report validation

Testing steps

New unit tests here added here to capture the different cases identified in the use case matrix

A basic smoke test can be performed by running through the reproduction steps identified in the issue.

Outstanding testing:

  • ~~kube2e tests~~
  • manual tests for errors from proxy validation and extension validation, ~~and automated versions if feasible~~ Unit tests in place, still need manual steps
  • ~~improved unit testing around solo-kit reporter libraries.~~ UPDATED

Notes for reviewers

Solo-kit changes: https://github.com/solo-io/solo-kit/pull/550

Be sure to verify intended behavior by ...

Please proofread comments on ...

This is a complex PR and may require a huddle to discuss ...

Checklist:

  • [ ] I have performed a self-review of my own code
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have added tests that prove my fix is effective or that my feature works

BOT NOTES: resolves https://github.com/solo-io/gloo/issues/8931

sheidkamp avatar Jan 31 '24 03:01 sheidkamp

Visit the preview URL for this PR (updated for commit 62666be):

https://gloo-edge--pr9122-sheidkamp-secret-del-2wosd43c.web.app

(expires Thu, 07 Mar 2024 18:21:35 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Sign: 77c2b86e287749579b7ff9cadb81e099042ef677

github-actions[bot] avatar Jan 31 '24 04:01 github-actions[bot]

Issues linked to changelog: https://github.com/solo-io/gloo/issues/8931

solo-changelog-bot[bot] avatar Jan 31 '24 04:01 solo-changelog-bot[bot]

/kick - flaky?

sheidkamp avatar Jan 31 '24 15:01 sheidkamp

We would also need to enable / disable the feature based on a helm value - My preference is to add this as a field in the setting object so the pod will not need to be restarted for changes to take effect, or keep it as an env var and populate the deployment template accordingly

davidjumani avatar Feb 07 '24 19:02 davidjumani

Labelled as "work in progress" until solo-kit changes are merged and pulled.

sheidkamp avatar Feb 20 '24 14:02 sheidkamp

/kick doesn't seem to be an error message

sheidkamp avatar Feb 20 '24 20:02 sheidkamp