Support Fail Open/Close Policy with Gatekeeper

Open akashsinghal opened this issue 1 year ago • 0 comments

What would you like to be added?

Ratify should be configurable to align with Gatekeeper's fail open/close strategy.

Fail open behavior: Gatekeeper fails open by default, so a failure during webhook processing does not block resource creation. For mutation, the default behavior of the Assign resource is fail close. Furthermore, external data system errors (a timeout from Ratify or an inability to reach Ratify) are not considered webhook failures, so it is left to policy evaluation to decide how to handle them. Currently, the Ratify sample policy treats system errors as violations, so the constraint's enforcement action of deny is applied to them. This does not align with Gatekeeper's default fail open behavior.

Action Items:

  • Ratify must introduce a fail open/close flag. This flag will override the default behavior of the mutating Assign resource.
  • Ratify should remove the system_error policy block from the existing library templates and instead introduce a second constraint template that is solely responsible for handling the system_error case. For default fail open scenarios, the second constraint template will have an enforcement action of warn; for overridden fail close scenarios, the enforcement action will be deny.
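
The intended outcome of the two action items can be sketched as a small decision function. This is only an illustration of the proposal, not Ratify or Gatekeeper code: `FailurePolicy`, `system_error_action`, and `evaluate` are hypothetical names, and the sketch assumes that fail open maps to the `warn` enforcement action and fail close maps to `deny`.

```python
from enum import Enum
from typing import List, Tuple


class FailurePolicy(Enum):
    """Hypothetical representation of the proposed fail open/close flag."""
    OPEN = "open"    # Gatekeeper's default
    CLOSE = "close"  # operator override


def system_error_action(policy: FailurePolicy) -> str:
    """Enforcement action assumed for the second (system-error) constraint."""
    return "deny" if policy is FailurePolicy.CLOSE else "warn"


def evaluate(policy: FailurePolicy, verified: bool,
             system_error: str = "") -> Tuple[bool, List[str]]:
    """Return (admitted, messages) for one resource under both constraints."""
    messages: List[str] = []
    admitted = True

    if system_error:
        # Second template: always surfaces the error, blocks only on fail close.
        action = system_error_action(policy)
        messages.append(f"{action}: system error from provider: {system_error}")
        if action == "deny":
            admitted = False
    elif not verified:
        # First template: verification failures are always denied.
        messages.append("deny: artifact failed verification")
        admitted = False

    return admitted, messages
```

Under these assumptions, a provider timeout with fail open admits the resource but still produces a warning message, while the same timeout with fail close rejects it.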

More detailed explanation: There is a fundamental limitation with GK constraint templates: we cannot selectively apply multiple enforcement actions within a single constraint template. Ideally, we would keep the existing constraint template but allow the system_error policy block to result in warn (when fail open). This isn't possible currently, and there's a tracking issue on GK discussing it. GK maintainers recommended we go down the second Constraint Template route. Now, if we add a second CT, one could argue that we should only apply it when the failure policy is set to fail close. However, even with fail open we want to at least surface system errors from the external provider. Otherwise, how would the user even know that the external provider is failing? This is why the second CT is always required.

Anything else you would like to add?

There are some potential considerations for adding a second CT to solve this issue:

  1. We've now doubled the number of constraints Ratify uses on the cluster. This will theoretically double the audit processing time per interval.
  2. Every resource will result in 2 identical external data provider requests. Theoretically, these should be sent in close succession, which means Ratify's HTTP server cache should respond to the second one without doing any actual verification, but there's no guarantee of this. We'd need to perform extra perf analysis to confirm.
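
To make the second consideration concrete, here is a minimal sketch of a time-bounded result cache. It is not Ratify's actual HTTP server cache; `VerificationCache`, its `verify` callback, and the TTL value are assumptions chosen only to show why a second identical request normally skips verification.

```python
import time
from typing import Callable, Dict, Tuple


class VerificationCache:
    """Hypothetical TTL cache keyed by image reference."""

    def __init__(self, verify: Callable[[str], bool], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._verify = verify
        self._ttl = ttl_seconds
        self._clock = clock
        # image reference -> (time stored, verification result)
        self._entries: Dict[str, Tuple[float, bool]] = {}

    def get(self, image: str) -> bool:
        now = self._clock()
        entry = self._entries.get(image)
        if entry is not None and now - entry[0] < self._ttl:
            # Fresh entry: answer without running verification again.
            return entry[1]
        # Missing or expired entry: verify and remember the result.
        result = self._verify(image)
        self._entries[image] = (now, result)
        return result
```

In this sketch the second request is served from the cache only if it arrives after the first result has been stored and before the entry expires. Two requests that arrive concurrently could both miss, which is one reason the perf analysis would still be needed.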

Are you willing to submit PRs to contribute to this feature?

  • [X] Yes, I am willing to implement it.

akashsinghal, Apr 25 '23 17:04