
[RFC] Implement pluggable policy authority over NATS

autodidaddict opened this issue 1 year ago • 15 comments

Summary

This is an RFC to request discussion on a feature idea: the implementation of a pluggable policy authority. The current version of the OTP wasmCloud host enforces security policy by preventing actors from communicating with capability providers for which they do not have claims, but that is the extent of the enforcement. There is no enforcement on actor-to-actor calls, and any actor or capability provider can be started.

Rationale

In managed, compliance-controlled, multi-tenant, and many other environments, operators may want the ability to enforce stricter policy on the contents of each wasmCloud host. For example, an organization may want to block one actor from communicating with certain other actors (or ban actor-to-actor calling altogether), and operators may want to enforce a policy on which actors or providers can be started.

In today's system there is also no way to immediately flag a malicious actor and block it from continuing to operate. If a policy server were consulted, then bad actors (or providers 😎) could be banned from performing any action.

Desired Implementation

In the proposed implementation, if the wasmCloud host is started with the environment variable WASMCLOUD_POLICY_TOPIC set, then additional enforcement checks will be made when requests to start actors, start providers, or invoke actors and providers occur. When this topic is set, the following payload will be sent out on the topic:

{
    "requestId": "string",
    "source": {
        "publicKey": "Mxxx or Vxxx",
        "contractId": "nullable",
        "linkName": "nullable",
        "capabilities": [ "wasmcloud:xxx", "wasmcloud:yyy" ],
        "issuer": "Axxx",
        "issuedOn": "xxxx",
        "expiresAt": 1660608232,
        "expired": false
    },
    "target": {
        "publicKey": "Mxxxx",
        "issuer": "Axxxx",
        "contractId": "nullable",
        "linkName": "default"
    },
    "action": "start_provider | start_actor | perform_invocation",
    "host": {
        "publicKey": "Nxxxx",
        "latticeId": "01234...",
        "labels": {
            "(key)": "value"
        },
        "clusterIssuers": [ "Cxxxx", "Cxxxy" ]
    }
}
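
For illustration only, here is how the request payload might map onto Rust types with serde. The type and field names below are hypothetical, chosen to mirror the JSON above; they are not part of any published wasmCloud API.

use serde::{Deserialize, Serialize};
use std::collections::HashMap;

// Hypothetical serde mapping of the policy request payload above.
#[derive(Debug, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
struct PolicyRequest {
    request_id: String,
    source: Source,
    target: Target,
    // One of "start_provider", "start_actor", or "perform_invocation"
    action: String,
    host: HostInfo,
}

#[derive(Debug, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
struct Source {
    public_key: String,
    contract_id: Option<String>,
    link_name: Option<String>,
    capabilities: Vec<String>,
    issuer: String,
    issued_on: String,
    expires_at: u32, // seconds since the Unix epoch, UTC
    expired: bool,
}

#[derive(Debug, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
struct Target {
    public_key: String,
    issuer: String,
    contract_id: Option<String>,
    link_name: Option<String>,
}

#[derive(Debug, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
struct HostInfo {
    public_key: String,
    lattice_id: String,
    labels: HashMap<String, String>,
    cluster_issuers: Vec<String>,
}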

In response, the host will expect a result that has the following shape:

{
    "permitted": "true | false",
    "message": "error ipsum",
    "requestId": "string"
}

This response indicates whether the requested action is permitted or not. If the action is not permitted, the message field may optionally be supplied to provide contextual information, suitable for log emission, explaining the policy failure. The requestId field is returned to match the response to the corresponding request as a convenience.
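
As a minimal sketch of the request/reply flow, a standalone policy service could subscribe on the policy topic and answer with the shape above. This sketch assumes the synchronous nats and serde_json Rust crates; the topic name wasmcloud.policy and the toy deny-if-expired rule are illustrative assumptions, not part of this RFC.

use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let nc = nats::connect("127.0.0.1:4222")?;
    let sub = nc.subscribe("wasmcloud.policy")?;

    for msg in sub.messages() {
        let req: Value = serde_json::from_slice(&msg.data)?;
        // Toy rule: deny sources whose claims have expired, permit the rest.
        let permitted = !req["source"]["expired"].as_bool().unwrap_or(false);
        let resp = json!({
            "permitted": permitted,
            "message": if permitted { "" } else { "source claims are expired" },
            "requestId": req["requestId"],
        });
        // Reply on the NATS reply-to subject of the request.
        msg.respond(serde_json::to_vec(&resp)?)?;
    }
    Ok(())
}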

To keep from bogging the system down by consulting the policy authority prior to every single call, the host will only perform the consultation once for a given action/source/target combination. Once a result is obtained, it will be cached. If a policy changes, the policy authority can publish on the topic indicated by the WASMCLOUD_POLICY_CHANGES_TOPIC environment variable to notify hosts when to invalidate portions of the cache.
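
A host-side cache keyed on that combination might look like the following sketch (all names are hypothetical):

use std::collections::HashMap;

// Hypothetical cache: at most one decision per action/source/target combination.
#[derive(Clone, Hash, PartialEq, Eq)]
struct PolicyKey {
    action: String,
    source_public_key: String,
    target_public_key: String,
}

struct Decision {
    request_id: String,
    permitted: bool,
}

#[derive(Default)]
struct PolicyCache {
    decisions: HashMap<PolicyKey, Decision>,
}

impl PolicyCache {
    /// Cached verdict, or None if the policy authority must be consulted.
    fn lookup(&self, key: &PolicyKey) -> Option<bool> {
        self.decisions.get(key).map(|d| d.permitted)
    }

    fn store(&mut self, key: PolicyKey, decision: Decision) {
        self.decisions.insert(key, decision);
    }
}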

The authority can re-evaluate all of its previous policy decisions when the policy itself changes. As a result, it can publish a list of the following JSON structures, allowing hosts to either modify their caches or purge the appropriate entries and lazily re-evaluate:

{
    "requestId": "xxxx",
    "permitted": "true | false",
    "message": "...."
}
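
Continuing the hypothetical cache sketch above, a host receiving one of these change messages could apply either strategy:

impl PolicyCache {
    /// Strategy 1: overwrite the cached verdict with the re-evaluated one.
    fn update(&mut self, request_id: &str, permitted: bool) {
        if let Some(d) = self
            .decisions
            .values_mut()
            .find(|d| d.request_id == request_id)
        {
            d.permitted = permitted;
        }
    }

    /// Strategy 2: purge the entry and lazily re-consult on the next request.
    fn purge(&mut self, request_id: &str) {
        self.decisions.retain(|_, d| d.request_id != request_id);
    }
}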

Additional Notes

One of the big benefits of this approach is that we could actually still use OPA; we would just provide a proxy that listens on the indicated NATS topics to enforce policy checks. Further, any wasmCloud actor could be written to utilize the wasmcloud:messaging contract and NATS provider to build a policy evaluation service. You could store policy data in a blob store or a key-value store, and keep the evaluation logic in the actor itself.
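
As a sketch of that proxy idea, assuming the synchronous nats and ureq Rust crates, OPA's standard POST /v1/data/<path> API, and an illustrative policy path of wasmcloud/allow:

use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let nc = nats::connect("127.0.0.1:4222")?;
    let sub = nc.subscribe("wasmcloud.policy")?;

    for msg in sub.messages() {
        let req: Value = serde_json::from_slice(&msg.data)?;
        // Forward the policy request to OPA as its "input" document.
        let opa: Value = ureq::post("http://127.0.0.1:8181/v1/data/wasmcloud/allow")
            .send_json(json!({ "input": req }))?
            .into_json()?;
        let permitted = opa["result"].as_bool().unwrap_or(false);
        let resp = json!({
            "permitted": permitted,
            "message": if permitted { "" } else { "denied by OPA policy" },
            "requestId": req["requestId"],
        });
        msg.respond(serde_json::to_vec(&resp)?)?;
    }
    Ok(())
}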

Rejected Alternatives

One of our original ideas was to simply accept an OPA URL and ship the policy evaluation off to an open policy agent service. There were a number of subtle problems with this. The first, of course, is that not all wasmCloud users are also OPA users, and so this would potentially alienate a group of users.

Secondly, fixed HTTP URLs dramatically hinder portability in a broadly distributed system like wasmCloud's lattices. The HTTP URL would need to be configured consistently across all hosts, or whatever scheduler starts each host would have to provide a one-off URL for that host to use.

autodidaddict avatar Aug 01 '22 19:08 autodidaddict

A couple thoughts:

  1. What would an example or a recommended first implementation for developers of this look like? Is it an embedded service within the wasmCloud host that can evaluate policy (like OPA), a standalone service you have to run alongside wasmCloud (a sidecar), or something else?
  2. Will the policy evaluation messages reuse the :lattice_nats or :control_nats connections? I'm concerned that if an actor can subscribe to messages, a "bad actor" could simply listen on that topic and respond with authorized to each policy query and then submit commands of its own. (Side note, will this infinitely recur for the first invocation of a policy engine app? Actor is invoked when receiving a message, host queries actor to see if it's allowed to receive a message, actor receives a message 🐢 )
  3. If the request to the policy engine fails, do we fail open or fail closed? Failing open is a security hole; failing closed can result in an unusable system if not done properly, so I'm curious about thoughts here. I would think that failing closed better exemplifies our deny-by-default principles, and we can provide enough logs and the ability to SIGHUP config so that an initial failure to query a policy engine doesn't bork a host.

brooksmtownsend avatar Aug 01 '22 19:08 brooksmtownsend

  1. This is entirely up to the developer. I would say the easiest approach would be to create an actor, as mentioned above, that just listens on a specific topic for policy requests.
  2. I'd prefer control NATS for the reasons you mention.
     2a. Presumably you wouldn't be enforcing policy checks on the wasmCloud host that houses the policy enforcement actor, specifically so you don't create the infinibad loop.
  3. Definitely worth chewing on, but I'd assume that failure would switch behavior back to the default way the wasmCloud host works: enforce capability claims and allow actor-to-actor calls. This would be "no less secure than a regular wasmCloud host".

autodidaddict avatar Aug 01 '22 20:08 autodidaddict

re: 3 - we'd probably also want to implement some retry so the host attempts to get back into a state where it's evaluating policy. Also, if policy answers are being cached, the blast radius of a loss of availability of the policy service is pretty limited.

autodidaddict avatar Aug 01 '22 20:08 autodidaddict

Re: re: 2 We should make sure we document that you shouldn't run the policy actor (if you are using one) on the lattice nats when we create this

As for 3, I also think failing open is the proper policy here, as @autodidaddict pointed out that it would be the same as a default wasmCloud host (things are still signed and verified). As for retries, I think we should start with an exponential backoff, but we should also note in the spec that, to force turning policy back on after a failure, a service can publish to the WASMCLOUD_POLICY_CHANGES_TOPIC to force the host to try again. In addition, before we accept this RFC, we should add what the data structure should look like for the WASMCLOUD_POLICY_CHANGES_TOPIC.
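
For that backoff, a minimal capped exponential schedule could look like this sketch (the timings are arbitrary assumptions):

use std::time::Duration;

// Hypothetical backoff for re-contacting the policy service after a failure:
// 1s, 2s, 4s, ... capped at 60s.
fn retry_backoff(attempt: u32) -> Duration {
    let secs = 1u64 << attempt.min(6); // 2^attempt, capped to avoid overflow
    Duration::from_secs(secs.min(60))
}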

Otherwise, this looks like a great start. I really like that it doesn't lock people into anything and allows for easy integration with existing policy systems.

thomastaylor312 avatar Aug 02 '22 22:08 thomastaylor312

I've added some detail on the payload shape and meaning that appears on the WASMCLOUD_POLICY_CHANGES_TOPIC.

autodidaddict avatar Aug 04 '22 13:08 autodidaddict

Re: re: 2 We should make sure we document that you shouldn't run the policy actor (if you are using one) on the lattice nats when we create this

Agreed on the documentation. Though being on the lattice is fine - it just can't be in a host that has policy enforcement enabled. We could probably automatically add a host value that indicates whether policy enforcement is enabled, so that such a host can be excluded from auctions for the policy actor.

autodidaddict avatar Aug 04 '22 13:08 autodidaddict

Updated the sample above to add latticeId to the host attributes.

stevelr avatar Aug 08 '22 16:08 stevelr

If start | call is intended to be an expression, it would be easier to parse if it's a list of strings:

action = [ "start_provider" ]

or

action = [ "start_provider", "stop_provider" ]

the list would be interpreted as 'AND' - all actions must be permitted for the policy service to return true

stevelr avatar Aug 08 '22 16:08 stevelr

Only one action is ever intended to be evaluated at once. In other words, there's no expectation that the host would check multiple call types in a single request. Also, we don't intend to do policy checking on stop. However, being explicit between start_actor and start_provider, or call_provider and call_actor, could make things clearer and more self-documenting.

autodidaddict avatar Aug 08 '22 16:08 autodidaddict

ok I wasn't sure if the | was intended to be a regex-like operator. I think all actions should be a single word with no spaces. If it's one action per call then we don't need a list for the action field.

stevelr avatar Aug 08 '22 17:08 stevelr

Nit for documentation, related to Steve's confusion:

"action": "[start_operator | start_actor | perform_invocation]",

Should be

"action": "start_operator | start_actor | perform_invocation",

I.e. drop the bracket. That makes it consistent with

"actionPermitted": "true | false",

connorsmith256 avatar Aug 08 '22 17:08 connorsmith256

Ok, I'll make that change in a bit. I couldn't figure out which syntax was clearer.

autodidaddict avatar Aug 08 '22 18:08 autodidaddict

Changed field names to be camelCase. Changed expiresInMin to two fields: expiresAt (u32, seconds since the epoch, UTC) and expired (boolean).

  • expired is needed if an actor is to evaluate it - since an actor can't directly access a clock
  • expiresAt is needed in case the service needs to forward the request somewhere else; the receiver can evaluate the time (assuming clocks between servers are in sync - or "close enough") and it will be evaluated correctly regardless of the latency involved in transmitting the query.
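
For instance, a downstream service that does have clock access could recompute the flag from expiresAt; a sketch using only the Rust standard library:

use std::time::{SystemTime, UNIX_EPOCH};

// Recompute `expired` from `expiresAt` (u32 seconds since the Unix epoch, UTC).
fn is_expired(expires_at: u32) -> bool {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system time before Unix epoch")
        .as_secs();
    now >= u64::from(expires_at)
}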

stevelr avatar Aug 16 '22 00:08 stevelr

Implemented by #442. I hesitate to close this for now or to ask for it to be converted to an ADR, only because I feel the policy API may change in the future.

Thoughts @stevelr @autodidaddict

brooksmtownsend avatar Aug 17 '22 17:08 brooksmtownsend

I think now that there are PRs that implement this, we should update the wasmcloud.dev docs accordingly to have a section on enabling policy enforcement. Then, if/when the API changes we can just update that documentation.

Per the spirit of a request for comment, we've gathered the comments and started working, so the RFC should be "done" IMHO (assuming there's a discrete task somewhere for documenting this on wasmcloud.dev)

autodidaddict avatar Aug 17 '22 18:08 autodidaddict

Closing this as we've implemented this RFC as the policy service, though we may need to amend some of the technical details here if we change that API as it's still experimental.

brooksmtownsend avatar Mar 24 '23 18:03 brooksmtownsend