rfcs rfc: Introducing SECRETs for Secure Storage of Sensitive Information

as title

Mar 18 '24 14:03 tabVersion

A step forward is to apply this secret framework to our mysql/PG connectors, which also expose secrets like username and password in plain texts. @StrikeW

Mar 19 '24 04:03 neverchanje

A step forward is to apply this secret framework to our mysql/PG connectors, which also expose secrets like username and password in plain texts. @StrikeW

IIUC, the idea of SECRETs is not limited to a specific connector in the RFC, we can apply to cdc connectors indeed.

Mar 19 '24 04:03 StrikeW

I believe several topics need to be enriched.

Syntax

What about DROP, ALTER?

What's the behavior of DROP, ALTER?

Yes, I did not mention this part because I do not want to allow altering credentials for now. It can be troublesome to apply changes, especially if we may need to distribute a cert file. For Vault backend, we just need to trigger a refresh and rebuild the async stream inside the actor, and it shall be fine. But with the meta backend, it may require a mutation to all related actors. Users may have to wait a long time to wait for the barrier to get collected. I hope to make the process in a lazy way but there is no rpc request from CN to meta.

Implementation

How will the secrets be stored? Is there any security improvement than other catalogs?

I don't think encrypting the credentials within the cluster is worthwhile since access to the database is already private enough. The MAIN purpose of this feature is to allow other team members to access some secret keys without knowing what the keys are. Of course, we can design a comprehensive secret control just like what aws does but I think it is an overkill.

When is a secret resolved? What's the data flow and control flow for using a secret for some source/connection?

It is stored in plaintext, yes, so no need to resolve it (or encrypt it with a self-hosted private key, but I do not think it can prevent leaking credentials if the meta already gets hacked). For the Vault backend, we access the credentials when building the actors, in the from_proto process.

What happens if a secret is altered while being used? How do we propagate the changes?

I have no plan supporting alter now. And I am up with the idea of "propagating the changes" because changing credentials will influence all existing clients, not only newly created ones. But we still need a promising way to propagate the changes.

Mar 19 '24 06:03 tabVersion

A step forward is to apply this secret framework to our mysql/PG connectors, which also expose secrets like username and password in plain texts. @StrikeW

IIUC, the idea of SECRETs is not limited to a specific connector in the RFC, we can apply to cdc connectors indeed.

Yes, I do not intend to make this a complicated design, just to make it a reference to a static string.

Mar 19 '24 07:03 tabVersion

I have no plan supporting alter now. And I am up with the idea of "propagating the changes" because changing credentials will influence all existing clients, not only newly created ones. But we still need a promising way to propagate the changes.

Wow. IMO this should be considered from the day one of the design. Regardless of whether it is a goal for the initial stage of implementation, it will definitely be requested in the future.

Mar 21 '24 06:03 BugenZhao

I have no plan supporting alter now. And I am up with the idea of "propagating the changes" because changing credentials will influence all existing clients, not only newly created ones. But we still need a promising way to propagate the changes.

Wow. IMO this should be considered from the day one of the design. Regardless of whether it is a goal for the initial stage of implementation, it will definitely be requested in the future.

Actually I agree with @tabVersion that we shouldn't allow altering. We may support some dynamic secret source/reference in the future to support some advanced use cases like password rotation, but altering a secret in our own storage doesn't sound good to me as it causes more trouble than convenience. Users will be sure that secrets are correct when they successfully create sources/sinks by referencing them. In that case, DROP must be supported. (I think it's a natural requirement 😄)

I'd rather separate altering secret and dynamic loading apart. And then I agree with @BugenZhao on the latter.

Mar 21 '24 07:03 arkbriar

Why using ALTER to do password rotation isn't reasonable? (like I can update a key in GHA secret)

Anyway, I agree with Bugen that either alter or rotation should be discussed early. Otherwise we might be in larger trouble in the future.

Mar 21 '24 08:03 xxchan

Why using ALTER to do password rotation isn't reasonable? (like I can update a key in GHA secret)

Altering a secret makes it implicit when it is updated in references. Dropping then recreating is more explicit though they are in fact equivalent. Users will be aware of what they are going to do and what the consequences would be like, as DROP hints that dependents will break, while ALTER doesn't.

In terms of external vaults that provides password rotation, although the way to support dynamic loading may be similar, I'd rather believe that they have enforced the users to think about what to do and what's the expected result in advance to use the password rotation feature provided, for example, making sure the old and new credentials both work during the rotation. That way we just need to do our best to load the new credentials in time.

Mar 21 '24 08:03 arkbriar

I have no plan supporting alter now. And I am up with the idea of "propagating the changes" because changing credentials will influence all existing clients, not only newly created ones. But we still need a promising way to propagate the changes.

Wow. IMO this should be considered from the day one of the design. Regardless of whether it is a goal for the initial stage of implementation, it will definitely be requested in the future.

Actually I agree with @tabVersion that we shouldn't allow altering. We may support some dynamic secret source/reference in the future to support some advanced use cases like password rotation, but altering a secret in our own storage doesn't sound good to me as it causes more trouble than convenience. Users will be sure that secrets are correct when they successfully create sources/sinks by referencing them. In that case, DROP must be supported. (I think it's a natural requirement 😄)

I'd rather separate altering secret and dynamic loading apart. And then I agree with @BugenZhao on the latter.

Yes, I have the same concern with @arkbriar about credential rotation. We've met the case requiring cert rotation in PoC, where I haven't come up with a compatible and graceful plan.

Details about the case: They have a disk mounted to all containers and change the cert periodically. They don't know the exact time to change the cert (another team does the change), but they know both cert versions gonna be valid for a while. The solution is we keep waiting until the kafka SDK cannot pass the auth and drop the actor by recovery, letting the SDK load the new cert from the fixed dir again.

For the case above, I am not sure if we really want the meta to keep track of the file. So I'd leave it static for now and not enforce secret for SSL config in properties.

@xxchan Anyway, I agree with Bugen that either alter or rotation should be discussed early. Otherwise we might be in larger trouble in the future.

As I mentioned above, the main idea of the RFC (on the impl level) is to provide an interface to load credentials from "backend". We are going to support meta-static backend and Vault backend at this stage. In this design, we always dispatch credentials with table fragments. I don't mean to say no to a more dynamic backend and the RFC does not block any approach to make the process dynamic for altering, because rebuilding the actor can fix the issue anyway.

In my point of view, we have no mature roadmap telling Rw will not encounter the case above or that we pivot to a cloud-only solution, keeping all things controllable. The open-source kernel should be compatible with as many cases as possible and it is ok to say "not yet" if we don't have a comprehensive enough sight for requests in this topic.

Mar 21 '24 09:03 tabVersion

I have no plan supporting alter now. And I am up with the idea of "propagating the changes" because changing credentials will influence all existing clients, not only newly created ones. But we still need a promising way to propagate the changes.

Wow. IMO this should be considered from the day one of the design. Regardless of whether it is a goal for the initial stage of implementation, it will definitely be requested in the future.

Yes, we all know the credentials can be changed in the future but no one can tell they want to change in which way atm.

The thing has a low priority because 1) most auth only use aksk, which belongs to a dedicated assume role and it is not gonna change once setup; 2) as mentioned above, a pause-resume config change process can serve as a cover-up plan, which can be easy to impl. No need to be spec to a new one.

Mar 21 '24 10:03 tabVersion

I also agree that a comprehensive implementation plan might not be needed too early, but some high level ideas might be helpful to demonstrate the flexibility of the current design.

But to clarify,

As I mentioned above, the main idea of the RFC (on the impl level) is to provide an interface to load credentials from "backend".

Do you mean you are only discussing this interface in the RFC?

trait GetCredentials {
    fn fetch_credential(...) -> Result<Vec<u8>>
}

I'm asking because according to

For Vault backend, we just need to trigger a refresh and rebuild the async stream inside the actor, and it shall be fine. But with the meta backend, it may require a mutation to all related actors.

It seems different secret backend needs different handling at the usage site? (i.e., at the actors) And that's not in the scope of the RFC?

because rebuilding the actor can fix the issue anyway.

I guess @fuyufjh is unhappy with that 😟

Mar 21 '24 11:03 xxchan

I also agree that a comprehensive implementation plan might not be needed too early, but some high level ideas might be helpful to demonstrate the flexibility of the current design.

You can give one in a new RFC.

But to clarify,

As I mentioned above, the main idea of the RFC (on the impl level) is to provide an interface to load credentials from "backend".

Do you mean you are only discussing this interface in the RFC?

The RFC mainly talks about how secret catalog interact with users, like you summarized before https://github.com/risingwavelabs/rfcs/pull/86#discussion_r1530544675, and it is what the RFC for. It is a quite simple one. I don't take the flexibility into account because https://github.com/risingwavelabs/rfcs/pull/86#issuecomment-2011820346, I don't think we've collected enough cases to make a wonderful design and I believe a static credential can meet most of the requirement of most users. Furthermore, putting the credentials in the table fragment when creating actors won't block any future work as all stream job actors are built in this way.

I will mark further discussion on flexibility Off-Topic unless you give a concrete example we need to do it now or why config change cannot handle this situation.

I'm asking because according to

For Vault backend, we just need to trigger a refresh and rebuild the async stream inside the actor, and it shall be fine. But with the meta backend, it may require a mutation to all related actors.

It seems different secret backend needs different handling at the usage site? (i.e., at the actors) And that's not in the scope of the RFC?

I mention this part in https://github.com/risingwavelabs/rfcs/pull/86#issuecomment-2005953873, handling the backends in from_proto process as a hint to implementation. The meta node dispatch either the credential itself or a token to fetch from Vault, actors do the work independently. I don't think I need to write a detailed guide on how to put credentials in their place before passing to SDK. I am not talking about changing, building from proto is the common path for all actors.

To further address your question,

For Vault backend, we just need to trigger a refresh and rebuild the async stream inside the actor, and it shall be fine. But with the meta backend, it may require a mutation to all related actors.

What I mean here is we don't need to push a mutation when credentials in Vault change. (and we cannot do that because we don't know when it changes, "let is crash" is the only option) -- when we want to impl a dynamic credential

because rebuilding the actor can fix the issue anyway.

I guess @fuyufjh is unhappy with that 😟

As I mentioned above, it is a cover-up plan in case we have urgent requests for this. The happy path is still suggesting drop secret and re-create.

Mar 21 '24 12:03 tabVersion

I mention this part in #86 (comment), handling the backends in from_proto process as a hint to implementation. The meta node dispatch either the credential itself or a token to fetch from Vault, actors do the work independently.

I'm not sure if I totally get it here. Do you mean that for secrets that are stored in our meta store, the secret will be resolved (by resolve I mean to map from secret name to secret content, i.e. the password/key itself) when generating plan node?

Then how do you track the dependency from streaming jobs to secrets? If later sometime we do plan to support drop/alter, how do you (or how do our users) know which streaming jobs will be affected?

What about provide a meta API to resolve a secret when need to use it?

I'm not suggesting that we must do it in this or that way, I'm just saying that "when to resolve the secret" seems to have profound impact on future extension and user experience, which need to be discussed in the RFC. We should analyze the pros and cons of different approach instead of being vague in such a serious and official design document.

Mar 22 '24 05:03 stdrc

Just read the discussion above.

First, I do think alter secret must be considered, although it doesn't need to be implemented in the first place. However, I don't think we need to proactively trigger a "dynamic secret reloading", instead, just wait for the credential to be invalid, and then a recovery will be triggered.

Explain:

The major purpose of it is allowing users to rotate the credentials, which is a common practice for many use cases. Note that the rotation may be manual (for example, the old credential was possibly leaked or updated) or automatic (driven by TTL e.g. 180 days). In both cases, it doesn't make sense to let users recreate the source or table, but only the credential need to be updated. The most intuitive approach is obviously alter secret.

While for "dynamic secret reloading",

Secret rotation is a very rare operation, and doesn't worth us designing a new code path to handle.
It's a typical procedure to restart the dependent system when users update a credential. RisingWave just follows the convention.

In my mind, we may provide a command (https://github.com/risingwavelabs/risingwave/issues/14700) to trigger recovery (better than restart). Alternatively, users can just do nothing, and RisingWave will do recovery and recreate all actors with new secrets upon the old credential expiration.

Thus, strong +1 for @stdrc

Do you mean that for secrets that are stored in our meta store, the secret will be resolved (by resolve I mean to map from secret name to secret content, i.e. the password/key itself) when generating plan node?

Then how do you track the dependency from streaming jobs to secrets? If later sometime we do plan to support drop/alter, how do you (or how do our users) know which streaming jobs will be affected?

Resolving secret on create table/source seems unacceptable for many reasons. Regarding of my idea here, we only need to make sure to resolve the secret on its actual use and not being cached, such as when the SourceExecutor initializes.

Mar 22 '24 08:03 fuyufjh

rfcs rfcs copied to clipboard

rfc: Introducing SECRETs for Secure Storage of Sensitive Information

rfcs
rfcs copied to clipboard