etcd icon indicating copy to clipboard operation
etcd copied to clipboard

client: support Root CA rotation on server side

Open yishuT opened this issue 3 years ago • 19 comments

  • In TlsConfig creation, use GetConfigForClient instead of GetCertificate, so that we can load the latest CA.
  • Add a go routine to refresh tls config from given cert/key paths, such reload loop can be enabled through flags.
  • Change TlsInfo to pointer receiver to avoid unnecessary copy
  • Fix all TlsInfo pass by values because sync.Once cannot be copied. Provide TlsInfo.Clone to copy TlsInfo
  • Unit tests and integration tests

Fixes #11555

yishuT avatar Aug 28 '21 04:08 yishuT

Not sure about if we still need e2e tests. In the integration test, server start and client conn with new certs is already tested.

yishuT avatar Nov 01 '21 01:11 yishuT

Can you please cleanup commits? There are over 22 commits (most of them called "fix ..."), could you squash them and maybe rename them so they could be used for review?

serathius avatar Nov 02 '21 11:11 serathius

@serathius ping on this. Happy NY

yishuT avatar Jan 06 '22 02:01 yishuT

ping again.

I think we need this feature anyway as root ca rotation is one of the fundamental operations.

cc @serathius bcz you helped review this PR before. Please review it or help me find the person to review. Thanks

yishuT avatar Apr 07 '22 02:04 yishuT

Sorry for delayed response, It's in my review queue. Can you fix the conflicts and rebase the change?

serathius avatar Apr 07 '22 07:04 serathius

change to Draft until I fix the broken tests

yishuT avatar Apr 11 '22 05:04 yishuT

Codecov Report

Merging #13307 (8aeb985) into main (3dce380) will increase coverage by 0.48%. The diff coverage is 76.00%.

@@            Coverage Diff             @@
##             main   #13307      +/-   ##
==========================================
+ Coverage   72.58%   73.06%   +0.48%     
==========================================
  Files         469      469              
  Lines       38389    39178     +789     
==========================================
+ Hits        27864    28627     +763     
- Misses       8754     8757       +3     
- Partials     1771     1794      +23     
Flag Coverage Δ
all 73.06% <76.00%> (+0.48%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
etcdctl/ctlv2/command/util.go 3.93% <0.00%> (ø)
server/config/config.go 79.76% <ø> (ø)
server/etcdmain/util.go 6.00% <0.00%> (ø)
server/etcdserver/api/rafthttp/transport.go 85.71% <ø> (ø)
server/etcdserver/api/v2discovery/discovery.go 63.31% <0.00%> (ø)
server/embed/etcd.go 74.95% <40.00%> (ø)
pkg/proxy/server.go 61.08% <100.00%> (+0.06%) :arrow_up:
server/embed/config.go 73.71% <100.00%> (+0.13%) :arrow_up:
server/embed/serve.go 63.01% <100.00%> (ø)
server/etcdmain/config.go 85.36% <100.00%> (+0.10%) :arrow_up:
... and 29 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 3dce380...8aeb985. Read the comment docs.

codecov-commenter avatar Apr 17 '22 14:04 codecov-commenter

Finally make tests passed. I think the one failed is unrelated to my change

yishuT avatar Apr 17 '22 19:04 yishuT

Hey @yishuT, Great work on the server rotation and test. Would love to have this change in v3.6, however I might not have enough time to properly review this PR. I will be busy addressing the recent data inconsistency https://github.com/etcd-io/etcd/blob/main/Documentation/postmortems/v3.5-data-inconsistency.md, so it might take me couple of weeks before I get back to reviewing this one. Etcd needs to prioritize stability over new features now.

Sorry that I'm unable to this is PR attention it deserves.

serathius avatar Apr 25 '22 08:04 serathius

Hey, that's totally fine. Thanks for the review.

yishuT avatar Apr 25 '22 08:04 yishuT

@serathius qq, to test root CA rotation, I need to create clients with tls config signed by different root CAs. The current Cluster interface in test framework does not allow me to do that. Therefore, I might need to make change to the interface. Do you prefer it to be in this PR which already changes many pieces or in a separate PR?

yishuT avatar Apr 25 '22 22:04 yishuT

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 30 '22 19:07 stale[bot]

Not stale

yishuT avatar Jul 30 '22 19:07 yishuT

cc @spzala @ahrtr to help review

serathius avatar Aug 07 '22 11:08 serathius

Thanks @yishuT for the PR and sorry for the late response.

I will take a look after you resolve the conflict.

ahrtr avatar Aug 08 '22 22:08 ahrtr

thanks. Will try to get to this PR by end of next week

yishuT avatar Aug 12 '22 01:08 yishuT

@ahrtr please review

yishuT avatar Aug 31 '22 03:08 yishuT

Sorry for the late response, I am working on https://github.com/etcd-io/etcd/issues/14370 right now. This PR will be a priority when I finish that one.

I do not get time to review this PR so far, but 1K+ lines of code change seems too big to me. Please consider to break the PR if possible. Will have a closer look later.

ahrtr avatar Aug 31 '22 07:08 ahrtr

Sorry for the late response, I am working on #14370 right now. This PR will be a priority when I finish that one.

I do not get time to review this PR so far, but 1K+ lines of code change seems too big to me. Please consider to break the PR if possible. Will have a closer look later.

this PR touches the TlsInfo, so I need to make change everywhere...

yishuT avatar Aug 31 '22 20:08 yishuT

@ahrtr I have a proposal to break this PR into

  1. TlsInfo -> *TlsInfo so that it's not copied everywhere
  2. Introduce the root CA rotation logic

sg?

yishuT avatar Oct 29 '22 21:10 yishuT

@ahrtr hey, I would like to finish this PR. As discussed before, this PR is too big to review. I am going to split this pr into couple parts.

  1. Make TlsInfo -> *TlsInfo. The reason is that to reload root ca, we need to have a background goroutine.
  2. do the real root ca rotation.

wdyt?

yishuT avatar Jan 17 '23 00:01 yishuT

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 21 '23 11:05 stale[bot]

Hi,

I saw this PR has been opened for two years without significant progress. I wonder if anything I can help to move forward this PR.

From my side, I am from Azure Kubernetes Service team. We do have requirement to rotate etcd CA in zero-downtime. This PR seems to be able to solve the problem. If the community needs additional help to merge this PR, please don't hesitate to ping me.

hongbin avatar Jun 06 '23 12:06 hongbin

I can resume the work if the community can review the PR. I didn't receive any response earlier if they like my proposal. maybe @ahrtr ?

yishuT avatar Jun 06 '23 14:06 yishuT

We can split the PR into two as @ahrtr mentioned. @yishuT you need help with that?

hongbin avatar Aug 18 '23 21:08 hongbin

We can split the PR into two as @ahrtr mentioned. @yishuT you need help with that?

I don't have time to work on it recently. Feel free to take it over

yishuT avatar Aug 22 '23 17:08 yishuT

Fixes #11555

Thanks. After an analysis on this PR, I think we can split this PR in this way:

  • Provide implementation for "GetConfigForClient". This will allow server CA rotation to work in term of functionality.
  • Optimize the loading of CA files. In particular, avoid loading the CA files repeatedly on each request.

I am working on the first part: https://github.com/etcd-io/etcd/pull/16500

hongbin avatar Aug 28 '23 20:08 hongbin

@yishuT

Why did this issue get closed? The problem still exists and all of the PRs that were created are still outstanding waiting for maintainer review.

aauren avatar Mar 24 '24 19:03 aauren

@yishuT

Why did this issue get closed? The problem still exists and all of the PRs that were created are still outstanding waiting for maintainer review.

This pr is pending review for yrs. also I don't have time to work on it recently while there is another pr trying to solve the same problem so I closed this pr

yishuT avatar Mar 24 '24 19:03 yishuT

My bad, I thought this was the primary issue tracking the problem not a PR. I should have paid more attention.

Thanks for trying to make this better for all of us. I wish that it had been accepted back when you had opened it.

aauren avatar Mar 24 '24 20:03 aauren