etcd icon indicating copy to clipboard operation
etcd copied to clipboard

RFC: StoreV2 deprecation plan

Open ptabor opened this issue 3 years ago • 9 comments

[The content of this post is being edited, and is not yet approved plan of record ]

Background

In 3.4 storeV2 is still extensively used:

  • User can opt in --enable-v2 to unable V2 API that writes solely data to storeV2 files.

  • Membership information is both stored in V2 & V3 (backend) stores

  • Membership information is being read (recovered) from storeV2

  • Publishing membership happens through StoreV2 raft operation.

    • The new ClusterMemberAttrSet applier was implemented in etcd 3.5 so cannot be used by default in 3.5: to allow 3.4->3.5 rollback.
  • During startup etcd assumes each WAL log snapshot is accompanied by storev2 snapshot.

  • Note: etcdctl snapshot restore is not restoring V2 content (producing fake storeV2 with membership information)

Plan:

3.5 release

API:

  • [x] --experimental-enable-v2v3= promoted to --enable-v2v3=... or deprecated. [see https://github.com/etcd-io/etcd/issues/12905]
    • [x] --enable-v2v3=... should allow mounting as root ?
  • [x] --enable-v2 prints deprecation warning [optional: when not combined with --enable-v2v3=... ]

Publish

  • [x] ClusterMemberAttrSet is implemented.

Membership:

  • [x] Membership is only read from backend v2 (https://github.com/etcd-io/etcd/pull/12914)

Etcd startup:

  • [x] --experimental-no-v2=PHASE_1_WRITE_ONLY mode that:
    • cannot be used when --enable-v2 is ON
    • if the storev2 snapshot is found accompanying the (used to be used) WAL log snapshot, its validated whether it has no used-data (apart of membership & version)
    • storev2 is content created on demand using 'etcdctl snapshot restore' logic based on the last WAL snapshot

3.6 release

API

  • [x] --enable-v2 is docomissioned or FAILs if not accompanied by --enable-v2v3=...

Publish:

  • [x] PublishV3 is used solely

Etcd startup:

  • [x] experimental-no PHASE_1_WRITE_ONLY becomes the default.
  • [ ] --experimental-no-v2=PHASE_2_DECOMISSIONED opt in that stops-writing the storeV2 files at all.

3.7 release

  • [ ] PHASE_2_DECOMISSIONED is the only supported mode.

ptabor avatar Apr 30 '21 12:04 ptabor

@ptabor

Thanks for helping with this! (I wanted to have a clean separation between v2 and v3 a couple of years ago but never got time actually to do it).

I am not sure if we are still running the failure injection tests. I would suggest running them to ensure the correctness of this change. Also, we might want more tests on the membership changes and the upgrade path.

/cc @gyuho

xiang90 avatar May 05 '21 18:05 xiang90

@xiang90 The functional tests (https://travis-ci.com/github/etcd-io/etcd/jobs/502908317) looks to me to be a failure injection tests in practice. Do you mean these tests, or there are other tests ?

ptabor avatar May 05 '21 20:05 ptabor

@ptabor

The functional test randomly injects failures into the running servers. Due to the randomness, it might not be super effective to catch issues if it only runs one round in the CI.

What we did previously was to run this test 24*7 for weeks on a set of servers before major releases. I am not sure if it is still the practice since we have not really introduced significant changes to etcd in the recent releases I believe.

xiang90 avatar May 05 '21 20:05 xiang90

@ptabor This looks great.

Regarding

--experimental-enable-v2v3= promoted to --enable-v2v3=... or deprecated

as we discussed, let's mark it as deprecated for 3.5 and remove for 3.6.

Then, there's no need for

--enable-v2 is docomissioned or FAILs if not accompanied by --enable-v2v3=...

?

gyuho avatar May 08 '21 04:05 gyuho

I am not sure if we are still running the failure injection tests.

We do run functional tests with failure injection in the CI but haven't implemented member add/remove

ref.

https://github.com/etcd-io/etcd/blob/6decbe15db13f025068158ae1fb410d7a7dde5fb/tests/functional/functional.yaml#L174-L195

gyuho avatar May 08 '21 04:05 gyuho

I hope/think mandatory steps for v3.5 are done.

ptabor avatar May 14 '21 04:05 ptabor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 12 '21 05:08 stale[bot]

I can take a look deprecating V2

serathius avatar Jan 13 '22 13:01 serathius

Not sure if this is the right place to ask, but when starting the logs say

detected custom v2store content. Etcd v3.5 is the last version allowing to access it using API v2. 
**Please remove the content.**

But I couldn't find how to do that. I removed all keys from the v2 store, but apparently there is still something stored in it. How do I remove all v2 content?

der-eismann avatar Aug 17 '22 11:08 der-eismann

Updated the top comment with the plan based on my discussion with @ptabor and work on deprecation. One difference is that I'm proposing to provide v3.6->v3.5 backward compatibility by using V3 state to generate V2 snapshots. This of cause needs proper testing, I have implemented some tests for this https://github.com/etcd-io/etcd/blob/main/tests/e2e/v2store_deprecation_test.go, but would be good to double check.

If someone is interested in helping, please let me know.

serathius avatar May 02 '23 18:05 serathius

I am interested in working on this issue. Investigating changes.

geetasg avatar May 02 '23 18:05 geetasg

provide v3.6->v3.5 backward compatibility by using V3 state to generate V2 snapshots

Please clarify this to avoid any confusion or misunderstanding. Are the "V2 snapshots" the *.snap files under the ${ETCD_DATA_DIR}/member/snap directory?

ahrtr avatar May 03 '23 02:05 ahrtr

provide v3.6->v3.5 backward compatibility by using V3 state to generate V2 snapshots

Please clarify this to avoid any confusion or misunderstanding. Are the "V2 snapshots" the *.snap files under the ${ETCD_DATA_DIR}/member/snap directory?

Yes, V2 snapshots are the *.snap files.

serathius avatar May 03 '23 08:05 serathius

provide v3.6->v3.5 backward compatibility by using V3 state to generate V2 snapshots

It is still not clear what does it exactly mean.

ahrtr avatar May 03 '23 08:05 ahrtr

Replace code that generates V2 snapshots, and modify it so it doesn't use v2 store, but v3 store.

serathius avatar May 03 '23 09:05 serathius

Replace code that generates V2 snapshots, and modify it so it doesn't use v2 store, but v3 store.

It's a big change. Have we considered completely removing the v2 snapshot files as well in 3.7? If yes, then the other option is to keep it as it's in 3.6, and remove both v2store and v2 snapshot files in 3.7. We can get all information form bbolt db, why do we need separate v2 snapshot files?

I am not saying which approach we must follow. Instead, I mean it's a big change, we can't simply to make change just based on one comment. Please evaluate the impact of all possible approaches and related effort.

ahrtr avatar May 03 '23 22:05 ahrtr

It's a big change. Have we considered completely removing the v2 snapshot files as well in 3.7?

It's a big change in sense how much it would simplify apply logic and number of lines of code could remove. However, it's not big change in sense of being hard to implement. Compared to most of the code/features, backward compatibility is much easier to test. Please check https://github.com/etcd-io/etcd/pull/13756

We can get all information form bbolt db, why do we need separate v2 snapshot files?

Backward compatibility with v3.5.

we can't simply to make change just based on one comment. Please evaluate the impact of all possible approaches and related effort.

It was not based on one comment. I have been working on removal of v2store for over a year. Sorry that not everything was documented back then, but I it was discussed between me and @ptabor a long time ago. Removing v2store early should have huge impact on simplifying the codebase and increasing project velocity. I think it's worth to do more for v3.6 release vs wait another 2 years for v3.7.

serathius avatar May 04 '23 07:05 serathius

It was not based on one comment. I have been working on removal of v2store for over a year.

I completely agree with the direction of removing v2store. I am just concerned about how was the decision "using V3 state to generate V2 snapshots" made. I do not see any detailed discussion or doc on that. Also we might also need to consider to completely remove v2 snapshots in future, but I also do not see any such discussion or items in the Plan.

ahrtr avatar May 04 '23 08:05 ahrtr

Adding my old attempt to implement this https://github.com/etcd-io/etcd/commit/2ddb174bc35d829a798310a27fc4ec8923a506bf

serathius avatar May 11 '23 17:05 serathius

Link to my current attempt - https://github.com/etcd-io/etcd/pull/16000 This borrows from https://github.com/etcd-io/etcd/commit/2ddb174bc35d829a798310a27fc4ec8923a506bf and also has code to export v2 snapshots for backward compatibility

geetasg avatar Jun 01 '23 17:06 geetasg

Related doc (for 3.6) - https://docs.google.com/document/d/19_7xBwZBNrM-75h5rr2PUWFM_eNjSXRVPZZpVaTS2qw/edit?usp=sharing

geetasg avatar Jun 13 '23 18:06 geetasg

The PRs will be roughly in following order -

  • [x] Update/add tests https://github.com/etcd-io/etcd/pull/16084 https://github.com/etcd-io/etcd/pull/16074 https://github.com/etcd-io/etcd/pull/16067 https://github.com/etcd-io/etcd/pull/16041

  • [x] Introduce method and its consumers to publish v2 snapshot from v3 state.Ref:https://github.com/etcd-io/etcd/pull/16000/files#diff-09b9c924aaa34d5ad5abc7e40ed10a2dcc5257c62fda537a8671a7f206d03e6aR831 https://github.com/etcd-io/etcd/pull/16100 https://github.com/etcd-io/etcd/pull/16132 https://github.com/etcd-io/etcd/pull/16418 https://github.com/etcd-io/etcd/pull/16376 https://github.com/etcd-io/etcd/pull/16441 https://github.com/etcd-io/etcd/pull/16455 https://github.com/etcd-io/etcd/pull/16460

  • [ ] Stop / remove use of SetStore method of RaftCluster in bootstrap. There should not be any caller of RaftCluster SetStore at this point. WIP - https://github.com/etcd-io/etcd/pull/16084 WIP - https://github.com/etcd-io/etcd/pull/16470

  • [ ] Remove v2 store from RaftCluster

  • [ ] Introduce applier_v2v3 and tests

  • [ ] Remove v2store from EtcdServer

  • [ ] Update tests dependent on v2 applier

  • [ ] Remove v2 applier

  • [ ] Update --experimental-no-v2=PHASE_1_WRITE_ONLY to write_only_drop_data

  • [ ] Update docs about write_only_drop_data

geetasg avatar Jun 15 '23 17:06 geetasg

Related doc (for 3.6) - https://docs.google.com/document/d/19_7xBwZBNrM-75h5rr2PUWFM_eNjSXRVPZZpVaTS2qw/edit?usp=sharing

Nice summary. @geetasg

The PRs will be roughly in following order -

The detailed plan seems not in sync with the summary above, and isn't clear to me either.

I think roughly what we need to do in 3.6 are (on top of both @serathius 's summary and your summary in the above doc) :

  • V2 snapshots are generated from V3 data I think this is the most important left task in 3.6. Please provide a detailed plan and steps or draft PR. Let me know if you need assistance.
  • V2 apply code is removed Just as you mentioned in the doc, we can NOT directly remove V2 apply code in 3.6 because we still need to handle 3.5's publish request in mix-version scenario. But we can simplify the V2 apply code.
  • V2 no longer stores membership data To be clearer, we still need to store the membership data in the snapshot files (.snap), we just do not need the data in v2store in memory, because the data will come from the bbolt db.
  • Remove the etcdutl backup? It has some code related to v2store. I don't understand why do we need such command. It has already been deprecated in 3.5 and users are recommended to use etcdctl snapshot save command. So can we just remove it in 3.6? At the latest, we should remove it in 3.7. WDYT? @ptabor @serathius

ahrtr avatar Jul 24 '23 14:07 ahrtr

@ahrtr Thank you for the review. Plan for "V2 snapshots are generated from V3 data" is listed below - 1.Provide a method to export membership in v2 format - Ref:https://github.com/etcd-io/etcd/pull/16132 2.Consume the method implemented in above step from triggerSnapshot workflow as in poc here - https://github.com/geetasg/etcd/blob/poc1/server/etcdserver/server.go#L2055 3.Similarly consume the same method from etcdutl backup as shown here - https://github.com/geetasg/etcd/blob/poc1/etcdutl/etcdutl/backup_command.go#L184 4.Consume the method from etcdutl snapshot Ref:https://github.com/geetasg/etcd/blob/poc1/etcdutl/snapshot/v3_snapshot.go#L461 5.Consume the method from workflow to created merged snapshot Ref:https://github.com/geetasg/etcd/blob/poc1/server/etcdserver/snapshot_merge.go#L34

This poc is at branch poc1 in my repo. Please let me know if you have questions/suggestions on this plan. I am currently at step 1 - PR is published for review.
/cc @serathius @ptabor

geetasg avatar Jul 24 '23 19:07 geetasg

Also listing the plan for other items - V2 no longer stores membership data 1.Fix conf change entry validation - https://github.com/geetasg/etcd/tree/pr3 2.Remove v2store from RaftCluster - ref:https://github.com/etcd-io/etcd/pull/16000/commits/8dcc34d5f2609e6d223ebfd58225a91d1a54a9be - some of the earlier commits leading upto this removal will count too. 3.Remove v2store from bootstrap - Ref: https://github.com/etcd-io/etcd/pull/16000/commits/930ecd28bbd4dc63b89aff26e28e547ec5b056a6 and https://github.com/etcd-io/etcd/pull/16000/commits/9b0dd9760a81a5b9d8a6ed009f3d668dc7350ef5

Currently this work is on #1 - PR is being discussed..

geetasg avatar Jul 24 '23 19:07 geetasg

Plan for V2 apply code is removed - 1.This will involve introducing a new applier which can apply v2 req to v3 backend. Ref:https://github.com/etcd-io/etcd/pull/16000/commits/733d109241bcf60c590fce9fd04f1e98d710dbfa 2.Update tests - there are unit tests in server_test.go that issue v2 requests. Ref: TestDoLocalAction 3.Remove v2 applier - https://github.com/etcd-io/etcd/pull/16000/commits/9c02d763d4b9859b7a648f3fd1050d126e84fcc1

v2v3 applier can be removed once the 3.5 publish node request is n/a.

geetasg avatar Jul 24 '23 19:07 geetasg

Remove the etcdutl backup? makes sense to me. It can be removed if there are no objections.

geetasg avatar Jul 24 '23 19:07 geetasg

@ahrtr @serathius @ptabor please let me know if you have any questions/comments on the plan listed above. These changes are poced in https://github.com/etcd-io/etcd/pull/16000. Thanks

geetasg avatar Jul 24 '23 19:07 geetasg

Overall looks good to me, let's do it step by step. StoreV2 deprecation is the top priority for etcd 3.6.0 per https://github.com/etcd-io/etcd/pull/16279

ahrtr avatar Jul 25 '23 12:07 ahrtr

@geetasg Could you summarize all remaining tasks in one comment using a hierarchical structure to clearly show what has been completed and what has not? thx

ahrtr avatar Aug 30 '23 09:08 ahrtr