etcd icon indicating copy to clipboard operation
etcd copied to clipboard

etcdserver: fix corruption check when server has just been compacted

Open jbml opened this issue 3 years ago • 5 comments

When a key-value store corruption check happens immediately after a compaction, the revision at which the key-value store hash is computed, is the compacted revision itself. In that case, the hash computation logic was incorrect because it returned an ErrCompacted error; this error should instead be returned when the revision at which the key-value store is hashed, is strictly lower than the compacted revision.

Fixes #14325

Signed-off-by: Jeremy Leach [email protected]

jbml avatar Sep 13 '22 12:09 jbml

My issues with the proposed change:

  • It makes HashKV revision semantics inconsistent with rest of API. Compaction removes etcd history up to and including specified revision. This operation is inclusive, meaning it also removes the changes from the revision. This is why you cannot request neither HashKV nor Range or revision that was compacted. This change would make it possible to get Hash for revision that is no longer accessible rest of API.
  • Impact is negligible as provided scenario is not realistic. I don't expect that in production clusters someone will compact up to the latest revision and than don't do any updates for long enough for corruption check to run. Periodic corruption check is pretty heavy operation so I also don't expect anyone to set it for 1 minute period. For example proposed period was 12h https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-4.0.md#etcd-server
  • Periodic check is known to be not reliable. We have already implemented a replacement. https://github.com/etcd-io/etcd/issues/14039. I think we can consider deprecating the old check when we have enough production data.

serathius avatar Sep 13 '22 14:09 serathius

Thank you for reviewing the PR @serathius

  • It makes HashKV revision semantics inconsistent with rest of API. Compaction removes etcd history up to and including specified revision. This operation is inclusive, meaning it also removes the changes from the revision. This is why you cannot request neither HashKV nor Range or revision that was compacted. This change would make it possible to get Hash for revision that is no longer accessible rest of API.

The intention here was actually to make the HashByRev API consistent with Range, which I believe is what the author of #14325 meant in https://github.com/etcd-io/etcd/issues/14325#issuecomment-1210717339 when he/she said:

It's a surprise to me that after compaction, the latest revision is not available to be queried at. Does that seem like the expected behavior? Based on testing with just etcdctl that doesn't seem to be the case.

I believe the following test illustrates this point. Start with a fresh etcd cluster:

> ./bin/etcdctl put foo bar1
> ./bin/etcdctl put foo bar2
> ./bin/etcdctl put foo bar3
> ./bin/etcdctl put foo bar4
> ./bin/etcdctl put foo bar5

>. /tools/etcd-dump-db/etcd-dump-db iterate-bucket  ~/projects/etcd/data/uniquenode.etcd key --decode
rev={main:6 sub:0}, value=[key "foo" | val "bar5" | created 2 | mod 6 | ver 5]
rev={main:5 sub:0}, value=[key "foo" | val "bar4" | created 2 | mod 5 | ver 4]
rev={main:4 sub:0}, value=[key "foo" | val "bar3" | created 2 | mod 4 | ver 3]
rev={main:3 sub:0}, value=[key "foo" | val "bar2" | created 2 | mod 3 | ver 2]
rev={main:2 sub:0}, value=[key "foo" | val "bar1" | created 2 | mod 2 | ver 1]

Then compact it at revision 5:

> ./bin/etcdctl compact 5
compacted revision 5

./tools/etcd-dump-db/etcd-dump-db iterate-bucket  ~/projects/etcd/data/uniquenode.etcd key --decode
rev={main:6 sub:0}, value=[key "foo" | val "bar5" | created 2 | mod 6 | ver 5]
rev={main:5 sub:0}, value=[key "foo" | val "bar4" | created 2 | mod 5 | ver 4]

Then check what revisions are accessible with Range:

> ./bin/etcdctl --rev=4 get foo
{"level":"warn","ts":"2022-09-18T16:19:51.102+1000","logger":"etcd-client","caller":"v3/retry_interceptor.go:64",
"msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00000a780/127.0.0.1:2379",
"method":"/etcdserverpb.KV/Range","attempt":0,"error":"rpc error: code = OutOfRange 
desc = etcdserver: mvcc: required revision has been compacted"}
Error: etcdserver: mvcc: required revision has been compacted
> ./bin/etcdctl --rev=5 get foo
foo
bar4
> ./bin/etcdctl --rev=6 get foo
foo
bar5

As the version at which the compaction is done (5) is still available, it made sense to me that HashByRev raising an "required revision has been compacted" error when called on the revision, was indeed an issue.

This inconsistency is what we can see when we compare this for the HashByRev API: https://github.com/etcd-io/etcd/blob/e61d43179291562c44225bf49a6d97c52a65fdca/server/storage/mvcc/kvstore.go#L180-L182 and this for the Range API: https://github.com/etcd-io/etcd/blob/fff5d00ccfddaf5f06fa11cad65c1ecaa8c09f87/server/storage/mvcc/kvstore_txn.go#L74-L76 with the comparison being "<=" on one side, and "<" on the other side.

  • Impact is negligible as provided scenario is not realistic. I don't expect that in production clusters someone will compact up to the latest revision and than don't do any updates for long enough for corruption check to run. Periodic corruption check is pretty heavy operation so I also don't expect anyone to set it for 1 minute period. For example proposed period was 12h https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-4.0.md#etcd-server

Fully agree that this is a rare case.

Fair enough.

jbml avatar Sep 18 '22 07:09 jbml

@serathius : would you suggest we simply close the PR and associated issue, please ?

jbml avatar Sep 22 '22 05:09 jbml

The intention here was actually to make the HashByRev API consistent with Range

Sorry for mistaking it there. I think there is some value in making API consistent, however from top of my head I don't know if this change doesn't negatively impact any of existing mechanisms using HashKV. @ahrtr what's your opinion?

serathius avatar Sep 22 '22 09:09 serathius

The intention here was actually to make the HashByRev API consistent with Range

Sorry for mistaking it there. I think there is some value in making API consistent, however from top of my head I don't know if this change doesn't negatively impact any of existing mechanisms using HashKV. @ahrtr what's your opinion?

I will have a closer look at this PR and related issue late today or tomorrow morning my time.

ahrtr avatar Sep 22 '22 09:09 ahrtr

I think this is a valid PR.

When the leader performs the periodic check , it calculates the hash up to the currentRev. The reason why it can pass the check below is that the rev is 0.

https://github.com/etcd-io/etcd/blob/54f9483e725f7300440fcabd9fbc4b94c81da3bd/server/storage/mvcc/kvstore.go#L180

But when leader uses the same currentRev to get peers' hashes, it fails to pass the same check above, because the rev isn't 0 anymore. It doesn't make sense. It's exactly what this PR tries to fix.

ahrtr avatar Sep 23 '22 05:09 ahrtr

Please note that when I review this PR, I found a related issue, please see https://github.com/etcd-io/etcd/pull/14510

ahrtr avatar Sep 23 '22 05:09 ahrtr

Overall looks good to me, and I don't see any negative impact.

I don't know if this change doesn't negatively impact any of existing mechanisms using HashKV

ahrtr avatar Sep 24 '22 22:09 ahrtr

Hello @serathius , Do you mind reviewing the PR when time permits please ?

jbml avatar Sep 27 '22 13:09 jbml

cc @spzala @ptabor @serathius @mitake

ahrtr avatar Oct 13 '22 02:10 ahrtr

@jbml Could you backport this PR to 3.5 and 3.4 if needed?

ahrtr avatar Oct 13 '22 07:10 ahrtr

I found this still do not backport to v3.4 and v3.5. I would help to do this.

kkkkun avatar Jun 11 '23 13:06 kkkkun