etcd icon indicating copy to clipboard operation
etcd copied to clipboard

Add hashKV checking to robustness test

Open serathius opened this issue 1 year ago • 13 comments

What would you like to be added?

https://github.com/etcd-io/etcd/pull/18274 proposes to change very delicate part of compaction to fix a correctness bug. There is high concern in this change potentially changing result of hashKV which can lead to cluster incorrectly marking itself as corrupted. We need to build a very solid confidence in hashKV calculation which regards to traffic and compaction.

Robustness tests seem like a good candidate to build that confidence even further, the main challenge in adding tests like in https://github.com/etcd-io/etcd/pull/18369 is fact that we cannot easily cover all ways requests (PUT/DELETE/TXN/Compaction) can interleave. With Robustness generating random requests it seems that we just need to start checking hashKV.

Robustness tests could be periodically requesting hashKV from all members and validating them for equality.

Why is this needed?

Would like to generalize testing for https://github.com/etcd-io/etcd/pull/18369.

serathius avatar Jul 31 '24 07:07 serathius

cc @ahrtr @fuweid @henrybear327 @jmhbnz @siyuanfoundation @ah8ad3

serathius avatar Jul 31 '24 07:07 serathius

Three cases I can think of,

  • Always check hash at the end of each test, to verify all members have exactly the same hash value on the key space. It's exactly what the previous functional test was doing.
  • Compute the hash against the compact revision after each compaction.
  • Randomly compute hash on a random historic revision during the test.

But we need to exclude false alarm. The compaction is executed in the applying workflow/path, but hashKV is executed in the API layer. The CompactRevision must be the same in the HashKVResponse of all members (the same for HashRevision), otherwise it makes no sense to compare the Hash.

https://github.com/etcd-io/etcd/blob/739a9b60b07eaa519428a6c1f364381e20141c52/api/etcdserverpb/rpc.pb.go#L1634-L1638

ahrtr avatar Jul 31 '24 09:07 ahrtr

@ArkaSaha30

ahrtr avatar Jul 31 '24 09:07 ahrtr

/assign @ArkaSaha30 /assign @henrybear327

We would like to work on this together as this also seems to be a good material for the talk that we might be preparing for!

henrybear327 avatar Jul 31 '24 15:07 henrybear327

@henrybear327: GitHub didn't allow me to assign the following users: ArkaSaha30.

Note that only etcd-io members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/assign @ArkaSaha30 /assign @henrybear327

We would like to work on this together as this also seems to be a good material for the talk that we might be preparing for!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 31 '24 15:07 k8s-ci-robot

/assign

ArkaSaha30 avatar Jul 31 '24 17:07 ArkaSaha30

Sound good to me! Especially, it can help us to do sanity and compatible check on mix-versions cluster.

fuweid avatar Aug 01 '24 10:08 fuweid

I don't know if anyone needs an extra pair of hands i'm here just ping me @ArkaSaha30 @henrybear327

ah8ad3 avatar Aug 05 '24 04:08 ah8ad3

ping @henrybear327 @ArkaSaha30 any progress?

serathius avatar Oct 26 '24 09:10 serathius

As discussed offline with Marek, we will implement the 1st check for now.

As for the second and third cases, we would look into the existing corruption check based on HashKV and see how we can reuse/improve the logic there.

PR for case 1: https://github.com/etcd-io/etcd/pull/19680

Three cases I can think of,

  • Always check hash at the end of each test, to verify all members have exactly the same hash value on the key space. It's exactly what the previous functional test was doing.
  • Compute the hash against the compact revision after each compaction.
  • Randomly compute hash on a random historic revision during the test.

henrybear327 avatar Mar 26 '25 09:03 henrybear327

Three cases I can think of,

  • Always check hash at the end of each test, to verify all members have exactly the same hash value on the key space. It's exactly what the previous functional test was doing.
  • Compute the hash against the compact revision after each compaction.
  • Randomly compute hash on a random historic revision during the test.

But we need to exclude false alarm. The compaction is executed in the applying workflow/path, but hashKV is executed in the API layer. The CompactRevision must be the same in the HashKVResponse of all members (the same for HashRevision), otherwise it makes no sense to compare the Hash.

etcd/api/etcdserverpb/rpc.pb.go

Lines 1634 to 1638 in 739a9b6

Hash uint32 protobuf:"varint,2,opt,name=hash,proto3" json:"hash,omitempty" // compact_revision is the compacted revision of key-value store when hash begins. CompactRevision int64 protobuf:"varint,3,opt,name=compact_revision,json=compactRevision,proto3" json:"compact_revision,omitempty" // hash_revision is the revision up to which the hash is calculated. HashRevision int64 protobuf:"varint,4,opt,name=hash_revision,json=hashRevision,proto3" json:"hash_revision,omitempty"

For case 2 and 3, how would we want to proceed? @serathius @ahrtr

Should we look into reusing the logic from corrupt.go? Or we should code something up from scratch. IMO, corrupt.go contains quite some similar logic to what we are trying to do here already.

henrybear327 avatar Apr 15 '25 13:04 henrybear327

Right, it doesn't make sense to have custom corruption validation. Would be good to extract some methods that are useful for robustness tests and reuse that logic.

serathius avatar Apr 15 '25 15:04 serathius

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 15 '25 00:06 github-actions[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 23 '25 00:08 github-actions[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 23 '25 00:10 github-actions[bot]