etcd icon indicating copy to clipboard operation
etcd copied to clipboard

Add more gofailpoints to validate critical parts of the code

Open serathius opened this issue 2 years ago • 12 comments

What would you like to be added?

Followup from https://github.com/etcd-io/etcd/pull/14733. Thanks to gofailpoint in defrag code we were able to find an issue with data inconsistency.

Based on this we should increase coverage of other processes similar to defrag. TODO:

  • Create a list of processes to add gofailpoints
  • Add the gofailpoint into the code. Example: https://github.com/etcd-io/etcd/blob/c83b1ad9bad637fd02c6e17010755c9b35860bd7/server/storage/backend/backend.go#L504
  • Add code to linearizability tests to trigger the failpoints. Example https://github.com/etcd-io/etcd/blob/2f558ca0dbf1217c1cc5b82a1e2aec428bf6a04f/tests/linearizability/failpoints.go#L35-L44

Let's start from creating the list:

  • [x] Compact https://github.com/etcd-io/etcd/pull/14765
  • [ ] AuthEnable/Disable
  • [x] ConsistenIndex
  • [ ] MoveLeader
  • [x] BackendHooks https://github.com/etcd-io/etcd/pull/14746
  • [ ] Downgrade
  • [x] Backend Transaction (https://github.com/etcd-io/etcd/pull/14757)
  • [ ] Watch

Feel free to suggest other processes that require a gofailpoint. cc @ahrtr @ptabor

Why is this needed?

This should allow us to better detect issues with data inconsistency like https://github.com/etcd-io/etcd/pull/14733

serathius avatar Nov 11 '22 10:11 serathius

I will provide a list next week.

ahrtr avatar Nov 11 '22 10:11 ahrtr

Supplement:

  • downgrade
  • Moveleader
  • backend (I will add some failpoints)

ahrtr avatar Nov 14 '22 22:11 ahrtr

Would love to also get some suggestions from @tbg @pavelkalinnikov. Raft library already has couple of failpoints, still it might be worth to expand them, especially with https://github.com/etcd-io/etcd/pull/14627.

serathius avatar Nov 21 '22 21:11 serathius

@serathius I've checked ConsistenIndex and I think the most interesting use case for failpoint is UnsafeSave, but it's indirectly covered by PreCommitHook failpoints. Do you have something else in mind for ConsistenIndex? If not, I think it can be crossed out.

lavacat avatar Jan 03 '23 09:01 lavacat

@ahrtr Checked Moveleader and it seams most of the work is happening inside raft lib. I think it will be useful to add MoveLeader call to DefaultTraffic. But I don't see a good place inside etcd repo for a new failpoint. Do you have something else in mind for MoveLeader failpoint?

lavacat avatar Jan 03 '23 09:01 lavacat

@serathius I've checked ConsistenIndex and I think the most interesting use case for failpoint is UnsafeSave, but it's indirectly covered by PreCommitHook failpoints. Do you have something else in mind for ConsistenIndex? If not, I think it can be crossed out.

No, list is to ensure we give some thought to it. Thanks for confirming.

serathius avatar Jan 09 '23 10:01 serathius

I think we should also cover watch with failpoints. First step would be to reproduce https://github.com/etcd-io/etcd/pull/15248 by adding a failpoint after sending a progress notify. cc @ahrtr

serathius avatar Feb 09 '23 21:02 serathius

@ahrtr Checked Moveleader and it seams most of the work is happening inside raft lib. I think it will be useful to add MoveLeader call to DefaultTraffic. But I don't see a good place inside etcd repo for a new failpoint. Do you have something else in mind for MoveLeader failpoint?

Sorry for the late response. We can use newLeader to check whether the leader changed, and add failpoint accordingly.

I think we should also cover watch with failpoints. First step would be to reproduce #15248 by adding a failpoint after sending a progress notify. cc @ahrtr

This seems to be a good suggestion. But I still do not think it can reliably reproduce the duplicated event issue. Submitted a PR https://github.com/etcd-io/etcd/pull/15272.

ahrtr avatar Feb 09 '23 22:02 ahrtr

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 21 '23 17:05 stale[bot]

If no one else is working on this issue, is it fine if I work on it @serathius ?

mounilKshah avatar Aug 19 '24 01:08 mounilKshah

This issue requires some research into best places to put failpoints. This might require knowledge from maintainers, that might not be available and not able to provide timely responses.

@mounilKshah Could you look into https://github.com/etcd-io/etcd/issues/17817 ? It's very similar issue, but about backporting existing failpoints, so there is no need for maintainers help and should allow you to dig into the problem.

serathius avatar Aug 19 '24 07:08 serathius

Yeah sure, thanks for the quick response!

mounilKshah avatar Aug 20 '24 04:08 mounilKshah