etcd
etcd copied to clipboard
Add more gofailpoints to validate critical parts of the code
What would you like to be added?
Followup from https://github.com/etcd-io/etcd/pull/14733. Thanks to gofailpoint in defrag code we were able to find an issue with data inconsistency.
Based on this we should increase coverage of other processes similar to defrag. TODO:
- Create a list of processes to add gofailpoints
- Add the gofailpoint into the code. Example: https://github.com/etcd-io/etcd/blob/c83b1ad9bad637fd02c6e17010755c9b35860bd7/server/storage/backend/backend.go#L504
- Add code to linearizability tests to trigger the failpoints. Example https://github.com/etcd-io/etcd/blob/2f558ca0dbf1217c1cc5b82a1e2aec428bf6a04f/tests/linearizability/failpoints.go#L35-L44
Let's start from creating the list:
- [x] Compact https://github.com/etcd-io/etcd/pull/14765
- [ ] AuthEnable/Disable
- [x] ConsistenIndex
- [ ] MoveLeader
- [x] BackendHooks https://github.com/etcd-io/etcd/pull/14746
- [ ] Downgrade
- [x] Backend Transaction (https://github.com/etcd-io/etcd/pull/14757)
- [ ] Watch
Feel free to suggest other processes that require a gofailpoint. cc @ahrtr @ptabor
Why is this needed?
This should allow us to better detect issues with data inconsistency like https://github.com/etcd-io/etcd/pull/14733
I will provide a list next week.
Supplement:
- downgrade
- Moveleader
- backend (I will add some failpoints)
Would love to also get some suggestions from @tbg @pavelkalinnikov. Raft library already has couple of failpoints, still it might be worth to expand them, especially with https://github.com/etcd-io/etcd/pull/14627.
@serathius I've checked ConsistenIndex and I think the most interesting use case for failpoint is UnsafeSave, but it's indirectly covered by PreCommitHook failpoints. Do you have something else in mind for ConsistenIndex? If not, I think it can be crossed out.
@ahrtr Checked Moveleader and it seams most of the work is happening inside raft lib. I think it will be useful to add MoveLeader call to DefaultTraffic. But I don't see a good place inside etcd repo for a new failpoint. Do you have something else in mind for MoveLeader failpoint?
@serathius I've checked ConsistenIndex and I think the most interesting use case for failpoint is UnsafeSave, but it's indirectly covered by PreCommitHook failpoints. Do you have something else in mind for ConsistenIndex? If not, I think it can be crossed out.
No, list is to ensure we give some thought to it. Thanks for confirming.
I think we should also cover watch with failpoints. First step would be to reproduce https://github.com/etcd-io/etcd/pull/15248 by adding a failpoint after sending a progress notify. cc @ahrtr
@ahrtr Checked Moveleader and it seams most of the work is happening inside raft lib. I think it will be useful to add MoveLeader call to DefaultTraffic. But I don't see a good place inside etcd repo for a new failpoint. Do you have something else in mind for MoveLeader failpoint?
Sorry for the late response. We can use newLeader to check whether the leader changed, and add failpoint accordingly.
I think we should also cover watch with failpoints. First step would be to reproduce #15248 by adding a failpoint after sending a progress notify. cc @ahrtr
This seems to be a good suggestion. But I still do not think it can reliably reproduce the duplicated event issue. Submitted a PR https://github.com/etcd-io/etcd/pull/15272.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
If no one else is working on this issue, is it fine if I work on it @serathius ?
This issue requires some research into best places to put failpoints. This might require knowledge from maintainers, that might not be available and not able to provide timely responses.
@mounilKshah Could you look into https://github.com/etcd-io/etcd/issues/17817 ? It's very similar issue, but about backporting existing failpoints, so there is no need for maintainers help and should allow you to dig into the problem.
Yeah sure, thanks for the quick response!