Fix tx buffer inconsistency if there are unordered key writes in one tx.
Recommit https://github.com/etcd-io/etcd/pull/17228 with the buffer fix.
https://github.com/etcd-io/etcd/issues/17247 was fixed by reverting https://github.com/etcd-io/etcd/pull/17228, but the underlying problem with the buffer should still be fixed.
No failures locally with
EXPECT_DEBUG=true GO_TEST_FLAGS='-run=TestRobustness/Kubernetes/HighTraffic/ClusterOfSize3 --timeout=3000m --count=100 --failfast -v' RESULTS_DIR=./tmp/results make test-robustness
and with only the [RaftAfterSaveSnapPanic, MemberReplace] failpoints enabled.
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.
/retest
Hi @siyuanfoundation, you can try limiting the CPUs with the following command, together with patch #17248:
EXPECT_DEBUG=true GO_TEST_FLAGS=${GO_TEST_FLAGS} RESULTS_DIR=/tmp/results taskset -c 0,1,2 make test-robustness
Hope it helps.
Thank you @fuweid, it did help!
cc @serathius @ahrtr @fuweid
Please read https://github.com/etcd-io/etcd/issues/17247#issuecomment-1895725142
cc @ahrtr @serathius
/retest
The change looks good to me; it fixes a clear problem and adds regression tests. However, https://github.com/etcd-io/etcd/pull/17228 did the same. Can we take some additional steps to ensure we prevent another regression?
I don't think we can depend on reviewers analysing the code and finding edge cases, as that failed before. One idea I have is to do comparison testing between buffered and unbuffered transactions. What do you think?
cc @ahrtr
Added a benchmark test for the writeback function.
I ran the test 3 times. The results are somewhat noisy, but there is no clear indication of a performance regression.
Before the PR:
BenchmarkWritebackSeqBatches1BatchSize10000-8 100 132904368, 137525118, 134134904 ns/op
BenchmarkWritebackSeqBatches10BatchSize1000-8 100 135975057, 144034473, 138393072 ns/op
BenchmarkWritebackSeqBatches100BatchSize100-8 100 145159646, 147396648, 148373301 ns/op
BenchmarkWritebackSeqBatches1000BatchSize10-8 100 225271045, 225629845, 229453516 ns/op
BenchmarkWritebackNonSeqBatches1000BatchSize1-8 100 29684442, 29051643, 31323272 ns/op
BenchmarkWritebackNonSeqBatches10000BatchSize1-8 100 1011180750, 1012368726, 1013996570 ns/op
BenchmarkWritebackNonSeqBatches100BatchSize10-8 100 22771697, 22785210, 22529997 ns/op
BenchmarkWritebackNonSeqBatches1000BatchSize10-8 100 233131758, 233625789, 238527133 ns/op
After the PR:
BenchmarkWritebackSeqBatches1BatchSize10000-8 100 136669713, 131847663, 130722632 ns/op
BenchmarkWritebackSeqBatches10BatchSize1000-8 100 139880124, 139196486, 137805060 ns/op
BenchmarkWritebackSeqBatches100BatchSize100-8 100 150674477, 146649422, 146416992 ns/op
BenchmarkWritebackSeqBatches1000BatchSize10-8 100 226596374, 228097908, 223997282 ns/op
BenchmarkWritebackNonSeqBatches1000BatchSize1-8 100 30098653, 30031651, 29341949 ns/op
BenchmarkWritebackNonSeqBatches10000BatchSize1-8 100 990611851, 999573564, 994189677 ns/op
BenchmarkWritebackNonSeqBatches100BatchSize10-8 100 22655668, 22641929, 21832902 ns/op
BenchmarkWritebackNonSeqBatches1000BatchSize10-8 100 234623372, 233671979, 231825643 ns/op
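For context, the general shape of a writeback benchmark like the ones above could look roughly like this. This is a sketch, not the PR's actual benchmark code: a plain map stands in for etcd's tx buffer, and the key layout and parameter choices are assumptions.

```go
package main

import (
	"fmt"
	"testing"
)

// benchmarkWriteback fills a buffer with `batches` write batches of
// `batchSize` keys each. A plain map stands in for the real tx buffer;
// seq controls whether keys arrive in ascending order (the "Seq" cases)
// or interleaved across batches (the "NonSeq" cases).
func benchmarkWriteback(b *testing.B, batches, batchSize int, seq bool) {
	for i := 0; i < b.N; i++ {
		buf := make(map[string]string, batches*batchSize)
		for batch := 0; batch < batches; batch++ {
			for k := 0; k < batchSize; k++ {
				n := batch*batchSize + k
				if !seq {
					n = k*batches + batch // interleave keys across batches
				}
				buf[fmt.Sprintf("key-%09d", n)] = "value"
			}
		}
	}
}

func main() {
	// testing.Benchmark allows running benchmark functions outside `go test`.
	seq := testing.Benchmark(func(b *testing.B) { benchmarkWriteback(b, 10, 1000, true) })
	nonSeq := testing.Benchmark(func(b *testing.B) { benchmarkWriteback(b, 1000, 10, false) })
	fmt.Println("SeqBatches10BatchSize1000:   ", seq)
	fmt.Println("NonSeqBatches1000BatchSize10:", nonSeq)
}
```

In a real benchmark file these would be `Benchmark...` functions run with `go test -bench`, which is where the `-8` (GOMAXPROCS) suffix in the results comes from.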
cc @serathius
This bug happened because of the coincidence of Commit() timing between two different buckets; in the tests I had to do two consecutive tx.Commit() calls to reproduce it. It is very hard to rule out a regression deterministically. In this case, we just have to be careful, add more unit tests, and rely on the non-deterministic robustness tests to detect any regression.
Can you use benchstat? https://sourcegraph.com/blog/go/gophercon-2019-optimizing-go-code-without-a-blindfold
OK, I ran:
go clean -testcache && go test -bench=. -count=10 -timeout 0 > benchmark_results.txt
benchstat benchmark_results.txt
Here are the results (they are basically the same).
Before:
WritebackSeqBatches1BatchSize10000-8 128.2m ± 1%
WritebackSeqBatches10BatchSize1000-8 130.7m ± 1%
WritebackSeqBatches100BatchSize100-8 139.5m ± 1%
WritebackSeqBatches1000BatchSize10-8 220.6m ± 4%
WritebackNonSeqBatches1000BatchSize1-8 28.34m ± 2%
WritebackNonSeqBatches10000BatchSize1-8 1.002 ± 1%
WritebackNonSeqBatches100BatchSize10-8 21.64m ± 2%
WritebackNonSeqBatches1000BatchSize10-8 226.3m ± 1%
After:
WritebackSeqBatches1BatchSize10000-8 128.3m ± 1%
WritebackSeqBatches10BatchSize1000-8 130.8m ± 3%
WritebackSeqBatches100BatchSize100-8 140.0m ± 1%
WritebackSeqBatches1000BatchSize10-8 216.3m ± 4%
WritebackNonSeqBatches1000BatchSize1-8 27.49m ± 2%
WritebackNonSeqBatches10000BatchSize1-8 966.6m ± 1%
WritebackNonSeqBatches100BatchSize10-8 20.47m ± 2%
WritebackNonSeqBatches1000BatchSize10-8 221.3m ± 1%
Benchstat does not change the results; it just makes them readable and makes it easy to check whether they are statistically significant.
If you provide benchstat with both the old and new benchmark results, it will also report the performance delta and p-value.
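Concretely, the two-file comparison could look something like the following. The branch name is hypothetical; benchstat itself comes from golang.org/x/perf/cmd/benchstat.

```shell
# Baseline on main, then the patched branch; -count=10 gives benchstat
# enough samples to compute a p-value per benchmark.
git checkout main
go test -bench=Writeback -count=10 -timeout 0 > benchmark_before.txt
git checkout fix-tx-buffer   # hypothetical name for this PR's branch
go test -bench=Writeback -count=10 -timeout 0 > benchmark_after.txt

# With two input files, benchstat prints old vs new side by side,
# plus the delta and the p-value for each benchmark.
benchstat benchmark_before.txt benchmark_after.txt
```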
I reran the before/after comparison twice with count=10. Run 1:
│ benchmark_before.txt │ benchmark_after.txt │
│ sec/op │ sec/op vs base │
WritebackSeqBatches1BatchSize10000-8 127.7m ± 1% 127.2m ± 3% ~ (p=0.796 n=10)
WritebackSeqBatches10BatchSize1000-8 129.9m ± 6% 129.9m ± 1% ~ (p=0.971 n=10)
WritebackSeqBatches100BatchSize100-8 138.6m ± 1% 137.0m ± 1% -1.17% (p=0.029 n=10)
WritebackSeqBatches1000BatchSize10-8 215.7m ± 1% 212.7m ± 4% ~ (p=0.052 n=10)
WritebackNonSeqBatches1000BatchSize1-8 28.48m ± 2% 28.05m ± 2% -1.48% (p=0.035 n=10)
WritebackNonSeqBatches10000BatchSize1-8 981.0m ± 1% 949.6m ± 0% -3.20% (p=0.000 n=10)
WritebackNonSeqBatches100BatchSize10-8 20.34m ± 1% 21.21m ± 5% +4.27% (p=0.000 n=10)
WritebackNonSeqBatches1000BatchSize10-8 216.3m ± 1% 216.4m ± 4% ~ (p=0.796 n=10)
Run 2:
│ benchmark_before_0.txt │ benchmark_after_0.txt │
│ sec/op │ sec/op vs base │
WritebackSeqBatches1BatchSize10000-8 127.8m ± 3% 130.0m ± 2% +1.73% (p=0.023 n=10)
WritebackSeqBatches10BatchSize1000-8 127.2m ± 1% 130.9m ± 3% +2.95% (p=0.001 n=10)
WritebackSeqBatches100BatchSize100-8 135.4m ± 1% 139.2m ± 3% +2.76% (p=0.001 n=10)
WritebackSeqBatches1000BatchSize10-8 212.4m ± 2% 213.1m ± 0% ~ (p=0.165 n=10)
WritebackNonSeqBatches1000BatchSize1-8 27.82m ± 2% 28.80m ± 3% +3.51% (p=0.001 n=10)
WritebackNonSeqBatches10000BatchSize1-8 970.1m ± 2% 951.8m ± 1% -1.89% (p=0.000 n=10)
WritebackNonSeqBatches100BatchSize10-8 20.23m ± 1% 20.39m ± 5% ~ (p=0.165 n=10)
WritebackNonSeqBatches1000BatchSize10-8 217.1m ± 1% 216.6m ± 1% ~ (p=0.529 n=10)
So in the updated benchmark results you can see that it causes a ~2% performance regression in some cases. Still, I think the impact on overall etcd performance should not be noticeable, and we should pick correctness here.