cockroach
cockroach copied to clipboard
kv: verify roachpb.Value checksums when building snapshots
This commit verify record checksum verification when processing raft snapshot records. This process prevents data corruption spread via snapshot process. Cluster setting snapshotChecksumVerification is introduced to act as a safeguard knob to turn on/off the verifcation.
Fixes: https://github.com/cockroachdb/cockroach/issues/110572
Release note: None
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.
My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.
I have added a few people who may be able to assist in reviewing:
- @sumeerbhola (commented on cockroachdb/cockroach#110572)
- @jbowens (commented on cockroachdb/cockroach#110572)
- @erikgrinaker (author of cockroachdb/cockroach#110572)
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Thank you for updating your pull request.
My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
Thank you for updating your pull request.
My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
I digged more into the test failure with @pav-kv, we added prints to mvccput and found the orginal value being put is a SimpleMVCCValue
key: /Local/RangeID/68/r/RangeGCThreshold
value: [7a 00 c9 92 03]
proto: 0,0
checksum: 7a00c992
computed checksum: 7a00c992
the value matches the suffix of the error log in the checksum verification
[12 04 08 00 10 00 18 00 20 00 28 00 32 05 7a 00 c9 92 03]
which leads us to think it become an ExtendedMVCCValue
in the raft snaphot.
The confusion part is decode simple mvcc value would work on the [12 04 08 00 10 00 18 00 20 00 28 00 32 05 7a 00 c9 92 03]
value and 12 04 08 00
leads to wrong checksum. I am expecting it to return false and redirect to decodeExtendedMVCCValue(r.Value())
. I think Erik is ooo, hi @jbowens do you know a better way to decode the value from
batchreader?
Previous thread is resolved, I wrapped decoding logic under the if mvccKey.IsValue()
just like here test passes, I guess mvcckey without a timestamp is another category of mvccdata.
if mvccKey.IsValue() {
mvccValue, ok, err := tryDecodeSimpleMVCCValue(r.Value())
if !ok && err == nil {
mvccValue, err = decodeExtendedMVCCValue(r.Value())
}
if err == nil {
err = mvccValue.Value.Verify(mvccKey.Key)
}
}
pkg/kv/kvserver/store_snapshot.go
line 555 at r7 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
A bunch of func calls below (e.g.
PutInternalRangeKey
) will decode theEngineKey
internally. I wonder if the APIs should acceptEngineKey
instead of the raw one. Probably not to be addressed in PR.
left a todo on this, personally I think all the method below can take a decoded engine key as input
unfortunately this pr experienced the same test failures will wait for the fix of 121342 then rebase and merge https://github.com/cockroachdb/cockroach/issues/121342
bors r+