rangefeed: likely data race in checkpoint marshaling
From Sentry and now an escalation:
- 24.3: https://github.com/cockroachdb/cockroach/issues/158166
- 24.1: https://github.com/cockroachdb/cockroach/issues/157882
- escalation on 24.1: https://github.com/cockroachlabs/support/issues/3506
- cloud logs^1
We occasionally see crashes when a MuxRangeFeedEvent is marshaled while being sent over the wire:
runtime error: index out of range [-1]
(1) attached stack trace
-- stack trace:
| runtime.gopanic
| GOROOT/src/runtime/panic.go:770
| runtime.goPanicIndex
| GOROOT/src/runtime/panic.go:114
| github.com/cockroachdb/cockroach/pkg/kv/kvpb.encodeVarintApi
| github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-opt/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:21220
| github.com/cockroachdb/cockroach/pkg/kv/kvpb.(*RangeFeedCheckpoint).MarshalToSizedBuffer
| github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-opt/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:20010
| github.com/cockroachdb/cockroach/pkg/kv/kvpb.(*RangeFeedEvent).MarshalToSizedBuffer
| github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-opt/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:20263
| github.com/cockroachdb/cockroach/pkg/kv/kvpb.(*MuxRangeFeedEvent).MarshalToSizedBuffer
| github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-opt/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:20319
| github.com/cockroachdb/cockroach/pkg/kv/kvpb.(*MuxRangeFeedEvent).Marshal
| github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-opt/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:20291
| github.com/cockroachdb/cockroach/pkg/rpc.codec.Marshal
| github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/codec.go:29
| google.golang.org/grpc.encode
| google.golang.org/grpc/external/org_golang_google_grpc/rpc_util.go:632
| google.golang.org/grpc.prepareMsg
| google.golang.org/grpc/external/org_golang_google_grpc/stream.go:1766
| google.golang.org/grpc.(*serverStream).SendMsg
| google.golang.org/grpc/external/org_golang_google_grpc/stream.go:1642
| github.com/cockroachdb/cockroach/pkg/kv/kvpb.(*internalMuxRangeFeedServer).Send
| github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-opt/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:10706
| github.com/cockroachdb/cockroach/pkg/server.(*lockedMuxStream).Send
| github.com/cockroachdb/cockroach/pkg/server/node.go:1987
| github.com/cockroachdb/cockroach/pkg/server.(*setRangeIDEventSink).Send
| github.com/cockroachdb/cockroach/pkg/server/node.go:1972
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*lockedRangefeedStream).Send
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:134
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed.(*registration).outputLoop
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/registry.go:337
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed.(*registration).runOutputLoop
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/registry.go:360
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed.(*ScheduledProcessor).Register.func1.1
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/scheduled_processor.go:343
| github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
| github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:480
| runtime.goexit
| src/runtime/asm_amd64.s:1695
Wraps: (2) runtime error: index out of range [-1]
Error types: (1) *withstack.withStack (2) runtime.boundsError
The likely reason is that a RangeFeedEvent containing a RangeFeedCheckpoint is modified while it is being marshaled. Marshaling first computes the required size, then allocates a buffer of that size, and only then writes the message. If the message grows in between, the buffer is too small by the time the write happens, and the write panics.
The only shared memory in this message is a roachpb.Span, so what is likely happening is that its Key or EndKey is being mutated.
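To make the suspected mechanism concrete, here is a minimal sketch of the size-then-write pattern. It is not the generated kvpb code: `span`, `sovarint`, `encodeVarint` and `marshalWithMutation` are hypothetical stand-ins that only mimic the back-to-front writing style of the generated marshalers, and the "concurrent" mutation is simulated deterministically by swapping the key between sizing and writing.

```go
package main

import "fmt"

// span is a hypothetical stand-in for a message that holds a shared byte slice.
type span struct {
	Key []byte
}

// sovarint returns the number of bytes needed to varint-encode v.
func sovarint(v uint64) int {
	n := 1
	for v >= 1<<7 {
		v >>= 7
		n++
	}
	return n
}

// Size returns tag byte + length varint + payload, or 0 for an empty key.
func (s *span) Size() int {
	if len(s.Key) == 0 {
		return 0
	}
	return 1 + sovarint(uint64(len(s.Key))) + len(s.Key)
}

// encodeVarint writes v so that it ends just before offset and returns
// the index of its first byte. It trusts the caller's sizing.
func encodeVarint(buf []byte, offset int, v uint64) int {
	offset -= sovarint(v)
	base := offset
	for v >= 1<<7 {
		buf[offset] = uint8(v&0x7f | 0x80)
		v >>= 7
		offset++
	}
	buf[offset] = uint8(v)
	return base
}

// MarshalToSizedBuffer fills buf from the end towards the start.
func (s *span) MarshalToSizedBuffer(buf []byte) int {
	i := len(buf)
	if len(s.Key) > 0 {
		i -= len(s.Key)
		copy(buf[i:], s.Key)
		i = encodeVarint(buf, i, uint64(len(s.Key)))
		i--
		buf[i] = 0xa // field 1, wire type 2 (length-delimited)
	}
	return len(buf) - i
}

// marshalWithMutation sizes the buffer using `before`, then swaps in `after`
// prior to writing, standing in for a writer racing with the marshaler.
func marshalWithMutation(before, after []byte) (out []byte, panicMsg string) {
	defer func() {
		if r := recover(); r != nil {
			out, panicMsg = nil, fmt.Sprint(r)
		}
	}()
	s := &span{Key: before}
	buf := make([]byte, s.Size()) // sized for the old key
	s.Key = after                 // simulated concurrent mutation
	n := s.MarshalToSizedBuffer(buf)
	return buf[len(buf)-n:], ""
}

func main() {
	_, msg := marshalWithMutation([]byte("abc"), []byte("abcde"))
	fmt.Println(msg) // prints "runtime error: index out of range [-1]"
}
```

In this sketch the key grows by two bytes after sizing, so the payload alone fills the buffer and the length varint is written at index -1, which is the same runtime error as in the trace above. Other growth amounts would fail differently (for example with a slice-bounds panic), so the sketch only shows that the reported panic is consistent with a mutation between sizing and writing, not that this is the exact sequence in production.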
Jira issue: CRDB-57715