rangefeed: likely data race in checkpoint marshaling

Open: tbg opened this issue 3 weeks ago • 1 comment

From Sentry and now an escalation:

  • 24.3: https://github.com/cockroachdb/cockroach/issues/158166
  • 24.1: https://github.com/cockroachdb/cockroach/issues/157882
  • escalation on 24.1: https://github.com/cockroachlabs/support/issues/3506
  • cloud logs^1

We occasionally see crashes when a MuxRangeFeedEvent is marshaled as part of being sent over the wire.

runtime error: index out of range [-1]
(1) attached stack trace
  -- stack trace:
  | runtime.gopanic
  |     GOROOT/src/runtime/panic.go:770
  | runtime.goPanicIndex
  |     GOROOT/src/runtime/panic.go:114
  | github.com/cockroachdb/cockroach/pkg/kv/kvpb.encodeVarintApi
  |     github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-opt/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:21220
  | github.com/cockroachdb/cockroach/pkg/kv/kvpb.(*RangeFeedCheckpoint).MarshalToSizedBuffer
  |     github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-opt/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:20010
  | github.com/cockroachdb/cockroach/pkg/kv/kvpb.(*RangeFeedEvent).MarshalToSizedBuffer
  |     github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-opt/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:20263
  | github.com/cockroachdb/cockroach/pkg/kv/kvpb.(*MuxRangeFeedEvent).MarshalToSizedBuffer
  |     github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-opt/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:20319
  | github.com/cockroachdb/cockroach/pkg/kv/kvpb.(*MuxRangeFeedEvent).Marshal
  |     github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-opt/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:20291
  | github.com/cockroachdb/cockroach/pkg/rpc.codec.Marshal
  |     github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/codec.go:29
  | google.golang.org/grpc.encode
  |     google.golang.org/grpc/external/org_golang_google_grpc/rpc_util.go:632
  | google.golang.org/grpc.prepareMsg
  |     google.golang.org/grpc/external/org_golang_google_grpc/stream.go:1766
  | google.golang.org/grpc.(*serverStream).SendMsg
  |     google.golang.org/grpc/external/org_golang_google_grpc/stream.go:1642
  | github.com/cockroachdb/cockroach/pkg/kv/kvpb.(*internalMuxRangeFeedServer).Send
  |     github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-opt/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:10706
  | github.com/cockroachdb/cockroach/pkg/server.(*lockedMuxStream).Send
  |     github.com/cockroachdb/cockroach/pkg/server/node.go:1987
  | github.com/cockroachdb/cockroach/pkg/server.(*setRangeIDEventSink).Send
  |     github.com/cockroachdb/cockroach/pkg/server/node.go:1972
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*lockedRangefeedStream).Send
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:134
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed.(*registration).outputLoop
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/registry.go:337
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed.(*registration).runOutputLoop
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/registry.go:360
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed.(*ScheduledProcessor).Register.func1.1
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/scheduled_processor.go:343
  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
  |     github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:480
  | runtime.goexit
  |     src/runtime/asm_amd64.s:1695
Wraps: (2) runtime error: index out of range [-1]
Error types: (1) *withstack.withStack (2) runtime.boundsError

The likely cause is that a RangeFeedEvent containing a RangeFeedCheckpoint is modified while it is being marshaled. Marshaling first computes the size needed, then allocates a buffer of that size, and only then writes the message into it. If the message grows between the size computation and the write, the buffer is too small by the time marshaling actually happens, and the write panics.
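To make the failure mode concrete, here is a minimal single-goroutine sketch of the size-then-write pattern. Everything in it is an assumption for illustration: `span`, `size`, `marshalToSizedBuffer` and `marshal` are hypothetical stand-ins with a simplified wire format, not the generated kvpb code, and the `mutate` callback stands in for whatever concurrent writer is suspected. It only shows that growing a field after sizing produces the same `index out of range [-1]` panic when the buffer is filled back to front.

```go
package main

import "fmt"

// span is a hypothetical stand-in for a message with two byte-slice fields.
// It is not roachpb.Span, and the wire format below is simplified.
type span struct {
	Key, EndKey []byte
}

// size returns the bytes needed: one length byte plus the payload per field.
// (Assumes keys shorter than 128 bytes so a length fits in a single byte.)
func (s *span) size() int {
	return 1 + len(s.Key) + 1 + len(s.EndKey)
}

// marshalToSizedBuffer writes s into buf back to front, trusting that
// len(buf) == s.size() still holds.
func (s *span) marshalToSizedBuffer(buf []byte) {
	i := len(buf)
	i -= len(s.EndKey)
	copy(buf[i:], s.EndKey)
	i--
	buf[i] = byte(len(s.EndKey))
	i -= len(s.Key)
	copy(buf[i:], s.Key)
	i--
	buf[i] = byte(len(s.Key)) // i is -1 here if the message grew by one byte after size()
}

// marshal sizes the buffer, runs mutate (standing in for a concurrent
// writer), then fills the buffer. A panic is converted into an error.
func marshal(s *span, mutate func()) (out []byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			out, err = nil, fmt.Errorf("marshal panicked: %v", r)
		}
	}()
	buf := make([]byte, s.size())
	if mutate != nil {
		mutate()
	}
	s.marshalToSizedBuffer(buf)
	return buf, nil
}

func main() {
	s := &span{Key: []byte("a"), EndKey: []byte("b")}

	// No mutation: the buffer matches the message and marshaling succeeds.
	out, err := marshal(s, nil)
	fmt.Println(out, err)

	// Grow EndKey by one byte between sizing and writing.
	_, err = marshal(s, func() { s.EndKey = append(s.EndKey, 'x') })
	fmt.Println(err)
}
```

In the real crash the mutation would come from another goroutine rather than a callback, so it would only hit when the write lands in the narrow window between sizing and filling the buffer, which would be consistent with the crashes being occasional.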

The only shared memory in this message is a roachpb.Span, so what is most likely happening is that its Key or EndKey is being mutated concurrently.
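As a reminder of why a span can count as shared memory even when the enclosing message is copied by value, here is a small sketch. The `span` and `checkpoint` types and the `clone` helper are hypothetical stand-ins, not the roachpb or kvpb types. It only illustrates Go slice aliasing and makes no claim about where the actual mutation happens or how it should be fixed.

```go
package main

import "fmt"

// span is a hypothetical stand-in for a key span; it is not roachpb.Span.
type span struct {
	Key, EndKey []byte
}

// checkpoint is a hypothetical stand-in for a checkpoint message that
// embeds a span by value.
type checkpoint struct {
	Span span
}

// clone returns a copy of s whose keys have their own backing arrays.
func (s span) clone() span {
	return span{
		Key:    append([]byte(nil), s.Key...),
		EndKey: append([]byte(nil), s.EndKey...),
	}
}

func main() {
	orig := checkpoint{Span: span{Key: []byte("a"), EndKey: []byte("c")}}

	shallow := orig                             // copies the slice headers only
	deep := checkpoint{Span: orig.Span.clone()} // copies the key bytes as well

	orig.Span.Key[0] = 'z' // a write through the original span

	// The shallow copy observes the write; the deep copy does not.
	fmt.Printf("shallow=%q deep=%q\n", shallow.Span.Key, deep.Span.Key)
	// prints: shallow="z" deep="a"
}
```

A value copy of the message is therefore not enough to isolate it from a writer that still holds the original key slices.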

Jira issue: CRDB-57715

tbg, Dec 10 '25 16:12

Hi @tbg, please add a branch-* label to identify the earliest affected branch for this C-bug

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot], Dec 10 '25 16:12