
Receiver: Fix data race in writer (label set copying)

Open matej-g opened this issue 3 years ago • 4 comments

Signed-off-by: Matej Gera [email protected]

  • [ ] I added CHANGELOG entry for this change.
  • [ ] Change is not relevant to the end user.

Changes

This is a continuation of the fixes detected during work on #4664.

We appear to be hitting a data race due to a concurrent read / write on the write request's labels. See the race detector output:

 WARNING: DATA RACE
 Read at 0x00c000cc4f00 by goroutine 40:
 runtime.racereadrange()
 <autogenerated>:1 +0x1b
 github.com/thanos-io/thanos/pkg/store/storepb.(*WriteRequest).Size()
 /go/src/github.com/thanos-io/thanos/pkg/store/storepb/rpc.pb.go:2011 +0x224
 github.com/thanos-io/thanos/pkg/store/storepb.(*WriteRequest).Marshal()
 /go/src/github.com/thanos-io/thanos/pkg/store/storepb/rpc.pb.go:1204 +0x30
 google.golang.org/protobuf/internal/impl.legacyMarshal()
 /go/pkg/mod/google.golang.org/[email protected]/internal/impl/legacy_message.go:404 +0xbe
 google.golang.org/protobuf/proto.MarshalOptions.marshal()
 /go/pkg/mod/google.golang.org/[email protected]/proto/encode.go:163 +0x3b9
 google.golang.org/protobuf/proto.MarshalOptions.MarshalAppend()
 /go/pkg/mod/google.golang.org/[email protected]/proto/encode.go:122 +0xa5
 github.com/golang/protobuf/proto.marshalAppend()
 /go/pkg/mod/github.com/golang/[email protected]/proto/wire.go:40 +0xe4
 github.com/golang/protobuf/proto.Marshal()
 /go/pkg/mod/github.com/golang/[email protected]/proto/wire.go:23 +0x67
 google.golang.org/grpc/encoding/proto.codec.Marshal()
 /go/pkg/mod/google.golang.org/[email protected]/encoding/proto/proto.go:45 +0x68
 google.golang.org/grpc/encoding/proto.(*codec).Marshal()
 <autogenerated>:1 +0x5b
 google.golang.org/grpc.encode()
 /go/pkg/mod/google.golang.org/[email protected]/rpc_util.go:594 +0x6a
 google.golang.org/grpc.prepareMsg()
 /go/pkg/mod/google.golang.org/[email protected]/stream.go:1603 +0x1ab
 google.golang.org/grpc.(*clientStream).SendMsg()
 /go/pkg/mod/google.golang.org/[email protected]/stream.go:784 +0x29b
 google.golang.org/grpc.invoke()
 /go/pkg/mod/google.golang.org/[email protected]/call.go:70 +0xfe
 github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors.UnaryClientInterceptor.func1()
 /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/[email protected]/interceptors/client.go:22 +0x29a
 github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainUnaryClient.func1.1.1()
 /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/[email protected]/chain.go:74 +0x112
 github.com/grpc-ecosystem/go-grpc-prometheus.(*ClientMetrics).UnaryClientInterceptor.func1()
 /go/pkg/mod/github.com/grpc-ecosystem/[email protected]/client_metrics.go:112 +0x133
 github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainUnaryClient.func1.1.1()
 /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/[email protected]/chain.go:74 +0x112
 github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainUnaryClient.func1()
 /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/[email protected]/chain.go:83 +0x188
 google.golang.org/grpc.(*ClientConn).Invoke()
 /go/pkg/mod/google.golang.org/[email protected]/call.go:35 +0x265
 github.com/thanos-io/thanos/pkg/store/storepb.(*writeableStoreClient).RemoteWrite()
 /go/src/github.com/thanos-io/thanos/pkg/store/storepb/rpc.pb.go:1124 +0xd4
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward.func4.2()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:643 +0x246
 github.com/thanos-io/thanos/pkg/tracing.DoInSpan()
 /go/src/github.com/thanos-io/thanos/pkg/tracing/tracing.go:93 +0x135
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward.func4()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:641 +0x41d
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward·dwrap·9()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:674 +0x58
 Previous write at 0x00c000cc4f00 by goroutine 117:
 github.com/thanos-io/thanos/pkg/store/labelpb.ReAllocZLabelsStrings()
 /go/src/github.com/thanos-io/thanos/pkg/store/labelpb/label.go:50 +0x147
 github.com/thanos-io/thanos/pkg/receive.(*Writer).Write()
 /go/src/github.com/thanos-io/thanos/pkg/receive/writer.go:80 +0x2865
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward.func3.1()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:591 +0xfe
 github.com/thanos-io/thanos/pkg/tracing.DoInSpan()
 /go/src/github.com/thanos-io/thanos/pkg/tracing/tracing.go:93 +0x135
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward.func3()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:590 +0x1ed
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward·dwrap·8()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:601 +0x58
 Goroutine 40 (running) created at:
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:607 +0xd67
 github.com/thanos-io/thanos/pkg/receive.(*Handler).replicate()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:748 +0x3d3
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward.func2.1()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:564 +0xda
 github.com/thanos-io/thanos/pkg/tracing.DoInSpan()
 /go/src/github.com/thanos-io/thanos/pkg/tracing/tracing.go:93 +0x135
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward.func2()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:563 +0x1cf
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward·dwrap·7()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:574 +0x58
 Goroutine 117 (running) created at:
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:586 +0x1078
 github.com/thanos-io/thanos/pkg/receive.(*Handler).replicate()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:748 +0x3d3
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward.func2.1()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:564 +0xda
 github.com/thanos-io/thanos/pkg/tracing.DoInSpan()
 /go/src/github.com/thanos-io/thanos/pkg/tracing/tracing.go:93 +0x135
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward.func2()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:563 +0x1cf
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward·dwrap·7()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:574 +0x58

Based on the logs, my assumption is that in the handler we are re-using the same *prompb.WriteRequest in the fanoutForward method: once for the local node Write and then again for the remote forward in RemoteWrite. It seems that the reallocation shuffling with the labels can interfere with gRPC message marshalling, since both paths operate on the same underlying memory.

The suggestion here is to make a short-lived separate copy of the label set (if a ref is not available) before proceeding with the write on the local node. This should ensure we are no longer writing to the same object that we're reading from during marshalling.

I'm not 100% sure of the performance implications though, seeing as this is the main write path for receive.

Verification

Ran the E2E receiver tests on top of #5066 with no data race being detected anymore.

matej-g avatar Jul 12 '22 14:07 matej-g

I also added a benchmark for the realloc version with DeepCopy; it seems to be slightly slower than the current version:

go test -benchmem -benchtime=10s -run=^$ -tags integration -bench ^BenchmarkTransformWithAndWithoutCopy fix-data-race-receive-writer-labels
goos: linux
goarch: amd64
pkg: github.com/thanos-io/thanos/pkg/store/labelpb
cpu: Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz
BenchmarkTransformWithAndWithoutCopy/ZLabelsToPromLabels-12             1000000000               1.535 ns/op           0 B/op          0 allocs/op
BenchmarkTransformWithAndWithoutCopy/ZLabelsToPromLabelsWithRealloc-12               211          57434846 ns/op        96000000 B/op    2000000 allocs/op
BenchmarkTransformWithAndWithoutCopy/ZLabelsToPromLabelsWithDeepCopy-12              190          61866060 ns/op        128006144 B/op   2000001 allocs/op
PASS
ok      github.com/thanos-io/thanos/pkg/store/labelpb   41.895s

matej-g avatar Jul 29 '22 13:07 matej-g

The results seem to show quite a big difference (from 1.535 ns/op to 61866060 ns/op, i.e. about 61ms), but reading the benchmark code we can see that this is the case for 1,000,000 labels, which is quite extreme. Also, the realloc itself takes about 57ms...

I suspect the performance difference for scenarios with, say, 50 labels would be unnoticeable. There might still be an interesting change in a high-throughput scenario.

douglascamata avatar Aug 04 '22 10:08 douglascamata

What if we protect the realloc for the local write with a mutex and check it before the fanout gRPC writes? 🤔

douglascamata avatar Aug 04 '22 10:08 douglascamata

I am not fully bought in on this honestly, need to look deeper 🙈

bwplotka avatar Aug 09 '22 12:08 bwplotka

Hello 👋 Looks like there was no activity on this amazing PR for the last 30 days. Do you mind updating us on the status? Is there anything we can help with? If you plan to still work on it, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next week, this issue will be closed (we can always reopen a PR if you get back to this!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Oct 30 '22 00:10 stale[bot]

Closing for now as promised, let us know if you need this to be reopened! 🤗

stale[bot] avatar Nov 12 '22 12:11 stale[bot]