
Panic due to index out of bounds when using Jaeger exporter

Open · gsaraf opened this issue 2 years ago · 1 comment

I'm using Buildkitd (v0.10.3) in a container, and have set the JAEGER_TRACE env var to enable OpenTelemetry tracing pointed at a Jaeger all-in-one instance. This Buildkitd instance serves several clients, occasionally concurrently. Every so often, the Buildkitd process panics with a stack trace (two examples below). This is bad because any running builds are aborted and clients get an ugly error.

The stack traces come from the Jaeger exporter, so I first opened an issue there: https://github.com/open-telemetry/opentelemetry-go/issues/3063. The maintainers suggested it could be a concurrent-usage problem: apparently ExportSpans is not thread-safe, and it isn't clear that control/control.go:Export guards against concurrent calls.
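
To make the suspected failure mode concrete, here is a minimal, self-contained sketch (all names are hypothetical; fakeExporter stands in for the Jaeger exporter, which likewise serializes every batch into shared state). gRPC serves each unary RPC on its own goroutine, so two concurrent Export RPCs can reach ExportSpans at the same time:

package main

import (
	"bytes"
	"context"
	"fmt"
	"sync"
)

// fakeExporter stands in for the Jaeger exporter: like the real one, it
// serializes every batch into a single shared, unsynchronized buffer, which
// is exactly the state the panics above show being corrupted.
type fakeExporter struct {
	buf bytes.Buffer
}

func (e *fakeExporter) ExportSpans(ctx context.Context, spans []string) error {
	for _, s := range spans {
		e.buf.WriteString(s) // racy when two goroutines export at once
	}
	return nil
}

// controller mirrors the shape of control/control.go:Export: spans are
// forwarded to the exporter with no synchronization in between.
type controller struct {
	exporter *fakeExporter
}

func (c *controller) Export(ctx context.Context, spans []string) error {
	return c.exporter.ExportSpans(ctx, spans)
}

func main() {
	c := &controller{exporter: &fakeExporter{}}
	var wg sync.WaitGroup
	// Each iteration stands in for one TraceService/Export RPC; gRPC serves
	// every unary RPC on its own goroutine.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_ = c.Export(context.Background(), []string{fmt.Sprintf("span-%d", i)})
		}(i)
	}
	wg.Wait()
	// `go run -race .` reports the data race on fakeExporter.buf.
}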

Stack trace 1:

panic: runtime error: slice bounds out of range [3483:3471]

goroutine 247642 [running]:
bytes.(*Buffer).Write(0xc003cf9a88, {0xc000c8423d, 0xeda7da3fd, 0x40323e})
    /usr/local/go/src/bytes/buffer.go:174 +0xd6
go.opentelemetry.io/otel/exporters/jaeger/internal/third_party/thrift/lib/go/thrift.(*TCompactProtocol).writeVarint32(0xe28c8a, 0x158640)
    /src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/third_party/thrift/lib/go/thrift/compact_protocol.go:691 +0x6b
go.opentelemetry.io/otel/exporters/jaeger/internal/third_party/thrift/lib/go/thrift.(*TCompactProtocol).WriteI16(0xc0004d8280, {0xc000a74c30, 0x8}, 0xd340)
    /src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/third_party/thrift/lib/go/thrift/compact_protocol.go:307 +0x26
go.opentelemetry.io/otel/exporters/jaeger/internal/third_party/thrift/lib/go/thrift.(*TCompactProtocol).writeFieldBeginInternal(0xc000c841e0, {0x154cb20, 0xc007b02500}, {0xc003cf87d8, 0x0}, 0x8, 0x1, 0x0)
    /src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/third_party/thrift/lib/go/thrift/compact_protocol.go:234 +0xed
go.opentelemetry.io/otel/exporters/jaeger/internal/third_party/thrift/lib/go/thrift.(*TCompactProtocol).WriteFieldBegin(0xc000c841e0, {0x154cb20, 0xc007b02500}, {0x137455b, 0x40323e}, 0xc8, 0x3f5)
    /src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/third_party/thrift/lib/go/thrift/compact_protocol.go:202 +0x73
go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger.(*Tag).writeField1(0xc005265c70, {0x154cb20, 0xc007b02500}, {0x1582028, 0xc000c841e0})
    /src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger/jaeger.go:456 +0x64
go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger.(*Tag).Write(0xc005265c70, {0x154cb20, 0xc007b02500}, {0x1582028, 0xc000c841e0})
    /src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger/jaeger.go:424 +0x90
go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger.(*Log).writeField2(0xc0008be280, {0x154cb20, 0xc007b02500}, {0x1582028, 0xc000c841e0})
    /src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger/jaeger.go:763 +0x1e5
go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger.(*Log).Write(0xc0008be280, {0x154cb20, 0xc007b02500}, {0x1582028, 0xc000c841e0})
    /src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger/jaeger.go:729 +0xb3
go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger.(*Span).writeField11(0xc006bde2d0, {0x154cb20, 0xc007b02500}, {0x1582028, 0xc000c841e0})
    /src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger/jaeger.go:1679 +0x205
go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger.(*Span).Write(0xc006bde2d0, {0x154cb20, 0xc007b02500}, {0x1582028, 0xc000c841e0})
    /src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger/jaeger.go:1507 +0x227
go.opentelemetry.io/otel/exporters/jaeger.(*agentClientUDP).calcSizeOfSerializedThrift(0xc000700c60, {0x154cb20, 0xc007b02500}, {0x1532d68, 0xc006bde2d0})
    /src/vendor/go.opentelemetry.io/otel/exporters/jaeger/agent.go:205 +0x5c
go.opentelemetry.io/otel/exporters/jaeger.(*agentClientUDP).EmitBatch(0xc000700c60, {0x154cb20, 0xc007b02500}, 0xc009a162a0)
    /src/vendor/go.opentelemetry.io/otel/exporters/jaeger/agent.go:137 +0x188
go.opentelemetry.io/otel/exporters/jaeger.(*agentUploader).upload(0xc0063cc420, {0x154cb20, 0xc007b02500}, 0xc000114340)
    /src/vendor/go.opentelemetry.io/otel/exporters/jaeger/uploader.go:269 +0x26
go.opentelemetry.io/otel/exporters/jaeger.(*Exporter).ExportSpans(0xc00040ee00, {0x154cb20, 0xc007b022c0}, {0xc0063cc420, 0x4, 0x6})
    /src/vendor/go.opentelemetry.io/otel/exporters/jaeger/jaeger.go:105 +0x29c
github.com/moby/buildkit/control.(*Controller).Export(0xc000c4af00, {0x154cb20, 0xc007b022c0}, 0x75e363b52d889d26)
    /src/control/control.go:201 +0x97
go.opentelemetry.io/proto/otlp/collector/trace/v1._TraceService_Export_Handler.func1({0x154cb20, 0xc007b022c0}, {0x125c620, 0xc000e755c0})
    /src/vendor/go.opentelemetry.io/proto/otlp/collector/trace/v1/trace_service_grpc.pb.go:85 +0x78
github.com/moby/buildkit/util/grpcerrors.UnaryServerInterceptor({0x154cb20, 0xc007b022c0}, {0x125c620, 0xc000e755c0}, 0xc0005a8680, 0x109a6c8)
    /src/util/grpcerrors/intercept.go:14 +0x3d
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x154cb20, 0xc007b022c0}, {0x125c620, 0xc000e755c0})
    /src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x3a
main.unaryInterceptor.func1({0x154cbc8, 0xc009a16120}, {0x125c620, 0xc000e755c0}, 0xc000c15b40, 0xc000c15b60)
    /src/cmd/buildkitd/main.go:563 +0x31f
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x154cbc8, 0xc009a16120}, {0x125c620, 0xc000e755c0})
    /src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x3a
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1({0x154cbc8, 0xc009a16120}, {0x125c620, 0xc000e755c0}, 0xc00275abd0, 0x11c54a0)
    /src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34 +0xbf
go.opentelemetry.io/proto/otlp/collector/trace/v1._TraceService_Export_Handler({0x12dab20, 0xc000c4af00}, {0x154cbc8, 0xc009a16120}, 0xc008dd6180, 0xc000c3aea0)
    /src/vendor/go.opentelemetry.io/proto/otlp/collector/trace/v1/trace_service_grpc.pb.go:87 +0x138
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000c50540, {0x15651f0, 0xc001215ba0}, 0xc000e3e240, 0xc000918b70, 0x1e76410, 0x0)
    /src/vendor/google.golang.org/grpc/server.go:1282 +0xccf
google.golang.org/grpc.(*Server).handleStream(0xc000c50540, {0x15651f0, 0xc001215ba0}, 0xc000e3e240, 0x0)
    /src/vendor/google.golang.org/grpc/server.go:1619 +0xa2a
google.golang.org/grpc.(*Server).serveStreams.func1.2()
    /src/vendor/google.golang.org/grpc/server.go:921 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /src/vendor/google.golang.org/grpc/server.go:919 +0x294	

Stack trace 2:

panic: runtime error: index out of range [1241] with length 0

goroutine 10949 [running]:
bytes.(*Buffer).WriteByte(0xc000740640, 0x27)
	/usr/local/go/src/bytes/buffer.go:269 +0x85
go.opentelemetry.io/otel/exporters/jaeger/internal/third_party/thrift/lib/go/thrift.(*TCompactProtocol).writeByteDirect(...)
	/src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/third_party/thrift/lib/go/thrift/compact_protocol.go:736
go.opentelemetry.io/otel/exporters/jaeger/internal/third_party/thrift/lib/go/thrift.(*TCompactProtocol).writeFieldBeginInternal(0xc000bda500, {0x154cb20, 0xc001edcd40}, {0x1375a58, 0x0}, 0x5, 0x2, 0x0)
	/src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/third_party/thrift/lib/go/thrift/compact_protocol.go:224 +0x9f
go.opentelemetry.io/otel/exporters/jaeger/internal/third_party/thrift/lib/go/thrift.(*TCompactProtocol).WriteFieldBegin(0xc000bda500, {0x154cb20, 0xc001edcd40}, {0x137616f, 0x40323e}, 0x3e, 0x40)
	/src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/third_party/thrift/lib/go/thrift/compact_protocol.go:202 +0x73
go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger.(*Tag).writeField2(0xc002052aa0, {0x154cb20, 0xc001edcd40}, {0x1582028, 0xc000bda500})
	/src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger/jaeger.go:469 +0x64
go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger.(*Tag).Write(0xc002052aa0, {0x154cb20, 0xc001edcd40}, {0x1582028, 0xc000bda500})
	/src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger/jaeger.go:427 +0xb7
go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger.(*Log).writeField2(0xc001848ce0, {0x154cb20, 0xc001edcd40}, {0x1582028, 0xc000bda500})
	/src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger/jaeger.go:763 +0x1e5
go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger.(*Log).Write(0xc001848ce0, {0x154cb20, 0xc001edcd40}, {0x1582028, 0xc000bda500})
	/src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger/jaeger.go:729 +0xb3
go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger.(*Span).writeField11(0xc0006bf680, {0x154cb20, 0xc001edcd40}, {0x1582028, 0xc000bda500})
	/src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger/jaeger.go:1679 +0x205
go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger.(*Span).Write(0xc0006bf680, {0x154cb20, 0xc001edcd40}, {0x1582028, 0xc000bda500})
	/src/vendor/go.opentelemetry.io/otel/exporters/jaeger/internal/gen-go/jaeger/jaeger.go:1507 +0x227
go.opentelemetry.io/otel/exporters/jaeger.(*agentClientUDP).calcSizeOfSerializedThrift(0xc000596240, {0x154cb20, 0xc001edcd40}, {0x1532d68, 0xc0006bf680})
	/src/vendor/go.opentelemetry.io/otel/exporters/jaeger/agent.go:205 +0x5c
go.opentelemetry.io/otel/exporters/jaeger.(*agentClientUDP).EmitBatch(0xc000596240, {0x154cb20, 0xc001edcd40}, 0xc00192bc80)
	/src/vendor/go.opentelemetry.io/otel/exporters/jaeger/agent.go:137 +0x188
go.opentelemetry.io/otel/exporters/jaeger.(*agentUploader).upload(0xc0022cf360, {0x154cb20, 0xc001edcd40}, 0xc000422a80)
	/src/vendor/go.opentelemetry.io/otel/exporters/jaeger/uploader.go:269 +0x26
go.opentelemetry.io/otel/exporters/jaeger.(*Exporter).ExportSpans(0xc0000c1380, {0x154cb20, 0xc001edc9c0}, {0xc0022cf360, 0x5, 0x5})
	/src/vendor/go.opentelemetry.io/otel/exporters/jaeger/jaeger.go:105 +0x29c
github.com/moby/buildkit/control.(*Controller).Export(0xc000bdaaa0, {0x154cb20, 0xc001edc9c0}, 0x432d2fa0dc942e73)
	/src/control/control.go:201 +0x97
go.opentelemetry.io/proto/otlp/collector/trace/v1._TraceService_Export_Handler.func1({0x154cb20, 0xc001edc9c0}, {0x125c620, 0xc00175e2c0})
	/src/vendor/go.opentelemetry.io/proto/otlp/collector/trace/v1/trace_service_grpc.pb.go:85 +0x78
github.com/moby/buildkit/util/grpcerrors.UnaryServerInterceptor({0x154cb20, 0xc001edc9c0}, {0x125c620, 0xc00175e2c0}, 0xc000e09ba0, 0x109a6c8)
	/src/util/grpcerrors/intercept.go:14 +0x3d
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x154cb20, 0xc001edc9c0}, {0x125c620, 0xc00175e2c0})
	/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x3a
main.unaryInterceptor.func1({0x154cbc8, 0xc00192b320}, {0x125c620, 0xc00175e2c0}, 0xc001848800, 0xc001848840)
	/src/cmd/buildkitd/main.go:563 +0x31f
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x154cbc8, 0xc00192b320}, {0x125c620, 0xc00175e2c0})
	/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x3a
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1({0x154cbc8, 0xc00192b320}, {0x125c620, 0xc00175e2c0}, 0xc0010edbd0, 0x11c54a0)
	/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34 +0xbf
go.opentelemetry.io/proto/otlp/collector/trace/v1._TraceService_Export_Handler({0x12dab20, 0xc000bdaaa0}, {0x154cbc8, 0xc00192b320}, 0xc00185f1a0, 0xc000119680)
	/src/vendor/go.opentelemetry.io/proto/otlp/collector/trace/v1/trace_service_grpc.pb.go:87 +0x138
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000a6c000, {0x15651f0, 0xc000e76820}, 0xc000646120, 0xc000591d70, 0x1e76410, 0x0)
	/src/vendor/google.golang.org/grpc/server.go:1282 +0xccf
google.golang.org/grpc.(*Server).handleStream(0xc000a6c000, {0x15651f0, 0xc000e76820}, 0xc000646120, 0x0)
	/src/vendor/google.golang.org/grpc/server.go:1619 +0xa2a
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	/src/vendor/google.golang.org/grpc/server.go:921 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/src/vendor/google.golang.org/grpc/server.go:919 +0x294

Does it make sense to add a mutex around access to the TraceCollector there? I'm happy to open a PR given some guidance :)

Thanks!

gsaraf · Aug 05 '22 08:08

I was able to reproduce this issue fairly consistently by having JAEGER_TRACE set and issuing around 10 buildctl commands in parallel, such as:

buildctl build --frontend dockerfile.v0 --local context=. --local dockerfile=. --opt filename=Dockerfile --output type=image,name=REGISTRY_URL,push=true --export-cache type=registry,ref=CACHE_REGISTRY_URL,mode=max

The buildkitd process would crash within at most 3 tries.

To fix it, I tried adding another mutex to the Controller struct in control/control.go and locking around the call to ExportSpans. This seems to have solved it: I've run the 10 parallel builds more than 10 times, and so far it hasn't crashed once.
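
For illustration, a minimal sketch of that kind of guard, written as a wrapper around the OpenTelemetry SDK's SpanExporter interface rather than as the actual control.go patch; the lockedExporter type and its field names are hypothetical:

package tracing

import (
	"context"
	"sync"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// lockedExporter wraps a SpanExporter whose ExportSpans is not safe for
// concurrent use (such as the Jaeger exporter) behind a mutex, so concurrent
// gRPC handler goroutines serialize at this boundary instead of racing
// inside the exporter's shared buffer.
type lockedExporter struct {
	mu  sync.Mutex
	exp sdktrace.SpanExporter
}

func (l *lockedExporter) ExportSpans(ctx context.Context, spans []sdktrace.ReadOnlySpan) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.exp.ExportSpans(ctx, spans)
}

func (l *lockedExporter) Shutdown(ctx context.Context) error {
	// Take the same lock so Shutdown cannot race an in-flight export.
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.exp.Shutdown(ctx)
}

Wrapping at the exporter boundary keeps the lock next to the non-thread-safe code, so every caller is covered, not just Controller.Export.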

Does this make sense? Shall I submit a PR?

gsaraf · Aug 06 '22 20:08