vitessdriver: DialContext memory allocations / resource consumption
We were digging into the performance of some hot services and found that the vitessdriver / grpc DialContext path somehow consumes a significant majority of resources. I'm not sure what's causing this. It also feels related to the panic in #12067 while applying dial options. Perhaps connections are getting rotated more often than expected?
Driver configuration, with custom dial options:
return vitessdriver.Configuration{
	Protocol:   "grpc",
	Address:    db.address,
	Target:     db.target,
	Streaming:  db.streaming,
	DriverName: db.driver,
	// set dial options with max message sizes
	GRPCDialOptions: []grpc.DialOption{
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(maxMsgSize),
			grpc.MaxCallSendMsgSize(maxMsgSize),
		),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                300 * time.Second,
			Timeout:             600 * time.Second,
			PermitWithoutStream: true,
		}),
	},
}
pprof -in_use
Showing top 10 nodes out of 60
flat flat% sum% cum cum%
79.84MB 37.57% 37.57% 79.84MB 37.57% vitess.io/vitess/go/vt/grpcclient.DialContext
50MB 23.53% 61.10% 70MB 32.94% google.golang.org/grpc.WithTransportCredentials (inline)
41.38MB 19.47% 80.57% 191.22MB 89.98% vitess.io/vitess/go/vt/vtgate/grpcvtgateconn.DialWithOpts.func1
20MB 9.41% 89.98% 20MB 9.41% google.golang.org/grpc.newFuncDialOption (inline)
4.41MB 2.07% 92.06% 5.48MB 2.58% compress/flate.NewWriter
2.13MB 1.00% 93.06% 2.13MB 1.00% google.golang.org/grpc/internal/transport.newBufWriter
2MB 0.94% 94.00% 2MB 0.94% github.com/aws/aws-sdk-go/aws/endpoints.init
1.55MB 0.73% 94.73% 1.55MB 0.73% bufio.NewReaderSize
1.50MB 0.71% 95.44% 1.50MB 0.71% runtime.malg
1.07MB 0.5% 95.94% 1.07MB 0.5% compress/flate.(*compressor).initDeflate
pprof -alloc_space
Showing top 10 nodes out of 61
flat flat% sum% cum cum%
51553.97GB 93.17% 93.17% 51574.63GB 93.21% vitess.io/vitess/go/vt/grpcclient.DialContext
807.67GB 1.46% 94.63% 999.71GB 1.81% compress/flate.NewWriter
90.37GB 0.16% 94.79% 627.66GB 1.13% github.com/maruel/panicparse/v2/stack.ScanSnapshot
30.77GB 0.056% 94.85% 662.38GB 1.20% go.nozzle.io/pkg/e.newErr
6.74GB 0.012% 94.86% 309.71GB 0.56% vitess.io/vitess/go/vt/vtgate/grpcvtgateconn.(*vtgateConn).Execute
5.58GB 0.01% 94.87% 916.13GB 1.66% google.golang.org/grpc.(*clientStream).SendMsg
3.37GB 0.0061% 94.88% 631.61GB 1.14% go.nozzle.io/pkg/e.parseStack
3.36GB 0.0061% 94.88% 290.98GB 0.53% vitess.io/vitess/go/vt/proto/vtgateservice.(*vitessClient).Execute
2.54GB 0.0046% 94.89% 319.57GB 0.58% github.com/maruel/panicparse/v2/stack.(*Snapshot).findRoots
2.19GB 0.004% 94.89% 53491.15GB 96.67% go.nozzle.io/pkg/workers.(*worker).work
pprof cpu profile
Showing top 10 nodes out of 125
flat flat% sum% cum cum%
14270ms 32.78% 32.78% 18260ms 41.95% google.golang.org/grpc.(*funcDialOption).apply
5350ms 12.29% 45.07% 7940ms 18.24% runtime.findObject
4330ms 9.95% 55.02% 14520ms 33.36% runtime.scanobject
2230ms 5.12% 60.14% 2370ms 5.44% runtime.spanOf (inline)
2170ms 4.99% 65.13% 2170ms 4.99% runtime.markBits.isMarked (inline)
1350ms 3.10% 68.23% 1350ms 3.10% runtime.heapBits.bits (inline)
1260ms 2.89% 71.12% 3440ms 7.90% vitess.io/vitess/go/vt/grpcclient.SecureDialOption.func1
730ms 1.68% 72.80% 730ms 1.68% runtime.memclrNoHeapPointers
660ms 1.52% 74.32% 660ms 1.52% runtime.asyncPreempt
640ms 1.47% 75.79% 690ms 1.59% runtime.pageIndexOf (inline)
All components are v15.0.2; google.golang.org/grpc is v1.51.0.
cc @MordFustang21
This is still an issue on v17.0.2, where this path accounts for 82% of RAM consumption, has allocated 9 TB in just 12 hours in a single process, and consumes ~33% of CPU time.
pprof -in_use
Showing nodes accounting for 145.53MB, 92.59% of 157.19MB total
Dropped 97 nodes (cum <= 0.79MB)
Showing top 10 nodes out of 92
flat flat% sum% cum cum%
83.71MB 53.26% 53.26% 83.71MB 53.26% vitess.io/vitess/go/vt/grpcclient.DialContext
22.50MB 14.31% 67.57% 28.50MB 18.13% google.golang.org/grpc.WithTransportCredentials (inline)
16.93MB 10.77% 78.34% 129.14MB 82.16% vitess.io/vitess/go/vt/vitessdriver.OpenWithConfiguration.Dial.func1
6.17MB 3.93% 82.27% 7.77MB 4.94% compress/flate.NewWriter
6MB 3.82% 86.08% 6MB 3.82% google.golang.org/grpc.newFuncDialOption (inline)
3.50MB 2.23% 88.31% 3.50MB 2.23% encoding/json.(*decodeState).literalStore
2.58MB 1.64% 89.95% 2.58MB 1.64% google.golang.org/grpc/internal/transport.newBufWriter
1.60MB 1.02% 90.97% 1.60MB 1.02% compress/flate.(*compressor).initDeflate (inline)
1.55MB 0.98% 91.95% 1.55MB 0.98% bufio.NewReaderSize
1MB 0.64% 92.59% 6.50MB 4.14% go.nozzle.io/data-domains/global-index/internal/old-html-loader/internal.(*App).GenerateMessages
pprof -alloc_space
Showing nodes accounting for 10.17TB, 83.53% of 12.18TB total
Dropped 1277 nodes (cum <= 0.06TB)
Showing top 10 nodes out of 126
flat flat% sum% cum cum%
9.01TB 73.94% 73.94% 9.01TB 74.00% vitess.io/vitess/go/vt/grpcclient.DialContext
0.22TB 1.78% 75.72% 0.25TB 2.06% google.golang.org/grpc/internal/transport.(*http2Client).createHeaderFields
0.16TB 1.28% 77.00% 0.16TB 1.28% encoding/json.(*Decoder).refill
0.15TB 1.20% 78.20% 0.18TB 1.46% compress/flate.NewWriter
0.15TB 1.20% 79.40% 0.15TB 1.20% go.opencensus.io/stats/view.encodeWithKeys
0.13TB 1.10% 80.50% 0.22TB 1.77% google.golang.org/grpc/internal/transport.(*http2Client).newStream
0.10TB 0.8% 81.31% 0.24TB 2.00% go.opencensus.io/stats/view.(*viewInternal).addSample
0.10TB 0.78% 82.09% 0.15TB 1.26% google.golang.org/grpc/internal/transport.(*http2Client).operateHeaders
0.09TB 0.78% 82.86% 0.82TB 6.71% google.golang.org/grpc.newClientStreamWithParams
0.08TB 0.67% 83.53% 0.08TB 0.67% go.opencensus.io/tag.(*Map).insert
pprof cpu profile
Showing nodes accounting for 101.58s, 83.33% of 121.90s total
Dropped 1131 nodes (cum <= 0.61s)
Showing top 30 nodes out of 184
flat flat% sum% cum cum%
33.04s 27.10% 27.10% 38.46s 31.55% google.golang.org/grpc.(*funcDialOption).apply
10.10s 8.29% 35.39% 13.70s 11.24% runtime.findObject
8.98s 7.37% 42.76% 32.87s 26.96% runtime.scanobject
7.49s 6.14% 48.90% 7.71s 6.32% runtime.(*gcBits).bitp (inline)
6.76s 5.55% 54.45% 6.76s 5.55% runtime/internal/syscall.Syscall6
3.90s 3.20% 57.65% 3.90s 3.20% runtime.(*mspan).base (inline)
3.15s 2.58% 60.23% 5.31s 4.36% vitess.io/vitess/go/vt/grpcclient.SecureDialOption.WithTransportCredentials.func1
3.09s 2.53% 62.76% 3.33s 2.73% runtime.spanOf (inline)
2.81s 2.31% 65.07% 2.81s 2.31% runtime.tgkill
2.56s 2.10% 67.17% 2.56s 2.10% runtime.heapBitsForAddr
2.38s 1.95% 69.12% 2.38s 1.95% runtime.memclrNoHeapPointers
2.12s 1.74% 70.86% 2.12s 1.74% runtime.futex
2.01s 1.65% 72.51% 2.01s 1.65% runtime.procyield
1.62s 1.33% 73.84% 40.24s 33.01% google.golang.org/grpc.DialContext
I just restarted the exact same workload, and after giving it time to warm up, the dial options don't appear in the profile at all, despite running a similar volume of queries. There must be a memory leak or something.
Showing nodes accounting for 21400ms, 49.22% of 43480ms total
Dropped 1064 nodes (cum <= 217.40ms)
Showing top 10 nodes out of 261
flat flat% sum% cum cum%
11470ms 26.38% 26.38% 11470ms 26.38% runtime/internal/syscall.Syscall6
4550ms 10.46% 36.84% 4550ms 10.46% runtime.futex
850ms 1.95% 38.80% 850ms 1.95% compress/flate.(*compressor).reset
830ms 1.91% 40.71% 830ms 1.91% runtime.nextFreeFast (inline)
720ms 1.66% 42.36% 720ms 1.66% runtime.memclrNoHeapPointers
710ms 1.63% 44.00% 710ms 1.63% runtime.usleep
700ms 1.61% 45.61% 4070ms 9.36% runtime.mallocgc
550ms 1.26% 46.87% 550ms 1.26% runtime.memmove
510ms 1.17% 48.05% 3300ms 7.59% runtime.newobject
510ms 1.17% 49.22% 1450ms 3.33% runtime.scanobject
Here's a CPU flamegraph
Here's the slow leak over time as shown via p99 latency growth
You can clearly see the two process restarts
I don't fully follow the path through database/sql and the Connector interface, but it looks like grpcclient.DialContext is happening on every single transaction. I expected it would only occasionally redial, whenever the underlying connection failed or hit an idle timeout. Maybe I'm interpreting things incorrectly.
@mdlayher You implemented the connector interface a few months ago. Do you have any insight here about how often dialing should be happening?
I haven't noticed the vitessdriver on any of our profiles, though we connect to databases in several different ways, and this one is among the least used, reserved for certain circumstances.
I have also seen the panic mentioned earlier though (https://github.com/vitessio/vitess/issues/12067) but have yet to track it down.
Apologies for not being more helpful! It's possible I implemented one of the database/sql interfaces in a suboptimal way though the original issue here does seem to predate my changes as well.
Thanks for the info, that's still helpful. The issue definitely predated your changes, and we didn't notice any change afterwards, so I'm confident that wasn't a contributor.
I'm suspicious of these lines, though I haven't found anything concrete yet. I'm guessing that not many people pass in dial options the way we do, which may explain why we're seeing different behavior. https://github.com/vitessio/vitess/blob/54f0b33a652f929ecc43a949371d49364a3b94c8/go/vt/vitessdriver/driver.go#L110-L112
Our usage:
return vitessdriver.Configuration{
	Protocol:   "grpc",
	Address:    db.address,
	Target:     db.target,
	Streaming:  db.streaming,
	DriverName: db.driver,
	// set dial options with max message sizes
	GRPCDialOptions: []grpc.DialOption{
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(ngrpc.MaxMsgSize),
			grpc.MaxCallSendMsgSize(ngrpc.MaxMsgSize),
		),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                300 * time.Second,
			Timeout:             600 * time.Second,
			PermitWithoutStream: true,
		}),
	},
}
Here's my Go code:
return vitessdriver.Configuration{
	Protocol: fmt.Sprintf("grpc-%s", protocolID),
	// Specify a keyspace and cluster type for the vtgate session.
	Target:  fmt.Sprintf("%s@%s", keyspace, cluster),
	Address: "xxx",
	GRPCDialOptions: append(
		[]grpc.DialOption{
			grpc.WithPerRPCCredentials(&grpcclient.StaticAuthClientCreds{
				Username: "xxx",
				Password: "xxx",
			}),
		},
		// mTLS secures the underlying transport.
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	),
}, nil
And later:
cfg.GRPCDialOptions = append(cfg.GRPCDialOptions,
	grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
		return tunnel.DialContext(ctx, "tcp", addr)
	}),
)