
vitessdriver: DialContext memory allocations / resource consumption

Open derekperkins opened this issue 3 years ago • 7 comments

We were digging into the performance of some hot services and found that the vitessdriver / grpc DialContext path somehow consumes a significant majority of resources. I am not sure what is causing this. It also feels related to the panic in #12067 while applying dial options. Perhaps connections are being rotated more often than expected?

Driver configuration, with custom dial options:

	return vitessdriver.Configuration{
		Protocol:  "grpc",
		Address:   db.address,
		Target:    db.target,
		Streaming: db.streaming,

		DriverName: db.driver,

		// set dial options with max message sizes
		GRPCDialOptions: []grpc.DialOption{
			grpc.WithDefaultCallOptions(
				grpc.MaxCallRecvMsgSize(maxMsgSize),
				grpc.MaxCallSendMsgSize(maxMsgSize),
			),
			grpc.WithKeepaliveParams(keepalive.ClientParameters{
				Time:                300 * time.Second,
				Timeout:             600 * time.Second,
				PermitWithoutStream: true,
			}),
		},
	}

pprof -in_use

Showing top 10 nodes out of 60
      flat  flat%   sum%        cum   cum%
   79.84MB 37.57% 37.57%    79.84MB 37.57%  vitess.io/vitess/go/vt/grpcclient.DialContext
      50MB 23.53% 61.10%       70MB 32.94%  google.golang.org/grpc.WithTransportCredentials (inline)
   41.38MB 19.47% 80.57%   191.22MB 89.98%  vitess.io/vitess/go/vt/vtgate/grpcvtgateconn.DialWithOpts.func1
      20MB  9.41% 89.98%       20MB  9.41%  google.golang.org/grpc.newFuncDialOption (inline)
    4.41MB  2.07% 92.06%     5.48MB  2.58%  compress/flate.NewWriter
    2.13MB  1.00% 93.06%     2.13MB  1.00%  google.golang.org/grpc/internal/transport.newBufWriter
       2MB  0.94% 94.00%        2MB  0.94%  github.com/aws/aws-sdk-go/aws/endpoints.init
    1.55MB  0.73% 94.73%     1.55MB  0.73%  bufio.NewReaderSize
    1.50MB  0.71% 95.44%     1.50MB  0.71%  runtime.malg
    1.07MB   0.5% 95.94%     1.07MB   0.5%  compress/flate.(*compressor).initDeflate 


pprof -alloc_space

Showing top 10 nodes out of 61
      flat  flat%   sum%        cum   cum%
51553.97GB 93.17% 93.17% 51574.63GB 93.21%  vitess.io/vitess/go/vt/grpcclient.DialContext
  807.67GB  1.46% 94.63%   999.71GB  1.81%  compress/flate.NewWriter
   90.37GB  0.16% 94.79%   627.66GB  1.13%  github.com/maruel/panicparse/v2/stack.ScanSnapshot
   30.77GB 0.056% 94.85%   662.38GB  1.20%  go.nozzle.io/pkg/e.newErr
    6.74GB 0.012% 94.86%   309.71GB  0.56%  vitess.io/vitess/go/vt/vtgate/grpcvtgateconn.(*vtgateConn).Execute
    5.58GB  0.01% 94.87%   916.13GB  1.66%  google.golang.org/grpc.(*clientStream).SendMsg
    3.37GB 0.0061% 94.88%   631.61GB  1.14%  go.nozzle.io/pkg/e.parseStack
    3.36GB 0.0061% 94.88%   290.98GB  0.53%  vitess.io/vitess/go/vt/proto/vtgateservice.(*vitessClient).Execute
    2.54GB 0.0046% 94.89%   319.57GB  0.58%  github.com/maruel/panicparse/v2/stack.(*Snapshot).findRoots
    2.19GB 0.004% 94.89% 53491.15GB 96.67%  go.nozzle.io/pkg/workers.(*worker).work


pprof cpu profile

Showing top 10 nodes out of 125
      flat  flat%   sum%        cum   cum%
   14270ms 32.78% 32.78%    18260ms 41.95%  google.golang.org/grpc.(*funcDialOption).apply
    5350ms 12.29% 45.07%     7940ms 18.24%  runtime.findObject
    4330ms  9.95% 55.02%    14520ms 33.36%  runtime.scanobject
    2230ms  5.12% 60.14%     2370ms  5.44%  runtime.spanOf (inline)
    2170ms  4.99% 65.13%     2170ms  4.99%  runtime.markBits.isMarked (inline)
    1350ms  3.10% 68.23%     1350ms  3.10%  runtime.heapBits.bits (inline)
    1260ms  2.89% 71.12%     3440ms  7.90%  vitess.io/vitess/go/vt/grpcclient.SecureDialOption.func1
     730ms  1.68% 72.80%      730ms  1.68%  runtime.memclrNoHeapPointers
     660ms  1.52% 74.32%      660ms  1.52%  runtime.asyncPreempt
     640ms  1.47% 75.79%      690ms  1.59%  runtime.pageIndexOf (inline)


All components are v15.0.2; google.golang.org/grpc is v1.51.0.

cc @MordFustang21

derekperkins avatar Jan 10 '23 22:01 derekperkins

This is still an issue on v17.0.2, where it accounts for 82% of RAM consumption, has allocated 9 TB in just 12 hours in a single process, and consumes ~33% of CPU time.

pprof -in_use

Showing nodes accounting for 145.53MB, 92.59% of 157.19MB total
Dropped 97 nodes (cum <= 0.79MB)
Showing top 10 nodes out of 92
      flat  flat%   sum%        cum   cum%
   83.71MB 53.26% 53.26%    83.71MB 53.26%  vitess.io/vitess/go/vt/grpcclient.DialContext
   22.50MB 14.31% 67.57%    28.50MB 18.13%  google.golang.org/grpc.WithTransportCredentials (inline)
   16.93MB 10.77% 78.34%   129.14MB 82.16%  vitess.io/vitess/go/vt/vitessdriver.OpenWithConfiguration.Dial.func1
    6.17MB  3.93% 82.27%     7.77MB  4.94%  compress/flate.NewWriter
       6MB  3.82% 86.08%        6MB  3.82%  google.golang.org/grpc.newFuncDialOption (inline)
    3.50MB  2.23% 88.31%     3.50MB  2.23%  encoding/json.(*decodeState).literalStore
    2.58MB  1.64% 89.95%     2.58MB  1.64%  google.golang.org/grpc/internal/transport.newBufWriter
    1.60MB  1.02% 90.97%     1.60MB  1.02%  compress/flate.(*compressor).initDeflate (inline)
    1.55MB  0.98% 91.95%     1.55MB  0.98%  bufio.NewReaderSize
       1MB  0.64% 92.59%     6.50MB  4.14%  go.nozzle.io/data-domains/global-index/internal/old-html-loader/internal.(*App).GenerateMessages

pprof -alloc_space

Showing nodes accounting for 10.17TB, 83.53% of 12.18TB total
Dropped 1277 nodes (cum <= 0.06TB)
Showing top 10 nodes out of 126
      flat  flat%   sum%        cum   cum%
    9.01TB 73.94% 73.94%     9.01TB 74.00%  vitess.io/vitess/go/vt/grpcclient.DialContext
    0.22TB  1.78% 75.72%     0.25TB  2.06%  google.golang.org/grpc/internal/transport.(*http2Client).createHeaderFields
    0.16TB  1.28% 77.00%     0.16TB  1.28%  encoding/json.(*Decoder).refill
    0.15TB  1.20% 78.20%     0.18TB  1.46%  compress/flate.NewWriter
    0.15TB  1.20% 79.40%     0.15TB  1.20%  go.opencensus.io/stats/view.encodeWithKeys
    0.13TB  1.10% 80.50%     0.22TB  1.77%  google.golang.org/grpc/internal/transport.(*http2Client).newStream
    0.10TB   0.8% 81.31%     0.24TB  2.00%  go.opencensus.io/stats/view.(*viewInternal).addSample
    0.10TB  0.78% 82.09%     0.15TB  1.26%  google.golang.org/grpc/internal/transport.(*http2Client).operateHeaders
    0.09TB  0.78% 82.86%     0.82TB  6.71%  google.golang.org/grpc.newClientStreamWithParams
    0.08TB  0.67% 83.53%     0.08TB  0.67%  go.opencensus.io/tag.(*Map).insert

pprof cpu profile

Showing nodes accounting for 101.58s, 83.33% of 121.90s total
Dropped 1131 nodes (cum <= 0.61s)
Showing top 30 nodes out of 184
      flat  flat%   sum%        cum   cum%
    33.04s 27.10% 27.10%     38.46s 31.55%  google.golang.org/grpc.(*funcDialOption).apply
    10.10s  8.29% 35.39%     13.70s 11.24%  runtime.findObject
     8.98s  7.37% 42.76%     32.87s 26.96%  runtime.scanobject
     7.49s  6.14% 48.90%      7.71s  6.32%  runtime.(*gcBits).bitp (inline)
     6.76s  5.55% 54.45%      6.76s  5.55%  runtime/internal/syscall.Syscall6
     3.90s  3.20% 57.65%      3.90s  3.20%  runtime.(*mspan).base (inline)
     3.15s  2.58% 60.23%      5.31s  4.36%  vitess.io/vitess/go/vt/grpcclient.SecureDialOption.WithTransportCredentials.func1
     3.09s  2.53% 62.76%      3.33s  2.73%  runtime.spanOf (inline)
     2.81s  2.31% 65.07%      2.81s  2.31%  runtime.tgkill
     2.56s  2.10% 67.17%      2.56s  2.10%  runtime.heapBitsForAddr
     2.38s  1.95% 69.12%      2.38s  1.95%  runtime.memclrNoHeapPointers
     2.12s  1.74% 70.86%      2.12s  1.74%  runtime.futex
     2.01s  1.65% 72.51%      2.01s  1.65%  runtime.procyield
     1.62s  1.33% 73.84%     40.24s 33.01%  google.golang.org/grpc.DialContext

derekperkins avatar Dec 21 '23 04:12 derekperkins

I just restarted the exact same workload, and after giving it time to warm up, dial options don't appear in the profile, despite it running a similar volume of queries. There must be a memory leak or something similar.

Showing nodes accounting for 21400ms, 49.22% of 43480ms total
Dropped 1064 nodes (cum <= 217.40ms)
Showing top 10 nodes out of 261
      flat  flat%   sum%        cum   cum%
   11470ms 26.38% 26.38%    11470ms 26.38%  runtime/internal/syscall.Syscall6
    4550ms 10.46% 36.84%     4550ms 10.46%  runtime.futex
     850ms  1.95% 38.80%      850ms  1.95%  compress/flate.(*compressor).reset
     830ms  1.91% 40.71%      830ms  1.91%  runtime.nextFreeFast (inline)
     720ms  1.66% 42.36%      720ms  1.66%  runtime.memclrNoHeapPointers
     710ms  1.63% 44.00%      710ms  1.63%  runtime.usleep
     700ms  1.61% 45.61%     4070ms  9.36%  runtime.mallocgc
     550ms  1.26% 46.87%      550ms  1.26%  runtime.memmove
     510ms  1.17% 48.05%     3300ms  7.59%  runtime.newobject
     510ms  1.17% 49.22%     1450ms  3.33%  runtime.scanobject

derekperkins avatar Dec 21 '23 04:12 derekperkins

Here's a CPU flamegraph

[flamegraph screenshot]

Here's the slow leak over time as shown via p99 latency growth

You can clearly see the two process restarts

[latency graph screenshot]

derekperkins avatar Dec 21 '23 05:12 derekperkins

I don't 100% follow the full path through database/sql and the Connector interface, but it appears that grpcclient.DialContext is being called on every single transaction. I expected it would only occasionally redial, whenever the underlying connection failed or hit an idle timeout. Maybe I'm interpreting things incorrectly.

@mdlayher You implemented the connector interface a few months ago. Do you have any insight here about how often dialing should be happening?

derekperkins avatar Dec 21 '23 06:12 derekperkins

I haven't noticed the vitessdriver in any of our profiles, though we use several different methods to connect to databases, and this is one of the least frequent, used only in certain circumstances.

I have also seen the panic mentioned earlier (https://github.com/vitessio/vitess/issues/12067), but have yet to track it down.

Apologies for not being more helpful! It's possible I implemented one of the database/sql interfaces in a suboptimal way, though the original issue here does seem to predate my changes as well.

mdlayher avatar Dec 21 '23 13:12 mdlayher

Thanks for the info, that's still helpful. The issue definitely predated your changes, and we didn't notice any change afterwards, so I'm confident that wasn't a contributor.

I'm suspicious of these lines, though I haven't found anything concrete yet. I'm guessing that not many people are passing in dial options like we are, which may be contributing to us seeing different behavior. https://github.com/vitessio/vitess/blob/54f0b33a652f929ecc43a949371d49364a3b94c8/go/vt/vitessdriver/driver.go#L110-L112
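The kind of accumulation that would match these profiles can be illustrated generically. This is a hypothetical sketch, not the actual vitess code: if a dial path appends its options to a shared package-level slice instead of a per-call copy, the retained slice (and the allocations it pins) grows with every dial.

```go
package main

import "fmt"

// opts stands in for a hypothetical shared, package-level option
// slice. Nothing ever resets it, so it only grows.
var opts []string

// dial appends this call's options to the shared slice instead of
// building a per-call copy; every dial leaks its options into opts.
func dial(extra ...string) int {
	opts = append(opts, extra...)
	return len(opts)
}

func main() {
	for i := 0; i < 1000; i++ {
		dial("maxRecv", "maxSend")
	}
	// 2000 options retained after 1000 dials, even though each dial
	// only needed 2 -- retained heap scales with total dial count.
	fmt.Println(len(opts))
}
```

A per-transaction dial plus this pattern would produce exactly the unbounded alloc_space growth and GC scan time seen in the profiles above.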

Our usage

	return vitessdriver.Configuration{
		Protocol:  "grpc",
		Address:   db.address,
		Target:    db.target,
		Streaming: db.streaming,

		DriverName: db.driver,

		// set dial options with max message sizes
		GRPCDialOptions: []grpc.DialOption{
			grpc.WithDefaultCallOptions(
				grpc.MaxCallRecvMsgSize(ngrpc.MaxMsgSize),
				grpc.MaxCallSendMsgSize(ngrpc.MaxMsgSize),
			),
			grpc.WithKeepaliveParams(keepalive.ClientParameters{
				Time:                300 * time.Second,
				Timeout:             600 * time.Second,
				PermitWithoutStream: true,
			}),
		},
	}

derekperkins avatar Dec 21 '23 15:12 derekperkins

Here's my Go code:

	return vitessdriver.Configuration{
		Protocol: fmt.Sprintf("grpc-%s", protocolID),
		// Specify a keyspace and cluster type for the vtgate session.
		Target: fmt.Sprintf("%s@%s", keyspace, cluster),
		Address: "xxx",
		GRPCDialOptions: append(
			[]grpc.DialOption{
				grpc.WithPerRPCCredentials(&grpcclient.StaticAuthClientCreds{
					Username: "xxx",
					Password: "xxx",
				}),
			},
			// The tunnel provides mTLS for the underlying transport,
			// so insecure credentials are used at the gRPC layer.
			grpc.WithTransportCredentials(insecure.NewCredentials()),
		),
	}, nil

And later:

	cfg.GRPCDialOptions = append(cfg.GRPCDialOptions,
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return tunnel.DialContext(ctx, "tcp", addr)
		}),
	)

mdlayher avatar Dec 21 '23 15:12 mdlayher