
Sporadic slowness in GCS read replica

Open zachmu opened this issue 3 years ago • 3 comments

A customer observed significant added latency when querying a read replica that used a GCS bucket as its remote. I was able to reproduce the problem, but only intermittently. When it occurred, the added latency was severe, on the order of 8+ seconds, and it was present even when the replica was caught up. But after I got a profiler attached, the problem vanished.

This needs additional investigation. The current hypothesis is that tail latency on GCS is much higher than anticipated, and that it might be triggered by recent writes to files like the manifest.

Experimental setup:

% dolt config --list --global | grep sqlserver
sqlserver.global.dolt_replicate_all_heads = 1
sqlserver.global.dolt_replication_remote_url_template = gs://zachmu-test/dolt-remotes/{database}
sqlserver.global.dolt_read_replica_remote = origin

Separately, I pushed a database to the remote indicated. Then I started a server with dolt sql-server and timed MySQL client reads against it:

mysql -h 127.0.0.1 -P3306 -uroot db10 -e "select * from t1"  0.00s user 0.02s system 0% cpu 10.377 total
mysql -h 127.0.0.1 -P3306 -uroot db10 -e "select * from t1"  0.00s user 0.02s system 0% cpu 8.021 total

Outside these latency spikes, the latency added by replication was much smaller, under 200 ms at most:

mysql -h 127.0.0.1 -P3306 -uroot db10 -e "select * from t1"  0.00s user 0.02s system 6% cpu 0.230 total
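For reference, the setup steps above would look roughly like the following. This is only a sketch: the database name db10 and the remote name origin are inferred from the config and timings above, and the exact commands and flags used were not recorded in this issue.

# sketch only; db10 and origin are assumptions, not the exact commands used
# on the source repository: push the database to the GCS remote
cd db10
dolt remote add origin gs://zachmu-test/dolt-remotes/db10
dolt push origin main

# on the read replica: start the SQL server, then query it with the mysql client
dolt sql-server --host 127.0.0.1 --port 3306
mysql -h 127.0.0.1 -P3306 -uroot db10 -e "select * from t1"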

To further debug this problem: instrument the code to measure RPC latency to GCS, then run a primary / replica pair, with the primary doing periodic pushes and the replica doing periodic reads. Use this to establish a latency distribution for replica reads.
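The end-to-end half of that plan could look something like the sketch below. This is illustrative only: the database name, intervals, and output file are assumptions, and measuring per-RPC GCS latency would still require instrumentation inside the remote storage client itself.

# sketch: build a latency distribution for replica reads (names and intervals illustrative)
# in one shell, have the primary commit and push periodically
# (assumes dolt_replicate_to_remote is set to origin on the primary):
while true; do
    dolt sql -q "call dolt_commit('--allow-empty', '-m', 'tick')"
    sleep 30
done

# in another shell, sample read latency on the replica:
for i in $(seq 1 200); do
    start=$(date +%s%3N)
    mysql -h 127.0.0.1 -P3306 -uroot db10 -e "select * from t1" > /dev/null
    echo $(( $(date +%s%3N) - start )) >> replica_read_latency_ms.txt
    sleep 5
done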

zachmu · Nov 23 '22 19:11

The same customer reports consistent slowness on a read replica backed by GCS. The database in question has 1,200 branches.

Configuration:

+--------------------------------------+--------+
| Variable_name                        | Value  |
+--------------------------------------+--------+
| dolt_allow_commit_conflicts          | 0      |
| dolt_async_replication               | 0      |
| dolt_force_transaction_commit        | 0      |
| dolt_read_replica_force_pull         | 0      |
| dolt_read_replica_remote             | origin |
| dolt_replicate_all_heads             | 1      |
| dolt_replicate_heads                 |        |
| dolt_replicate_to_remote             |        |
| dolt_replication_remote_url_template |        |
| dolt_skip_replication_errors         | 0      |
| dolt_transaction_commit              | 0      |
| dolt_transactions_disabled           | 0      |
+--------------------------------------+--------+

zachmu · Dec 06 '22 22:12

I ran a quick remotesrv repro (sketched below), looping over:

  1. make $step branches (same commit as main)
  2. run time dolt sql -q "select * from dolt_branches where name = 'branch$((($i+1)*$step))'"
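Filled in, that loop could look roughly like this. It's a sketch: the repository paths, step size, and branch-naming scheme are assumptions, with branch creation happening in the repository served by remotesrv and the timed query running against the read replica.

# sketch of the repro loop (paths, step size, and names are illustrative)
step=100
for i in $(seq 0 9); do
    # 1. add $step branches pointing at the same commit as main, in the repo behind remotesrv
    (cd /path/to/remote-repo &&
     for j in $(seq 1 $step); do dolt branch "branch$(( i*step + j ))" main; done)

    # 2. time 10 reads on the replica against the newest branch
    (cd /path/to/replica &&
     for k in $(seq 1 10); do
         time dolt sql -q "select * from dolt_branches where name = 'branch$(( (i+1)*step ))'"
     done)
done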

We average the latency of 10 queries for a remote with 100, 200, ...1000 branches:

[attached chart: branches vs. avg latency]

-- 100 branches
user	0m0.088s
user	0m0.085s
user	0m0.086s
user	0m0.095s
user	0m0.088s
user	0m0.084s
user	0m0.086s
user	0m0.084s
user	0m0.085s
user	0m0.084s

-- 200 branches
user	0m0.134s
user	0m0.142s
user	0m0.134s
user	0m0.133s
user	0m0.140s
user	0m0.138s
user	0m0.134s
user	0m0.134s
user	0m0.133s
user	0m0.131s

-- 300 branches 
user	0m0.202s
user	0m0.188s
user	0m0.186s
user	0m0.187s
user	0m0.187s
user	0m0.189s
user	0m0.192s
user	0m0.186s
user	0m0.190s
user	0m0.187s

...

-- 900 branches
user	0m0.828s
user	0m0.830s
user	0m0.853s
user	0m0.824s
user	0m0.840s
user	0m0.831s
user	0m0.848s
user	0m0.841s
user	0m0.840s
user	0m0.796s

-- 1000 branches
user	0m1.385s
user	0m1.164s
user	0m0.970s
user	0m0.980s
user	0m0.961s
user	0m1.018s
user	0m0.967s
user	0m0.975s
user	0m1.113s
user	0m1.074s

max-hoffman · Dec 08 '22 00:12

I've collected evidence suggesting that 1) we execute a number of remote fetches proportional to the number of branches, and 2) for static replicas with 1000 branches on a local remote server, batching those fetches reduces read latency by 8-10x. Working on a PR now.

max-hoffman · Dec 09 '22 01:12

This PR should substantially address these latency problems; reads are now consistently under 100 ms when the replica is already caught up.

https://github.com/dolthub/dolt/pull/5005

zachmu · Dec 16 '22 22:12