
replica isolation: multiple replicas degrade SELECT performance due to conflicts in Postgres

Open philip-stoev opened this issue 2 years ago • 13 comments

What version of Materialize are you using?

v0.26.1-dev (df970e97b)

How did you install Materialize?

Built from source

What is the issue?

The presence of multiple replicas causes SELECT statements to take ~5 seconds, likely due to contention on the Postgres side. The error log has this:

materialized: 2022-06-08T10:36:01.503433Z  INFO mz_persist_client::r#impl::machine: external operation apply_unbatched_cmd::cas failed, retrying in 32ms: determinate: db error: ERROR: could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on commit attempt with conflict in from prepared pivot.
HINT: The transaction might succeed if retried.: ERROR: could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on commit attempt with conflict in from prepared pivot.
HINT: The transaction might succeed if retried.

To reproduce:

\timing on
drop cluster default cascade;
create cluster default replicas();
create table t1 (f1 integer);
insert into t1 values (5);
create cluster replica default.replica1 SIZE '1';
select * from t1;
select * from t1;
select * from t1;
select * from t1;
select * from t1;
select * from t1;
select * from t1;
select * from t1;
select * from t1;
select * from t1;
create cluster replica default.replica2 SIZE '1';
create cluster replica default.replica3 SIZE '1';
create cluster replica default.replica4 SIZE '1';
create cluster replica default.replica5 SIZE '1';
create cluster replica default.replica6 SIZE '1';
create cluster replica default.replica7 SIZE '1';
create cluster replica default.replica8 SIZE '1';
create cluster replica default.replica9 SIZE '1';
create cluster replica default.replica10 SIZE '1';
create cluster replica default.replica11 SIZE '1';
create cluster replica default.replica12 SIZE '1';
create cluster replica default.replica13 SIZE '1';
create cluster replica default.replica14 SIZE '1';
create cluster replica default.replica15 SIZE '1';
create cluster replica default.replica16 SIZE '1';
create cluster replica default.replica17 SIZE '1';
create cluster replica default.replica18 SIZE '1';
create cluster replica default.replica19 SIZE '1';
create cluster replica default.replica20 SIZE '1';
select * from t1;
select * from t1;
select * from t1;
select * from t1;
select * from t1;
select * from t1;
select * from t1;
select * from t1;
select * from t1;
select * from t1;

Relevant log output

No response

philip-stoev avatar Jun 08 '22 10:06 philip-stoev

this concerns me! some unstructured thoughts:

  • 20 replicas feels like it should be fine. i'd expect to get closer to 50-100 given the current pre-M2 architecture
  • the fact that this log switched from DEBUG to INFO (and thus actually showed up) means it's already on the 4th retry
  • now that we have tracing coverage of persist, it should be easy to see concretely what's going on
  • I suspect the currently unsolved advance_local_inputs problem is already putting us at a disadvantage here. possibly, increasing the timestamp interval for tables would help as a short-term patch, but then it'd be hard to tell whether slow SELECT latencies were caused by the original problem or by the slower timestamp interval
  • post M1, we're going to move to a model where the replicas don't talk directly to persist. instead storage holds back compaction based on the progress messages it's getting over rpc from compute (toy sketch after this list). that should eliminate this problem
  • but still, i'm surprised we're falling over on 20 replicas
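
roughly, the model i have in mind, with made-up names (the real thing tracks antichains, not plain u64 timestamps):

use std::collections::HashMap;

#[derive(Default)]
struct CompactionPolicy {
    // latest progress timestamp reported over rpc by each compute replica
    replica_frontiers: HashMap<u64, u64>,
}

impl CompactionPolicy {
    fn report_progress(&mut self, replica_id: u64, frontier: u64) {
        self.replica_frontiers.insert(replica_id, frontier);
    }

    // compaction may only advance to the slowest replica's frontier, so no
    // replica ever needs data that was already compacted away, and replicas
    // never negotiate read capabilities with persist (and its consensus
    // backend) directly
    fn allowed_compaction_frontier(&self) -> u64 {
        self.replica_frontiers.values().copied().min().unwrap_or(0)
    }
}

fn main() {
    let mut policy = CompactionPolicy::default();
    policy.report_progress(1, 100); // fast replica
    policy.report_progress(2, 80);  // slow replica
    assert_eq!(policy.allowed_compaction_frontier(), 80);
}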

cc @aljoscha

danhhz avatar Jun 08 '22 14:06 danhhz

Does this reproduce on today's main? I tried it on my local machine but performance seems fine. There have been some changes that could affect this, of course. For example: #12997

aljoscha avatar Jun 09 '22 07:06 aljoscha

One could say that there has been some improvement, but this is still reproducible for me. Once all the replicas have been fired up, the SELECT starts taking 1-2 seconds, which is 1-2 seconds too many:

materialize=> select * from t1;
 f1 
----
  5
(1 row)

Time: 2074.205 ms (00:02.074)

The log looks as it did originally:

[1655363633786] } retrying in 256ms
compute-cluster-2-replica-4: 2022-06-16T07:13:56.713055Z  INFO mz_persist_client::r#impl::machine: snapshot as of Antichain { elements: [1655363635839] } not yet available for upper Antichain { elements: [1655363633786] } retrying in 128ms
compute-cluster-2-replica-3: 2022-06-16T07:13:56.723017Z  INFO mz_persist_client::r#impl::machine: snapshot as of Antichain { elements: [1655363635839] } not yet available for upper Antichain { elements: [1655363633786] } retrying in 64ms
compute-cluster-2-replica-10: 2022-06-16T07:13:56.740959Z  INFO mz_persist_client::r#impl::machine: snapshot as of Antichain { elements: [1655363635839] } not yet available for upper Antichain { elements: [1655363633786] } retrying in 128ms
environmentd: 2022-06-16T07:13:56.864358Z  INFO mz_persist_client::r#impl::machine: external operation apply_unbatched_cmd::cas failed, retrying in 32ms: determinate: db error: ERROR: could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on commit attempt with conflict in from prepared pivot.
HINT: The transaction might succeed if retried.: ERROR: could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on commit attempt with conflict in from prepared pivot.
HINT: The transaction might succeed if retried.

philip-stoev avatar Jun 16 '22 07:06 philip-stoev

Are you running this on your machine or cloud?

aljoscha avatar Jun 20 '22 07:06 aljoscha

I found the culprit! We periodically update the status of compute replicas [1]. This in turn causes builtin table updates to be sent in [2], which then go through StorageController::append(), which writes to persist and puts pressure on consensus. I pushed a PR that at least helps make this visible: https://github.com/MaterializeInc/materialize/pull/13240.

[1] https://github.com/MaterializeInc/materialize/blob/c479875b22914a2c07c30cf9ab4af29a19f28dd9/src/coord/src/coord.rs#L1005

[2] https://github.com/MaterializeInc/materialize/blob/c479875b22914a2c07c30cf9ab4af29a19f28dd9/src/coord/src/coord.rs#L5042
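
As a deliberately simplified model of that write path (these types are stand-ins for the real coordinator and storage controller, not actual Materialize code):

use std::thread::sleep;
use std::time::Duration;

struct BuiltinTableUpdate {
    replica_id: u64,
    status: &'static str,
}

struct StorageController;

impl StorageController {
    // in the real system each append becomes a persist write, i.e. a
    // compare-and-set against the consensus (Postgres/CRDB) table
    fn append(&mut self, updates: Vec<BuiltinTableUpdate>) {
        println!("consensus CaS covering {} update(s)", updates.len());
    }
}

fn main() {
    let mut storage = StorageController;
    let replica_ids: Vec<u64> = (1..=20).collect();
    for _tick in 0..3 {
        // re-reporting every replica on every tick means 20 replicas turn
        // into 20 separate persist writes every 5 seconds, all contending
        // with the CaS operations issued by SELECTs on the same backend
        for &replica_id in &replica_ids {
            storage.append(vec![BuiltinTableUpdate { replica_id, status: "ready" }]);
        }
        sleep(Duration::from_secs(5));
    }
}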

aljoscha avatar Jun 23 '22 15:06 aljoscha

@philip-stoev Were you running this on Kubernetes, or "locally"? Because this should only happen when running outside of Kubernetes (at least with the culprit I found above).

aljoscha avatar Jun 23 '22 16:06 aljoscha

@mjibson @jkosh44 might we be able to batch these up into fewer persist calls?

danhhz avatar Jun 23 '22 16:06 danhhz

might we be able to batch these up into fewer persist calls

Today is the first time I've seen these ComputeInstanceStatus messages, so I don't know much about them. My read of the code produced some ideas:

  1. We have a place in coord.rs that takes a single compute event off a future and ships it. It'd be cool if, when we see a single event there, we also drained that future until it returns Pending and shipped all the events at once.
  2. Change dataflow_client.watch_compute_services to produce all of the currently available events instead of one at a time. This is similar to the above, but it moves the batching logic to the producer instead of the consumer.

These both involve rust async/future/channel/cancellation stuff that someone would have to think about. @guswynn might have thoughts about which is the better choice to pursue.
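
To make option 1 concrete, here is a minimal sketch assuming the events arrive on a tokio unbounded mpsc channel; ComputeEvent and ship_batch are made-up stand-ins for the real coordinator pieces:

use tokio::sync::mpsc::UnboundedReceiver;

#[derive(Debug)]
struct ComputeEvent {
    replica_id: u64,
    status: String,
}

async fn handle_compute_events(mut rx: UnboundedReceiver<ComputeEvent>) {
    // wait for at least one event...
    while let Some(first) = rx.recv().await {
        let mut batch = vec![first];
        // ...then greedily drain whatever is already queued, so the whole
        // backlog becomes a single downstream write instead of one per event
        while let Ok(event) = rx.try_recv() {
            batch.push(event);
        }
        ship_batch(batch).await;
    }
}

// stand-in for the single catalog_transact / StorageController::append call
// that would cover every update in the batch
async fn ship_batch(batch: Vec<ComputeEvent>) {
    println!("shipping {} event(s) in one write", batch.len());
}

Option 2 would be essentially the same drain loop, just moved to the producer side of the channel.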

maddyblue avatar Jun 23 '22 17:06 maddyblue

I was running locally.

philip-stoev avatar Jun 23 '22 17:06 philip-stoev

Today is the first time I've seen these ComputeInstanceStatus messages

oh! looks like this maybe got added by @teskje so looping him in as well

I'm not familiar enough with the coord code to have an opinion on the mechanics, but naively it seems like it should be possible to buffer them somewhere and provide them in batch to catalog_transact (each call of which turns into a persist write and, if i'm reading this right, also a catalog write?) instead of one at a time

danhhz avatar Jun 23 '22 17:06 danhhz

Some more context in this thread: https://materializeinc.slack.com/archives/CMHDK0DK8/p1655995859311529.

I can adjust the process orchestrator to only issue status updates for newly added replicas instead of for all replicas every 5 seconds. This should fix the performance issue in dev environments. On cloud there shouldn't be an issue in the first place, unless we've misunderstood how kubernetes watches work.
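
Roughly what I mean, with illustrative types rather than the real orchestrator ones:

use std::collections::HashMap;

#[derive(Clone, PartialEq, Eq)]
enum ReplicaStatus {
    NotReady,
    Ready,
}

struct StatusUpdate {
    replica_id: u64,
    status: ReplicaStatus,
}

#[derive(Default)]
struct StatusTracker {
    last_seen: HashMap<u64, ReplicaStatus>,
}

impl StatusTracker {
    // returns updates only for replicas that are new or whose status changed,
    // so a steady-state cluster produces no builtin-table writes at all
    fn diff(&mut self, observed: &[(u64, ReplicaStatus)]) -> Vec<StatusUpdate> {
        let mut updates = Vec::new();
        for (id, status) in observed {
            if self.last_seen.get(id) != Some(status) {
                self.last_seen.insert(*id, status.clone());
                updates.push(StatusUpdate { replica_id: *id, status: status.clone() });
            }
        }
        updates
    }
}

fn main() {
    let mut tracker = StatusTracker::default();
    let observed = vec![(1, ReplicaStatus::Ready), (2, ReplicaStatus::NotReady)];
    // first pass: both replicas are new, so two updates are emitted
    assert_eq!(tracker.diff(&observed).len(), 2);
    // second pass with identical statuses: nothing to write
    assert_eq!(tracker.diff(&observed).len(), 0);
}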

We can also think about the batching that @mjibson proposed, although that might be premature optimization.

teskje avatar Jun 23 '22 18:06 teskje

ah okay, this answers a lot of questions that I had! if we're only issuing an update when a replica is added (or actually changes status) then the frequency (1 vs 5 seconds) probably doesn't matter

danhhz avatar Jun 23 '22 19:06 danhhz

I opened a PR here: https://github.com/MaterializeInc/materialize/pull/13247

I'm not sure it resolves the performance issue, though; at least in my testing it doesn't. The effect is much less pronounced on my machine, but I see <20ms before adding the replicas and >100ms after. This doesn't change with my PR, so it looks like the slowdown is caused by something other than the replica status updates. @philip-stoev can you confirm that you see the same results?

teskje avatar Jun 24 '22 07:06 teskje

I don't believe this to be a problem anymore in practice, though it's hard to assert a negative result!

For my test setup, I created a new environment in staging. I then ran a slightly updated flavor of Philip's prelude:

\timing on
drop cluster default cascade;
create cluster default replicas();
create table t1 (f1 integer);
insert into t1 values (5);
create cluster replica default.replica1 SIZE 'xsmall';
select * from t1;
...
select * from t1;

I then created the maximum number of replicas one can create for a single cluster (5):

create cluster replica default.replica2 SIZE 'xsmall';
create cluster replica default.replica3 SIZE 'xsmall';
create cluster replica default.replica4 SIZE 'xsmall';
create cluster replica default.replica5 SIZE 'xsmall';

And reran the select statements many times:

select * from t1;
...
select * from t1;

While I didn't do any super exhaustive analysis, the numbers are the same before and after, all between 150-200ms, and more crucially there are no aberrant spikes into the 1+s range. I created another cluster with 5 replicas and saw the same results. (Our new replica limits help bound how hard it is to exacerbate this problem!)

To boot, we saw 0 retries, including serializability conflicts, throughout my poking at this:

Screen Shot 2022-09-21 at 4 47 45 PM

That said, the timings when I ran locally weren't great and did regress when creating more replicas. I think this is because consensus ran against either my untuned Postgres, which has substantially more serializability retries and a more limited number of connections than CRDB does, or a local single-node CRDB, where I'm running a non-arm binary on an M1 mac and it's sloooooooooooooow. I'm less concerned about this because we don't run against either of those setups in a deployed setting.

@philip-stoev let me know whether we can close the issue. From what I've seen, we're not getting added contention in Postgres/CRDB/persist consensus from additional replicas at this point.

pH14 avatar Sep 21 '22 20:09 pH14

I'm running a non-arm binary on an M1 mac

The https://github.com/MaterializeInc/materialize/tree/main/misc/images/materialized docker image has a cockroach arm binary that works on m1 macs. You could try building and using that image directly (or use mzcompose which will use it) if you need to test with perf closer to production.

maddyblue avatar Sep 22 '22 15:09 maddyblue

The most recent crdb release candidates have an arm build, so I went ahead and ran this locally. Something wonky is still going on, though I didn't dig any deeper. With one replica I was seeing 12-17ms; with all of them it looked like:

Time: 55.383 ms
Time: 38.499 ms
Time: 105.955 ms
Time: 256.743 ms
Time: 1903.088 ms (00:01.903)
Time: 42358.595 ms (00:42.359)
Time: 785.108 ms
Time: 731.183 ms
Time: 1172.000 ms (00:01.172)
Time: 1373.999 ms (00:01.374)

Full log https://gist.github.com/danhhz/f1fda576b4da7463024fda95919496ca

danhhz avatar Nov 29 '22 15:11 danhhz

I worked on reproducing this yesterday and did some additional testing.

It appears that a lot of the variance comes from elevated contention on the shard's entry in CRDB. One non-obvious source of contention is that the query will return once the first replica has finished writing its output to persist, but all the other replicas are likely still running. Running queries back-to-back, as in Dan's repro, can build up potentially hundreds of dataflows contending over the single shard to answer queries that have already had results returned to the user. This contention seems to explain the vast majority of the variance in timing in the results.

Two things I'm not yet confident in:

  • I'm not sure why even single queries (i.e. not run back-to-back) still cause elevated latencies... IIUC only one CAS needs to happen for the result to be computed and written to the output shard, and I'm unclear why that first CAS would be any slower than normal.
  • I haven't repro'd anything like Dan's 42 seconds on main. (While the slowest replica can take tens of seconds if you've submitted many queries in a row, the fastest replica always seems to complete within a few seconds; the result Dan saw seems possible with this model but very unlikely.)

In either case my testing showed pretty reasonable latencies for realistic numbers of replicas, and individual CRDB latencies didn't seem obscene even when running CRDB locally. It's not clear to me whether we should count this resolved; we'll definitely continue to see pileups in testing with large numbers of replicas, but it seems unlikely that this translates to meaningful issues for production workloads.

bkirwi avatar Dec 21 '22 21:12 bkirwi

Since this issue was filed, we now have a system limit of max_replicas_per_cluster set to 5, so while we may still have theoretical contention issues with lots of replicas, we no longer have lots of replicas 😄

If we did want to reduce contention on ad hoc/unindexed queries further, another option is to keep a pool of leased readers per shard, e.g. an ad hoc query arrives --> create or lease+heartbeat a reader for the shard from the pool. This would remove the need for any CaS operations against CRDB during a read, outside of the very first query (or if 15m goes by without another one).
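
A rough sketch of what that pool might look like; every type here is hypothetical, and a real leased reader would also need background heartbeating, which is elided:

use std::collections::HashMap;
use std::time::{Duration, Instant};

type ShardId = String;

struct LeasedReader {
    leased_at: Instant,
}

struct ReaderPool {
    lease_ttl: Duration,
    readers: HashMap<ShardId, LeasedReader>,
}

impl ReaderPool {
    fn new(lease_ttl: Duration) -> Self {
        ReaderPool { lease_ttl, readers: HashMap::new() }
    }

    fn get_reader(&mut self, shard: &ShardId) -> &LeasedReader {
        let ttl = self.lease_ttl;
        let expired = self
            .readers
            .get(shard)
            .map_or(true, |r| r.leased_at.elapsed() > ttl);
        if expired {
            // only this path would perform a CaS against consensus
            // (registering or re-registering the lease); everything else
            // is a purely local lookup
            self.readers
                .insert(shard.clone(), LeasedReader { leased_at: Instant::now() });
        }
        self.readers.get(shard).expect("lease present or just inserted")
    }
}

fn main() {
    let mut pool = ReaderPool::new(Duration::from_secs(15 * 60));
    let shard: ShardId = "s1234".into();
    let _ = pool.get_reader(&shard); // first query: pays the CaS to take a lease
    let _ = pool.get_reader(&shard); // subsequent queries: local reuse, no CaS
}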

pH14 avatar Mar 24 '23 18:03 pH14

Alright: since this issue is substantially improved, and we have a tighter limit on the number of replicas we need to worry about in practice, I'll close this one out. (The system limits are there for this reason!)

The main bottleneck here is that opening a new reader performs a write to create the lease, and removing reader leases is tracked elsewhere in our roadmap: see #19292.

bkirwi avatar May 19 '23 19:05 bkirwi