
Epic: resolve the pageserver backpressure problems

Open kelvich opened this issue 2 years ago • 13 comments

The current plan is:

  • [x] tweak the backpressure settings (already on staging) (this was reverted)
  • [x] tweak the backpressure settings again on staging
  • [x] tweak the backpressure settings on production once the fix is confirmed on staging
  • [ ] bring back the patches from Konstantin

Originally we had this plan:

We have quite a lot of issues about backpressure and several attempts to fix it. ISTM the two most actionable things we can do are:

  • [ ] implement time-based backpressure
  • [ ] write down doc on how it works right now

List of backpressure-related tasks; if we can resolve half of them, we'll be happy:

  • [ ] https://github.com/neondatabase/neon/issues/596
  • [ ] https://github.com/neondatabase/neon/issues/1207
  • [ ] https://github.com/neondatabase/neon/issues/1263
  • [ ] https://github.com/neondatabase/neon/issues/1587
  • [ ] https://github.com/neondatabase/neon/issues/1763
  • [ ] https://github.com/neondatabase/neon/issues/2023
  • [ ] https://github.com/neondatabase/neon/issues/1409
  • [ ] https://github.com/neondatabase/neon/issues/1773

kelvich avatar Jul 05 '22 08:07 kelvich

So my understanding of the current status is the following:

  1. 100MB write_replication_lag is too much: it causes multi-minute delays and wait_for_lsn timeout expirations.
  2. Reducing write_replication_lag to 10MB mostly eliminates the large-delay problem (delays stay within a few seconds). But in some cases it shows a significant (~2x) slowdown of insertion compared with a 100MB lag. This is why I tried to implement throttling strategies other than stop-and-wait (a rough sketch of the current stop-and-wait approach follows this list).
  3. @aome510 implemented an exhaustive test for backpressure which demonstrates a few things:
  • There is no big difference in write speed with different write_lag values
  • Results greatly depend on the target system's IO subsystem. In particular, performance on EC2 servers is much higher than on a laptop.
  • Some pageserver operations (like flushing open layers) cause a write storm that delays all IO operations (including reads) by several seconds. Backpressure cannot protect us from such delays.
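
For reference (and as a first cut at the "write down how it works" task above), here is a minimal sketch of the stop-and-wait throttling mentioned in point 2. The names, the polling loop, and the default values are illustrative assumptions, not the actual compute code; the real logic compares the current WAL position against the write/flush positions reported in feedback messages:

```python
import time

MB = 1024 * 1024

def backpressure_lag(current_lsn, write_lsn, flush_lsn,
                     max_write_lag=10 * MB, max_flush_lag=10 * 1024 * MB):
    # How far (in bytes) the compute's WAL position is ahead of what the
    # pageserver has written / flushed, beyond the configured limits.
    # The limit values mirror the ones discussed in this thread.
    write_lag = current_lsn - write_lsn
    flush_lag = current_lsn - flush_lsn
    return max(write_lag - max_write_lag, flush_lag - max_flush_lag, 0)

def throttle_writes(get_current_lsn, get_feedback, poll_interval=0.1):
    # Naive stop-and-wait: the writer simply sleeps until the lag drops back
    # under the limits reported via safekeeper/pageserver feedback.
    while True:
        write_lsn, flush_lsn = get_feedback()
        if backpressure_lag(get_current_lsn(), write_lsn, flush_lsn) == 0:
            return
        time.sleep(poll_interval)
```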

There is an alternative to backpressure which can reduce SELECT query latency: calculating the last-written LSN more precisely for a particular page (relation/chunk). I have a PR for it.
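
A minimal sketch of that idea, assuming a per-relation/chunk cache of the last-written LSN so that reads only wait for the LSN that actually touched their data (names and granularity are illustrative, not the PR's actual code):

```python
CHUNK_PAGES = 128  # granularity of the cache; the exact value is an assumption

class LastWrittenLsnCache:
    # Illustrative sketch, not the actual PR: remember the last LSN that
    # modified each chunk of a relation, so reads of cold chunks don't have to
    # wait for WAL that only touched other data.
    def __init__(self, global_last_written_lsn=0):
        self.global_lsn = global_last_written_lsn
        self.chunk_lsn = {}  # (relation, chunk_no) -> last written LSN

    def note_write(self, relation, block_no, lsn):
        key = (relation, block_no // CHUNK_PAGES)
        self.chunk_lsn[key] = max(self.chunk_lsn.get(key, 0), lsn)
        self.global_lsn = max(self.global_lsn, lsn)

    def request_lsn(self, relation, block_no):
        # Fall back to the global last-written LSN for chunks we know nothing
        # about; otherwise use the (usually much older) per-chunk LSN.
        return self.chunk_lsn.get((relation, block_no // CHUNK_PAGES),
                                  self.global_lsn)
```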

So based on the results of test_wal_backpressure test I have made the following conclusions:

  1. Reducing write_lag to 10MB is enough to avoid overly large delays.
  2. Write speed is not significantly affected by reducing write_lag, so changing the throttling algorithm does not seem critical right now. We can still use the naive stop-and-wait.
  3. A more precise calculation of last_written_lsn (maintaining a global cache for it) is a good idea and helps to significantly increase performance.

Since changing the default max_replication_write_lag to 10MB is already merged in main, I think the only thing we should do for backpressure in July is to review and commit PR https://github.com/neondatabase/postgres/pull/177

knizhnik avatar Jul 05 '22 14:07 knizhnik

Since changing the default max_replication_write_lag to 10MB is already merged in main, ...

Oh, can we close https://github.com/neondatabase/neon/pull/1793 then?

hlinnaka avatar Jul 05 '22 15:07 hlinnaka

Since changing the default max_replication_write_lag to 10MB is already merged in main, ...

Oh, can we close #1793 then?

Oh sorry, it is not yet merged. Can you review #1793 so that I can merge it?

knizhnik avatar Jul 05 '22 19:07 knizhnik

To summarize what we have done with the current backpressure approach,

  • I added backpressure tests in https://github.com/neondatabase/neon/pull/1919 that confirm the big read latency reported in https://github.com/neondatabase/neon/issues/1763
  • @knizhnik added a patch (https://github.com/neondatabase/postgres/pull/177) adding an LSN cache to mitigate the above issue. The new tests confirmed that the maximum read latency can be reduced from about 15s to 2s with the LSN cache. I think this patch can resolve some of the issues in the Epic list involving large latencies for read requests:
    • https://github.com/neondatabase/neon/issues/1763
    • https://github.com/neondatabase/neon/issues/1207
  • we encountered some failed tests with Konstantin's new patch in the PG main branch; this is discussed in https://github.com/neondatabase/postgres/pull/182.
  • as @hlinnaka suggested in that PR, Konstantin created a patch (https://github.com/neondatabase/postgres/pull/183) to revert the new changes (needs reviewing)

In short, the current status is quite "messy": we did have a patch to mitigate some of the backpressure issues, but we cannot update neon to use it because of some failed tests [1].

[1]: I'm not too sure about the status of the failed tests. Are they caused by flaky tests unrelated to the new changes, or by the new changes themselves? Maybe Konstantin can provide more insights on this.

aome510 avatar Jul 25 '22 16:07 aome510

Some more information from my side:

  1. Even without the last-written LSN cache (pageserver#177) I have not seen large latencies in the #1919 and #1763 tests with max_replication_write_lag=15MB and wal_log_hints=off. The second setting is important because, when enabled, autovacuum may produce a lot of WAL without obtaining an XID and so is not blocked by GC.
  2. I have never seen (even in CI results) errors like "could not read block...". The errors which I saw in CI are mostly related to "flaky" tests like test_wal_acceptor; at least, I saw similar failures on other PRs not related to backpressure.
  3. My concern about reducing max_replication_write_lag was that it may slow down writers, but the results of #1919 and #1763 don't confirm that. So there is no urgent need to choose another throttling policy to replace the current stop-and-wait.

Concerning the idea of time-based backpressure: I do not think it can radically reduce latencies compared with the current implementation (a hypothetical sketch follows the list below):

  1. It assumes that the rate of producing WAL at the compute node and the rate of replaying it at the pageserver are the same, which is obviously not true.
  2. Timestamps are not currently included in the feedback message. We could certainly add them to the protocol, or use our LSN->timestamp mapping... but I am afraid that would be too expensive.
  3. There may be clock differences between the local times on different machines.
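
For illustration only, a time-based check would look roughly like the hypothetical sketch below; it makes the cost and clock-skew problems from points 2 and 3 concrete. lsn_to_timestamp and the threshold are assumptions, nothing here exists in the code:

```python
import time

def replay_lag_seconds(last_applied_lsn, lsn_to_timestamp):
    # lsn_to_timestamp() stands in for some LSN -> wall-clock mapping; it is an
    # assumed helper, not an existing API, and the lookup could be expensive.
    applied_at = lsn_to_timestamp(last_applied_lsn)
    # Comparing against local time is also sensitive to clock skew between hosts.
    return time.time() - applied_at

def should_throttle(last_applied_lsn, lsn_to_timestamp, max_lag_seconds=10.0):
    return replay_lag_seconds(last_applied_lsn, lsn_to_timestamp) > max_lag_seconds
```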

knizhnik avatar Jul 25 '22 20:07 knizhnik

  1. Even without the last-written LSN cache (pageserver#177) I have not seen large latencies in the #1919 and #1763 tests with max_replication_write_lag=15MB and wal_log_hints=off. The second setting is important because, when enabled, autovacuum may produce a lot of WAL without obtaining an XID and so is not blocked by GC.

I did disable wal_log_hints in the tests but still got a similar maximum latency in https://github.com/neondatabase/neon/runs/7508653405?check_suite_focus=true:

test_pgbench_intensive_init_workload[neon_on-1000].read_latency_max: 10.495 s
test_pgbench_intensive_init_workload[neon_on-1000].read_latency_avg: 6.723 s
test_pgbench_intensive_init_workload[neon_on-1000].read_latency_stdev: 2.501 s

aome510 avatar Jul 26 '22 15:07 aome510

Yes, you are right. Backpressure helps to minimize the delay of a single get_page_at_lsn call. But if we have to fetch hundreds of pages (as in test_pgbench_intensive_init_workload, which performs select count(*) from foo), then the delay can be really large, and the larger the source table is, the larger the delay will be. With backpressure disabled we can provide performance similar to vanilla if the table doesn't fit in memory.

In this particular case (test_pgbench_intensive_init_workload) the last-written LSN cache should definitely help.
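
To make the scan argument concrete, a purely illustrative model (hypothetical names, not pageserver code):

```python
# Illustrative model of why a large scan can stall: each get_page_at_lsn call
# must wait whenever its request LSN is ahead of what the pageserver has
# already applied.
def pages_that_must_wait(page_request_lsns, pageserver_applied_lsn):
    return [page for page, lsn in page_request_lsns.items()
            if lsn > pageserver_applied_lsn]

# With a single global last-written LSN, every page of the scan gets the same
# recent request LSN, so all of them end up waiting. With the per-chunk cache
# sketched earlier, only pages whose chunks were modified after
# pageserver_applied_lsn have to wait.
```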

knizhnik avatar Jul 26 '22 19:07 knizhnik

@kelvich mentioned that we may be OK with tweaking the backpressure settings (to 10MB or 15MB) without immediate changes to the backpressure logic. This is configured via the console, so we can change it and test it. Everybody agreed.

stepashka avatar Aug 01 '22 15:08 stepashka

There are two bp settings max_replication_write_lag and max_replication_flush_lag, set to 500MB and 10GB now. Are we going to set both to 10MB?

ololobus avatar Aug 01 '22 15:08 ololobus

There are two bp settings max_replication_write_lag and max_replication_flush_lag, set to 500MB and 10GB now. Are we going to set both to 10MB?

@hlinnaka @knizhnik can you clarify?

ololobus avatar Aug 04 '22 16:08 ololobus

There are two bp settings max_replication_write_lag and max_replication_flush_lag, set to 500MB and 10GB now. Are we going to set both to 10MB?

@hlinnaka @knizhnik can you clarify?

No, just max_replication_write_lag

knizhnik avatar Aug 04 '22 18:08 knizhnik

We've done everything that we wanted for now. Most of the issues should be gone; the remaining ones we're leaving for later (backlog).

stepashka avatar Sep 05 '22 15:09 stepashka

Just set max_replication_write_lag to 15 MB on prod. My new and old computes started fine.

ololobus avatar Sep 05 '22 15:09 ololobus

@kelvich is this something that we need to put more effort into?

shanyp avatar Jul 19 '23 07:07 shanyp

@shanyp we should rescope this: there's still unfinished work, but we're in a different world now :)

stepashka avatar Jul 27 '23 10:07 stepashka

Stale.

jcsp avatar Mar 11 '24 13:03 jcsp