
Big insert effectively prevents any getPage requests

Open kelvich opened this issue 3 years ago • 9 comments

If we run a big insert in one session:

create table t2 as select generate_series(1,10000000000);

and concurrently try to get some uncached pages from the pageserver in another session:

\d+

the second backend will fail with:

ERROR:  could not read block 0 in rel 1663/16385/1255.0 from page server at lsn 0/3E5FC238
DETAIL:  page server returned error: Timed out while waiting for WAL record at LSN 0/3E5FC238 to arrive, last_record_lsn 0/1043E8D8 disk consistent LSN=0/1695D48

kelvich avatar Feb 04 '22 20:02 kelvich

All back-pressure thresholds are disabled. Isn't this the expected behavior in that case?

knizhnik avatar Feb 04 '22 21:02 knizhnik

cc @ololobus https://github.com/zenithdb/console/issues/562

kelvich avatar Feb 04 '22 23:02 kelvich

Yeah, will do it asap

ololobus avatar Feb 04 '22 23:02 ololobus

Backpressure should be turned on in both staging and prod, so it's worth trying a similar test again.

ololobus avatar Feb 06 '22 01:02 ololobus

I can still reproduce this. My hunch is lock starvation in the pageserver, but it needs to be investigated.

hlinnaka avatar Feb 22 '22 15:02 hlinnaka

As written in the comment on the backpressure settings, WAL replay speed is about 10 MB/sec. The current default for "max_replication_write_lag" is 500 MB, which should give a maximum wait-for-LSN time of about 50 sec (< 60 sec). But in practice, with the safekeeper on my notebook, the speed is slower, and a 500 MB write_lag really does lead to timeout expiration and therefore read errors. So to prevent noticeable delays (> 1 second) we should not use write_lag > 10 MB, at least in production. I can create a PR for it, changing the default value in compute.rs. But the console has its own limits...
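
The arithmetic behind these numbers can be sketched as follows. This is only a back-of-the-envelope model, assuming the worst-case wait is simply the write lag divided by the replay speed. The function names are hypothetical, the 60-second timeout is taken from the "< 60 sec" remark above, and the 5 MB/sec figure is an illustrative slower speed, not a measurement.

```python
# Assumption: wait-for-LSN timeout inferred from the "< 60 sec" remark above.
WAIT_LSN_TIMEOUT_SEC = 60.0


def max_wait_seconds(write_lag_mb: float, replay_mb_per_sec: float) -> float:
    """Worst-case time for WAL replay to work through write_lag_mb of backlog."""
    if replay_mb_per_sec <= 0:
        raise ValueError("replay speed must be positive")
    return write_lag_mb / replay_mb_per_sec


def max_write_lag_mb(target_wait_sec: float, replay_mb_per_sec: float) -> float:
    """Largest write lag that keeps the worst-case wait within target_wait_sec."""
    return target_wait_sec * replay_mb_per_sec


def would_time_out(write_lag_mb: float, replay_mb_per_sec: float) -> bool:
    """True if the worst-case wait exceeds the assumed timeout."""
    return max_wait_seconds(write_lag_mb, replay_mb_per_sec) > WAIT_LSN_TIMEOUT_SEC


if __name__ == "__main__":
    # 500 MB of lag at 10 MB/sec: 50 sec, just under the timeout.
    print(max_wait_seconds(500, 10), would_time_out(500, 10))
    # Same lag at a hypothetical slower 5 MB/sec: 100 sec, past the timeout.
    print(max_wait_seconds(500, 5), would_time_out(500, 5))
    # Keeping delays within 1 sec at 10 MB/sec allows at most 10 MB of lag.
    print(max_write_lag_mb(1, 10))
```

Under this model, any replay speed below roughly 8.3 MB/sec turns a 500 MB lag into a timeout, which is consistent with the slower notebook setup failing.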

knizhnik avatar Mar 01 '22 16:03 knizhnik

Could you help reproduce this and propose a solution, @MMeent?

stepashka avatar Apr 14 '22 09:04 stepashka

@jcsp is this still the case after the batch ingestion contributor PR?

shanyp avatar Dec 26 '23 11:12 shanyp

The original report is from almost two years ago, so I don't think we can say much without re-testing this.

jcsp avatar Jan 02 '24 09:01 jcsp