neon
Big insert effectively prevents any getPage requests
If we run insert in one session:
create table t2 as select generate_series(1,10000000000);
and concurrently try to fetch some uncached pages from the pageserver:
\d+
then the second backend will fail with:
ERROR: could not read block 0 in rel 1663/16385/1255.0 from page server at lsn 0/3E5FC238
DETAIL: page server returned error: Timed out while waiting for WAL record at LSN 0/3E5FC238 to arrive, last_record_lsn 0/1043E8D8 disk consistent LSN=0/1695D48
All back-pressure thresholds are disabled. Isn't this the expected behavior in that case?
cc @ololobus https://github.com/zenithdb/console/issues/562
Yeah, will do it asap
Backpressure should be turned on on both staging and prod, so it's worth running a similar test again.
I can still reproduce this. My hunch is lock starvation in the pageserver, but needs to be investigated.
As noted in the comment on the backpressure settings, WAL replay speed is about 10 MB/sec. The current default for `max_replication_write_lag` is 500 MB, which should give a maximal wait-for-LSN time of about 50 sec (< 60 sec, the timeout). But with a safekeeper on my notebook the actual replay speed is slower, so a 500 MB write lag really does lead to timeout expiration and thus read errors. So to avoid noticeable delays (> 1 second) we should not use a write lag larger than 10 MB, at least in production. I can create a PR changing the default value in compute.rs, but the console has its own limits...
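The sizing argument above can be sketched as a quick calculation. This is only an illustration of the arithmetic in this comment; the 10 MB/sec replay speed and 60 sec timeout are the rough figures quoted in the thread, not measured constants.

```python
# Back-pressure sizing sketch (assumption: figures are the rough
# numbers quoted in this thread, not measured constants).

REPLAY_SPEED_MB_S = 10   # approximate WAL replay speed
TIMEOUT_S = 60           # pageserver wait-for-LSN timeout

def max_wait_s(write_lag_mb, replay_speed_mb_s=REPLAY_SPEED_MB_S):
    """Worst-case time a reader may wait for replay to catch up."""
    return write_lag_mb / replay_speed_mb_s

# Current default: 500 MB of allowed write lag -> waits up to 50 s,
# uncomfortably close to the 60 s timeout (and over it if replay is
# any slower than assumed).
assert max_wait_s(500) == 50.0
assert max_wait_s(500) < TIMEOUT_S

# To keep delays under ~1 s, write lag must not exceed ~10 MB.
assert max_wait_s(10) <= 1.0
```

If replay is slower than 10 MB/sec, as observed on the notebook setup, the 50 sec estimate is optimistic and the timeout is exceeded, which matches the observed read errors.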
Could you help reproduce this and propose a solution, @MMeent?
@jcsp is this still the case after the batch ingestion contributor PR?
The original report is from almost two years ago, so I don't think we can say much without re-testing this.