neon pageserver: Too early layer flushing

Because we do in-mem layer freezing after: https://github.com/neondatabase/neon/blob/db89b13aaa45266227b89884490c11e10abb8054/pageserver/src/tenant/timeline.rs#L1053-L1055

It can be that we end up creating flushing a layer after only one batch in the situation:

endpoint write activity
flush after 10min (600s)
endpoint receives only read access or no activity, does no autovacuum
- postgres writes to WAL running transaction info, but pageserver does not save that => distance == 0
- can happen when the endpoint is configured not to suspend
next day: endpoint write activity
walreceiver_connection.rs batches the writes caused by ingestion
we flush right away, because (the conditions are OR'd):
- distance < checkpoint_distance most likely (unless there were a lot of read-only activity)
- (size limit is not reached)
- distance > 0 and time is way past 600s

Instead what we should do is to count the seconds from this first accepted ingested batch, but sadly that is kept a record over at Timeline not walreceiver.

This is something we probably talked with @problame last week, but I don't think either of us has a bug open on.

Feb 05 '24 12:02 koivunej

I believe this was fixed by https://github.com/neondatabase/neon/pull/6661. @koivunej do you agree?

Feb 21 '24 18:02 VladLazar

Yes, forgot to link it with the PR... Well, I'll check in tomorrow actually.

Feb 21 '24 20:02 koivunej

The condition after #6661 are:

distance from last is enough
size would be greater than limit
non-zero distance AND enough time

As such I don't see how the above could fix this issue, assuming a situation:

freeze or OpenLayerAction::Roll
- write Timeline::last_freeze_at
quiet for more than 600s
- TimelineWriterState is dropped due to safekeepers reporting "caught up"
basebackup
wal is produced again
- new sk connection
- new connection => no cached state => OpenLayerAction::Open
- we read + cache the Timeline::last_freeze_at
- we InMemoryLayer::put_value the new value(s)
on the next value OpenLayerAction::Roll because
- distance > 0 -- unsure if this is non-zero only after the first batch?
- timeout has already passed

When the distance eventually becomes non-zero at the end of first batch (?) on the second batch we will find that distance > 0 and timeout is way past. So I think the original problem still stands.

How to fix it in my mind, compare first write to an open inmemory layer instead of last flush?

Feb 22 '24 12:02 koivunej

Yeah, I agree. The issue still stands.

How to fix it in my mind, compare first write to an open inmemory layer instead of last flush?

Not sure I understand the suggestion, but I also don't know what behaviour we want here. If we actually wanted to respect the checkpoint timeout, then we'd have a background task that wakes periodically and requests a flush if the last freeze was too long ago.

Feb 22 '24 13:02 VladLazar

I should had edited that :)

Yeah, I agree. The issue still stands.

How to fix it in my mind, compare first write to an open inmemory layer instead of last flush?

Not sure I understand the suggestion, but I also don't know what behaviour we want here. If we actually wanted to respect the checkpoint timeout, then we'd have a background task that wakes periodically and requests a flush if the last freeze was too long ago.

So instead of comparing to when did we last flush, we should compare to when was the first write to the in-memory layer written at (or, when was it opened). This would be a trivial fix, but testing it is a bit more involved.. Perhaps just a delay would work.

Feb 22 '24 14:02 koivunej

We discussed this and concluded that we need a background task to periodically do the check. Where "check" is:

compare to when was the first write to the in-memory layer written at (or, when was it opened) [with the checkpoint timeout]

Feb 22 '24 16:02 VladLazar

If we actually wanted to respect the checkpoint timeout, then we'd have a background task that wakes periodically and requests a flush if the last freeze was too long ago.

We discussed this also. As far as implementation goes, perhaps this timing should be in walreceiver even though it seems like an odd place to put it. If we were to push it to whatever spawning loop, we would have to poll for the state of the open layer, or send a message from walreceiver.

walreceiver is already handling the reconnections to safekeepers, so it might be more natural place.

Realized we might in the communication between sk -> ps there might be a way to send like zero sized payload, as to remind about this. I did not check if there is one.

Feb 23 '24 10:02 koivunej

We now have background layer rolling, so just need to adjust the logic for time-based rolling. Tracking that in https://github.com/neondatabase/neon/issues/7241

Apr 05 '24 08:04 jcsp

neon neon copied to clipboard

pageserver: Too early layer flushing

neon
neon copied to clipboard