neon icon indicating copy to clipboard operation
neon copied to clipboard

pageserver: Too early layer flushing

Open koivunej opened this issue 1 year ago • 7 comments

Because we do in-mem layer freezing after: https://github.com/neondatabase/neon/blob/db89b13aaa45266227b89884490c11e10abb8054/pageserver/src/tenant/timeline.rs#L1053-L1055

It can be that we end up creating flushing a layer after only one batch in the situation:

  1. endpoint write activity
  2. flush after 10min (600s)
  3. endpoint receives only read access or no activity, does no autovacuum
    • postgres writes to WAL running transaction info, but pageserver does not save that => distance == 0
    • can happen when the endpoint is configured not to suspend
  4. next day: endpoint write activity
  5. walreceiver_connection.rs batches the writes caused by ingestion
  6. we flush right away, because (the conditions are OR'd):
    • distance < checkpoint_distance most likely (unless there were a lot of read-only activity)
    • (size limit is not reached)
    • distance > 0 and time is way past 600s

Instead what we should do is to count the seconds from this first accepted ingested batch, but sadly that is kept a record over at Timeline not walreceiver.

This is something we probably talked with @problame last week, but I don't think either of us has a bug open on.

koivunej avatar Feb 05 '24 12:02 koivunej

I believe this was fixed by https://github.com/neondatabase/neon/pull/6661. @koivunej do you agree?

VladLazar avatar Feb 21 '24 18:02 VladLazar

Yes, forgot to link it with the PR... Well, I'll check in tomorrow actually.

koivunej avatar Feb 21 '24 20:02 koivunej

The condition after #6661 are:

  1. distance from last is enough
  2. size would be greater than limit
  3. non-zero distance AND enough time

As such I don't see how the above could fix this issue, assuming a situation:

  1. freeze or OpenLayerAction::Roll
    • write Timeline::last_freeze_at
  2. quiet for more than 600s
    • TimelineWriterState is dropped due to safekeepers reporting "caught up"
  3. basebackup
  4. wal is produced again
    • new sk connection
    • new connection => no cached state => OpenLayerAction::Open
    • we read + cache the Timeline::last_freeze_at
    • we InMemoryLayer::put_value the new value(s)
  5. on the next value OpenLayerAction::Roll because
    • distance > 0 -- unsure if this is non-zero only after the first batch?
    • timeout has already passed

When the distance eventually becomes non-zero at the end of first batch (?) on the second batch we will find that distance > 0 and timeout is way past. So I think the original problem still stands.

How to fix it in my mind, compare first write to an open inmemory layer instead of last flush?

koivunej avatar Feb 22 '24 12:02 koivunej

Yeah, I agree. The issue still stands.

How to fix it in my mind, compare first write to an open inmemory layer instead of last flush?

Not sure I understand the suggestion, but I also don't know what behaviour we want here. If we actually wanted to respect the checkpoint timeout, then we'd have a background task that wakes periodically and requests a flush if the last freeze was too long ago.

VladLazar avatar Feb 22 '24 13:02 VladLazar

I should had edited that :)

Yeah, I agree. The issue still stands.

How to fix it in my mind, compare first write to an open inmemory layer instead of last flush?

Not sure I understand the suggestion, but I also don't know what behaviour we want here. If we actually wanted to respect the checkpoint timeout, then we'd have a background task that wakes periodically and requests a flush if the last freeze was too long ago.

So instead of comparing to when did we last flush, we should compare to when was the first write to the in-memory layer written at (or, when was it opened). This would be a trivial fix, but testing it is a bit more involved.. Perhaps just a delay would work.

koivunej avatar Feb 22 '24 14:02 koivunej

We discussed this and concluded that we need a background task to periodically do the check. Where "check" is:

compare to when was the first write to the in-memory layer written at (or, when was it opened) [with the checkpoint timeout]

VladLazar avatar Feb 22 '24 16:02 VladLazar

If we actually wanted to respect the checkpoint timeout, then we'd have a background task that wakes periodically and requests a flush if the last freeze was too long ago.

We discussed this also. As far as implementation goes, perhaps this timing should be in walreceiver even though it seems like an odd place to put it. If we were to push it to whatever spawning loop, we would have to poll for the state of the open layer, or send a message from walreceiver.

walreceiver is already handling the reconnections to safekeepers, so it might be more natural place.

Realized we might in the communication between sk -> ps there might be a way to send like zero sized payload, as to remind about this. I did not check if there is one.

koivunej avatar Feb 23 '24 10:02 koivunej

We now have background layer rolling, so just need to adjust the logic for time-based rolling. Tracking that in https://github.com/neondatabase/neon/issues/7241

jcsp avatar Apr 05 '24 08:04 jcsp