neon
neon copied to clipboard
pageserver: Too early layer flushing
Because we do in-mem layer freezing after: https://github.com/neondatabase/neon/blob/db89b13aaa45266227b89884490c11e10abb8054/pageserver/src/tenant/timeline.rs#L1053-L1055
It can be that we end up creating flushing a layer after only one batch in the situation:
- endpoint write activity
- flush after 10min (600s)
- endpoint receives only read access or no activity, does no autovacuum
- postgres writes to WAL running transaction info, but pageserver does not save that =>
distance == 0 - can happen when the endpoint is configured not to suspend
- postgres writes to WAL running transaction info, but pageserver does not save that =>
- next day: endpoint write activity
- walreceiver_connection.rs batches the writes caused by ingestion
- we flush right away, because (the conditions are OR'd):
- distance < checkpoint_distance most likely (unless there were a lot of read-only activity)
- (size limit is not reached)
distance > 0and time is way past 600s
Instead what we should do is to count the seconds from this first accepted ingested batch, but sadly that is kept a record over at Timeline not walreceiver.
This is something we probably talked with @problame last week, but I don't think either of us has a bug open on.
I believe this was fixed by https://github.com/neondatabase/neon/pull/6661. @koivunej do you agree?
Yes, forgot to link it with the PR... Well, I'll check in tomorrow actually.
The condition after #6661 are:
- distance from last is enough
- size would be greater than limit
- non-zero distance AND enough time
As such I don't see how the above could fix this issue, assuming a situation:
- freeze or OpenLayerAction::Roll
- write
Timeline::last_freeze_at
- write
- quiet for more than 600s
- TimelineWriterState is dropped due to safekeepers reporting "caught up"
- basebackup
- wal is produced again
- new sk connection
- new connection => no cached state => OpenLayerAction::Open
- we read + cache the
Timeline::last_freeze_at - we
InMemoryLayer::put_valuethe new value(s)
- on the next value OpenLayerAction::Roll because
- distance > 0 -- unsure if this is non-zero only after the first batch?
- timeout has already passed
When the distance eventually becomes non-zero at the end of first batch (?) on the second batch we will find that distance > 0 and timeout is way past. So I think the original problem still stands.
How to fix it in my mind, compare first write to an open inmemory layer instead of last flush?
Yeah, I agree. The issue still stands.
How to fix it in my mind, compare first write to an open inmemory layer instead of last flush?
Not sure I understand the suggestion, but I also don't know what behaviour we want here. If we actually wanted to respect the checkpoint timeout, then we'd have a background task that wakes periodically and requests a flush if the last freeze was too long ago.
I should had edited that :)
Yeah, I agree. The issue still stands.
How to fix it in my mind, compare first write to an open inmemory layer instead of last flush?
Not sure I understand the suggestion, but I also don't know what behaviour we want here. If we actually wanted to respect the checkpoint timeout, then we'd have a background task that wakes periodically and requests a flush if the last freeze was too long ago.
So instead of comparing to when did we last flush, we should compare to when was the first write to the in-memory layer written at (or, when was it opened). This would be a trivial fix, but testing it is a bit more involved.. Perhaps just a delay would work.
We discussed this and concluded that we need a background task to periodically do the check. Where "check" is:
compare to when was the first write to the in-memory layer written at (or, when was it opened) [with the checkpoint timeout]
If we actually wanted to respect the checkpoint timeout, then we'd have a background task that wakes periodically and requests a flush if the last freeze was too long ago.
We discussed this also. As far as implementation goes, perhaps this timing should be in walreceiver even though it seems like an odd place to put it. If we were to push it to whatever spawning loop, we would have to poll for the state of the open layer, or send a message from walreceiver.
walreceiver is already handling the reconnections to safekeepers, so it might be more natural place.
Realized we might in the communication between sk -> ps there might be a way to send like zero sized payload, as to remind about this. I did not check if there is one.
We now have background layer rolling, so just need to adjust the logic for time-based rolling. Tracking that in https://github.com/neondatabase/neon/issues/7241