pageserver: `sync` on startup (eliminate bug class of unclear durability after `abort()`)

Open problame opened this issue 1 year ago • 1 comments

If we abort() because of some filesystem error, systemd restarts the pageserver. We might then read data from the file system and make decisions based on those reads, incorrectly assuming the data is durable, "because we just read it".

A simple mechanism to prevent this mistake is to sync() the /storage partition during pageserver startup, before we read from it.

If the sync() fails, we abort() (again).

Note that this issue generally doesn't matter that much nowadays because we treat remote storage as the absolute authority (#5198). Note, however, that fsync() between writes to the same file is still important, because we rely only on file length rather than on content checksums.
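To illustrate the fsync() point, here is a minimal sketch in Python (not the pageserver's actual Rust code). It assumes a reader that treats the file's length as the only indicator of how much content is valid, so each chunk is made durable before the next write extends the file. The helper name `append_durably` is hypothetical.

```python
import os


def append_durably(path: str, chunks: list[bytes]) -> int:
    """Append chunks one at a time, calling fsync() after each write.

    Returns the final file length in bytes.
    """
    with open(path, "ab") as f:
        for chunk in chunks:
            f.write(chunk)
            # Push Python's userspace buffer to the kernel ...
            f.flush()
            # ... then ask the kernel to persist it before the next write.
            os.fsync(f.fileno())
    return os.path.getsize(path)
```

Skipping the per-write fsync() would be cheaper, but without checksums there would be no way to notice, after a crash, that a length was persisted while the bytes it covers were not.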

### Tasks
- [ ] https://github.com/neondatabase/neon/pull/8835
- [ ] wait for rollout
- [ ] observe log output

Mar 01 '24 14:03 problame

Calling sync before cleaning up ephemeral layers is somewhat wasteful.
However, ephemeral layer cleanup happens relatively late in startup, after loading timeline metadata.
This should only be a problem if Linux were leaving really large amounts of data in the page cache without draining it to disk in the background, which is atypical.
We will syncfs() on startup and log how long the sync took, so that we have visibility if it's slow.
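A rough sketch of that startup step, written in Python for illustration rather than taken from the pageserver's Rust code. It assumes Linux, where libc exports `syncfs(2)`, and it calls the function through `ctypes` because Python's `os` module has no wrapper for it. The function names and the exact log line layout are assumptions.

```python
import ctypes
import os
import time

# Assumption: Linux, where the C library exports syncfs(2).
_libc = ctypes.CDLL(None, use_errno=True)


def syncfs_timed(path: str) -> int:
    """syncfs() the filesystem containing `path`; return elapsed milliseconds.

    Raises OSError if syncfs(2) reports an error.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        started = time.monotonic()
        if _libc.syncfs(fd) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
        return int((time.monotonic() - started) * 1000)
    finally:
        os.close(fd)


def sync_on_startup(path: str) -> None:
    """Make the storage filesystem durable before anything reads from it."""
    try:
        elapsed_ms = syncfs_timed(path)
    except OSError as err:
        # Durability is unknown, so give up and let the supervisor restart us.
        print(f"syncfs failed for {path}: {err}")
        os.abort()
    # Hypothetical log line; only the elapsed_ms=<n> field is what matters.
    print(f"made tenant directory contents durable elapsed_ms={elapsed_ms}")
```

Calling `sync_on_startup("/storage")` before the first read would cover the case described in the issue. Logging the elapsed time is what makes a slow sync visible afterwards.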

Mar 21 '24 12:03 jcsp

Query

sum_over_time({neon_service="pageserver"} |= `made tenant directory contents durable` | regexp `elapsed_ms=(?P<elapsed_ms>\d+)` |unwrap elapsed_ms [7d])

Executed at time of posting, after all regions had been deployed with this week's release.

Top 3: 251ms, 161ms, 95ms
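For anyone without access to the log store, the same extraction can be approximated offline. The sketch below reuses the marker string and the `elapsed_ms` regex from the query above. The function name is hypothetical, and it assumes the log lines are available as plain text.

```python
import re

MARKER = "made tenant directory contents durable"
ELAPSED_RE = re.compile(r"elapsed_ms=(?P<elapsed_ms>\d+)")


def top_elapsed_ms(lines, n=3):
    """Return the n largest elapsed_ms values among lines containing MARKER."""
    values = []
    for line in lines:
        if MARKER not in line:
            continue  # same filter as the |= stage of the query
        match = ELAPSED_RE.search(line)
        if match:
            values.append(int(match.group("elapsed_ms")))
    return sorted(values, reverse=True)[:n]
```

Unlike `sum_over_time`, this keeps the individual samples, so the slowest startups are not averaged away.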

Full query for posterity

Sep 05 '24 10:09 problame