
WAL records that are no-ops cause timelines to remain active

Open problame opened this issue 1 year ago • 7 comments

@ars, @Joonas Koivunen and I just discovered that there are cases where timelines remain active in safekeepers because of a never-closing gap between the safekeeper's commit_lsn and the pageserver's remote_consistent_lsn.

This became apparent when analyzing why initial logical size calculation tasks are spawned in Pageservers during storage deployment, using this query: ca. 75% of the initial logical size calculations are initiated by WalReceiverConnectionHandler, which doesn't make sense because we know that far fewer than 75% of tenants are truly active[^explainer_initial_logical_size_calculation].

Other graphs we looked at during the analysis (all covering the exact same time frame):

  • live connection count for walreceiver conns between PS and SK (gauge): Link
  • number of walreceiver connection switches (better than the above gauge): Link
  • one of the safekeepers in that region: gauge of active timelines: Link
  • one of the safekeeper's logs, logging timeline state transitions: Link

[^explainer_initial_logical_size_calculation]: initial logical size calculation is initiated either from walreceiver connection handler, or by consumption metrics. See also #5955 .

Repro

You can repro it locally quite easily:

neon_local init
neon_local start
neon_local tenant create --set-default
neon_local endpoint create
neon_local endpoint start main
neon_local endpoint stop main
neon_local stop
neon_local start
# safekeeper is now at a commit_lsn > pageserver's remote_consistent_lsn
# pageserver will never flush a layer

We did some digging with trace-level logging:

RUST_LOG=pageserver::walrecord=trace,pageserver::tenant::timeline::walreceiver=trace,pageserver::tenant::storage_layer::inmemory_layer=trace,pageserver::walingest=trace,debug ./target/debug/neon_local pageserver restart

PS log

2023-11-28T18:16:42.929912Z TRACE wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: received XLogData between 0/149F0D8 and 0/149F1B8
2023-11-28T18:16:42.930139Z TRACE wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: decode_wal_record xl_rmid = 0 xl_info = 96
2023-11-28T18:16:42.930390Z TRACE wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: decode_wal_record xl_rmid = 8 xl_info = 16
2023-11-28T18:16:42.930465Z TRACE wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: decode_wal_record xl_rmid = 0 xl_info = 96
2023-11-28T18:16:42.930496Z TRACE wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: decode_wal_record xl_rmid = 8 xl_info = 16
2023-11-28T18:16:42.930523Z  INFO wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: caught up at LSN 0/149F1B8

We would have expected to see the trace-level logging from https://github.com/neondatabase/neon/blob/dc725672882e6c0ae2e5afbc87ac00829a4d8ffe/pageserver/src/tenant/storage_layer/inmemory_layer.rs#L260

but didn't.

Analysis / Interpretation Of The Logs

These WAL records are no-ops to Pageservers: we don't put() any key-value pair into the Timeline in response to them, so no open layer is created, and check_checkpoint_distance() has nothing to flush.
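
To make that concrete, here's a minimal, self-contained sketch of the ingest path (toy types, not the real pageserver Timeline/InMemoryLayer API) showing why a stream of no-op records never advances disk_consistent_lsn:

```rust
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
struct Lsn(u64);

struct InMemoryLayer {
    end_lsn: Lsn,
}

struct Timeline {
    open_layer: Option<InMemoryLayer>,
    disk_consistent_lsn: Lsn,
}

impl Timeline {
    fn ingest_record(&mut self, lsn: Lsn, key_values: &[(u64, Vec<u8>)]) {
        if key_values.is_empty() {
            // No-op record: nothing is put(), no open layer gets created,
            // so check_checkpoint_distance() has nothing to flush.
            return;
        }
        let layer = self.open_layer.get_or_insert(InMemoryLayer { end_lsn: lsn });
        layer.end_lsn = lsn;
        // ...put() the key/value pairs into the layer...
    }

    fn flush_open_layer(&mut self) {
        if let Some(layer) = self.open_layer.take() {
            // Only a flushed layer advances disk_consistent_lsn, which after
            // upload becomes remote_consistent_lsn.
            self.disk_consistent_lsn = layer.end_lsn;
        }
    }
}

fn main() {
    let mut tl = Timeline {
        open_layer: None,
        disk_consistent_lsn: Lsn(0),
    };
    tl.ingest_record(Lsn(0x149F1B8), &[]); // no-op record, like in the log above
    tl.flush_open_layer();
    // disk_consistent_lsn never moved, so the safekeeper's commit_lsn stays
    // ahead of remote_consistent_lsn forever and the timeline remains active.
    assert_eq!(tl.disk_consistent_lsn, Lsn(0));
}
```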

Fixing This

TBD

Related tasks

# Understand The Scope Of The Issue
- [ ] https://github.com/neondatabase/neon/pull/6002
- [ ] metrics in the pageserver to quantify the condition in prod
# Better Observability In Safekeepers
- [ ] have counters for state transitions of sk timeline states instead of gauge `safekeeper_timeline_active`

problame avatar Nov 28 '23 18:11 problame

I guess we need to create an empty delta layer to cover those records..

hlinnaka avatar Nov 30 '23 16:11 hlinnaka

If we already support filtering of WAL records on Safekeepers (for sharding), then why not filter these no-op WAL records and not send them to the PS at all? We definitely need them for replication, but replication is done by safekeepers...

knizhnik avatar Dec 03 '23 18:12 knizhnik

IDK about @hlinnaka's idea with the empty delta layer, as it impacts compaction.

I kinda like Konstantin's idea of filtering out no-op records in SK.

I'm reading between the lines that Konstantin's idea implies that SKs will keep track of the last non-no-op LSN, and the criterion for "PS has caught up" becomes "last non-no-op LSN == remote_consistent_lsn".
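
A minimal sketch of that criterion, with hypothetical names (not the safekeeper's actual timeline code):

```rust
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
struct Lsn(u64);

struct SkTimelineView {
    commit_lsn: Lsn,
    last_non_noop_lsn: Lsn,
    remote_consistent_lsn: Lsn, // as reported back by the pageserver
}

impl SkTimelineView {
    fn pageserver_caught_up(&self) -> bool {
        // Caught up once the PS has persisted everything that actually
        // matters to it, even if commit_lsn is further ahead because the
        // tail consists only of no-op records.
        self.remote_consistent_lsn >= self.last_non_noop_lsn
    }
}

fn main() {
    let tl = SkTimelineView {
        commit_lsn: Lsn(0x149F1B8),
        last_non_noop_lsn: Lsn(0x149F0D8),
        remote_consistent_lsn: Lsn(0x149F0D8),
    };
    // The gap to commit_lsn still exists, but the timeline could go inactive.
    assert!(tl.commit_lsn > tl.remote_consistent_lsn);
    assert!(tl.pageserver_caught_up());
}
```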

I think if we go with filtering in SKs, we should also have a corresponding change in PSes that makes noise or outright rejects no-op records, so, we'd be alerted if we introduce any regressions. PR https://github.com/neondatabase/neon/pull/6002 is a step in that direction.

problame avatar Dec 04 '23 14:12 problame

@hlinnaka and I decided to pause this effort until https://github.com/neondatabase/neon/issues/5993 is fixed and shipped. That will give us an idea of how much this issue contributes to the 75% number mentioned in the issue description.

problame avatar Dec 04 '23 14:12 problame

Triage notes:

  • Christian's latest update is current

jcsp avatar Dec 15 '23 10:12 jcsp

#5993 has been rolled out and its impact has been reviewed.

I think we don't have a direct metric to measure the impact of this bug.

Instead of using an indirect metric, like comparing pageserver-initiated walreceiver connections with safekeeper-active timelines during restart, let's have a direct metric.

Relatively low effort proposal: push https://github.com/neondatabase/neon/pull/6002 over the finish line (it's a good change anyway), then add a counter metric for IngestRecordOutcome::Noop.

Watch that counter during restart.

In order to get the number of affected timelines, have a counter that is only incremented once per timeline struct.
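
A hedged sketch of those two counters; the metric names are made up for illustration, and this uses the prometheus crate directly rather than the pageserver's own metrics wrapper:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

use once_cell::sync::Lazy;
use prometheus::{register_int_counter, IntCounter};

// Counts every ingested WAL record that turned out to be a no-op
// (hypothetical metric name).
static WAL_INGEST_NOOP_RECORDS: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!(
        "pageserver_wal_ingest_noop_records_total",
        "Number of ingested WAL records that were no-ops"
    )
    .unwrap()
});

// Counts timelines that saw at least one no-op record (hypothetical metric name).
static TIMELINES_WITH_NOOP_RECORDS: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!(
        "pageserver_timelines_with_noop_records_total",
        "Number of timelines that ingested at least one no-op record"
    )
    .unwrap()
});

struct Timeline {
    // Flipped the first time this timeline ingests a no-op record.
    saw_noop_record: AtomicBool,
}

impl Timeline {
    fn on_noop_record(&self) {
        WAL_INGEST_NOOP_RECORDS.inc();
        // Increment the per-timeline counter at most once per Timeline struct.
        if !self.saw_noop_record.swap(true, Ordering::Relaxed) {
            TIMELINES_WITH_NOOP_RECORDS.inc();
        }
    }
}

fn main() {
    let tl = Timeline { saw_noop_record: AtomicBool::new(false) };
    tl.on_noop_record();
    tl.on_noop_record();
    assert_eq!(WAL_INGEST_NOOP_RECORDS.get(), 2);
    assert_eq!(TIMELINES_WITH_NOOP_RECORDS.get(), 1);
}
```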

problame avatar Jan 03 '24 18:01 problame

It turns out we have the pageserver_wal_ingest_records_received metric now. While it counts all ingested records rather than just ::Noop, we can still use it to see whether there's a peak in ingestions during PS startup.

Dashboard pinned to this week's deploy in us-west-2

=> Eyeballing it for us-west-2, there are no abnormal spikes during PS restart.

My takeaway from that is that this issue isn't widespread.


With regard to this issue's relevance for #5479, remember that the main worry was that this bug would trigger a lot of initial logical size calculations.

Looking at the increase of metric pageserver_initial_logical_size_timelines_where_walreceiver_got_approximate_size after restarts provides a conservative approximation for how bad the remaining impact is. Why is it a conservative approximation?

  • PS startup triggers initial size calculations eagerly in the background.
  • So, every initial size calculation that PS startup does can't be triggered by truly SK-active timelines or timelines that are SK-active because of this issue's bug.

The above dashboard includes a row that shows the numbers. There are fewer than 300 occurrences total per instance in us-west-2, and these 300 are spread over multiple minutes. So, in summary: quite a low overall rate of these calculations compared to the total number of initial logical size calculations we do during startup (=> yet another row in that dashboard).

Should be safe to ship #5479 without fixing this bug.

problame avatar Jan 04 '24 15:01 problame

Some notes from the April storage offsite:

  • @arssher sees 2x more active timelines than there are connections from compute (=> these are either due to S3 uploads or due to this issue, needs more analysis)

  • it is probably sufficient to advance remote_consistent_lsn even for no-op records

    • basically needs InMemoryLayer::put_noop() that advances the layer's end_lsn; then when that InMemoryLayer gets frozen & flushed, the end_lsn will eventually end up as IndexPart::disk_consistent_lsn, and that's remote_consistent_lsn moving forward (a minimal sketch follows after these notes)
  • @jcsp tunes in: sharding solved this, but it's behind a check that only applies to sharded tenants

    • for the case where there IS an InMemoryLayer open, it's properly solved
    • for the case where there is NO InMemoryLayer, it's solved using a periodic tick
      • this only updates the local metadata but doesn't schedule an upload
      • there is another background thing that schedules an index upload at some interval
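
A minimal sketch of the put_noop() idea from the notes above (hypothetical method, not the real InMemoryLayer API): a no-op record only advances the open layer's end_lsn, so freezing & flushing the layer carries that LSN into IndexPart::disk_consistent_lsn and, after upload, into remote_consistent_lsn.

```rust
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
struct Lsn(u64);

struct InMemoryLayer {
    end_lsn: Lsn,
}

impl InMemoryLayer {
    /// Hypothetical: remember that WAL up to `lsn` has been processed even
    /// though it produced no key/value writes.
    fn put_noop(&mut self, lsn: Lsn) {
        if lsn > self.end_lsn {
            self.end_lsn = lsn;
        }
    }
}

fn main() {
    let mut layer = InMemoryLayer { end_lsn: Lsn(0x149F0D8) };
    layer.put_noop(Lsn(0x149F1B8)); // no-op record still moves end_lsn
    // When this layer is frozen & flushed, end_lsn becomes disk_consistent_lsn.
    assert_eq!(layer.end_lsn, Lsn(0x149F1B8));
}
```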

problame avatar Apr 12 '24 12:04 problame

Changes that upload the remote index proactively in some cases -- this was for sharding and might not cover every case where a timeline's ingest goes idle: https://github.com/neondatabase/neon/blob/221414de4b0260056e0961528d46c5141825a0a0/pageserver/src/tenant/timeline.rs#L1202-L1226

jcsp avatar Apr 12 '24 12:04 jcsp

Currently, advancing the disk_consistent_lsn is behind a condition checking whether we're sharded, which can be safely removed (the condition was only there to de-risk the change): https://github.com/neondatabase/neon/commit/ac7fc6110bba250f17b494c604b717cf69e09ef1
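
A hedged sketch of what removing that guard amounts to (hypothetical shape, not the code in the linked commit): the periodic tick advances disk_consistent_lsn for every tenant, not just sharded ones, when ingest is idle.

```rust
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
struct Lsn(u64);

fn on_periodic_tick(is_sharded: bool, last_record_lsn: Lsn, disk_consistent_lsn: &mut Lsn) {
    // Previously something like `if !is_sharded { return; }` sat here, purely
    // to de-risk the sharding rollout; it isn't needed for correctness.
    let _ = is_sharded;
    if last_record_lsn > *disk_consistent_lsn {
        *disk_consistent_lsn = last_record_lsn;
        // ...then schedule an index upload so remote_consistent_lsn follows...
    }
}

fn main() {
    let mut disk_consistent_lsn = Lsn(0x149F0D8);
    on_periodic_tick(false, Lsn(0x149F1B8), &mut disk_consistent_lsn);
    assert_eq!(disk_consistent_lsn, Lsn(0x149F1B8));
}
```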

jcsp avatar Apr 12 '24 12:04 jcsp