Neon failed to restart after disk space exhaustion

Open knizhnik opened this issue 2 years ago • 1 comments

Steps to reproduce

Upload data to Neon until disk space is exhausted (tmpfs can be used for this). Then free some space and try to restart the server.

Expected result

The server restarts normally.

Actual result

2022-09-19 18:27:34.527 GMT [143496] LOG: [ZENITH] found 'zenith.signal' file. setting prev LSN to 0/0
2022-09-19 18:27:34.527 GMT [143496] LOG: database system was shut down at 2022-09-19 16:47:44 GMT
2022-09-19 18:27:34.527 GMT [143496] LOG: starting with zenith basebackup at LSN 18/E3BABFD8, prev 0/0
2022-09-19 18:27:34.527 GMT [143496] FATAL: cannot start in read-write mode from this base backup
2022-09-19 18:27:34.527 GMT [143495] LOG: startup process (PID 143496) exited with exit code 1
2022-09-19 18:27:34.527 GMT [143495] LOG: aborting startup due to startup process failure
2022-09-19 18:27:34.528 GMT [143495] LOG: database system is shut down

The zenith.signal file contains PREV LSN: invalid

Environment

EC2 node with Debian Linux

Logs, links

Sep 20 '22 07:09 knizhnik

Prev LSN tracking is tricky, inconvenient, and error-prone. See the discussion here:
https://app.slack.com/client/T026T3BRN0P/C033RQ5SPDH/thread/C033RQ5SPDH-1657308972.924859?cdn_fallback=1
If we just use ControlFile->checkPointCopy.redo - 8 as the prev LSN and do not set zenithWriteOk = false, then the server starts normally.
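To make the "redo - 8" arithmetic concrete, here is a minimal Python sketch of the LSN calculation only. It assumes the usual textual LSN form seen in the log (two hexadecimal fields separated by a slash, high 32 bits first). The helper names parse_lsn, format_lsn and guessed_prev_lsn are hypothetical and not part of Neon or PostgreSQL, and the sample value is the basebackup LSN from the log above, used purely as an example input rather than the actual checkPointCopy.redo value.

```python
def parse_lsn(text: str) -> int:
    """Convert a textual LSN 'hi/lo' (two hex fields) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value: int) -> str:
    """Render a 64-bit LSN back into the 'hi/lo' hex form used in the server log."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def guessed_prev_lsn(redo: str, offset: int = 8) -> str:
    """Return redo - offset as text; the offset of 8 is taken from the comment above."""
    return format_lsn(parse_lsn(redo) - offset)


if __name__ == "__main__":
    # Sample input only: the basebackup LSN from the log, not a real redo pointer.
    print(guessed_prev_lsn("18/E3BABFD8"))
```

This only illustrates the pointer arithmetic; whether 8 bytes before the redo pointer is always a valid record start is exactly the kind of question the linked discussion is about.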

Sep 20 '22 07:09 knizhnik

IIRC, we don't even support graceful shutdown and restart of the compute node. You always need to fetch a new basebackup and re-initialize.

Oct 01 '22 08:10 hlinnaka

But I expected that neon_local pg start main actually does this (restores the compute node from a basebackup), doesn't it? Please note that in my test Neon is launched locally on a single server, so disk space exhaustion affects everything: compute node, pageserver and safekeepers.

I just want to repeat the sequence of actions:

1. Data is inserted into the database
2. Disk space is exhausted
3. neon_local stop
4. Free some space on the disk
5. neon_local start
6. neon_local pg start main
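The restart part of this sequence could be scripted roughly as follows. This is a sketch under assumptions: only the three neon_local commands come from the steps above; the function restart_sequence, the run_checked helper and the free_space callback are hypothetical names, and how space is actually freed (for example, deleting a filler file on the tmpfs) is left to the caller.

```python
import subprocess
from typing import Callable, List, Sequence

# Commands copied from the steps above; "main" is the compute node name used there.
STOP = ["neon_local", "stop"]
START = ["neon_local", "start"]
PG_START = ["neon_local", "pg", "start", "main"]


def run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(list(cmd), check=True)


def restart_sequence(
    free_space: Callable[[], None],
    run: Callable[[Sequence[str]], None] = run_checked,
) -> List[List[str]]:
    """Stop Neon, free disk space, then start Neon and the compute node.

    Returns the commands that were passed to `run`, in order.
    """
    issued: List[List[str]] = []

    def step(cmd: Sequence[str]) -> None:
        run(cmd)
        issued.append(list(cmd))

    step(STOP)
    free_space()  # hypothetical callback: the caller decides how space is freed
    step(START)
    step(PG_START)
    return issued
```

Passing a recording function as `run` gives a dry run that shows the order of operations without touching a real installation.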

Oct 01 '22 11:10 knizhnik
