Neon failed to restart after disk space exhaustion
Steps to reproduce
Upload data to Neon until disk space is exhausted (tmpfs can be used for this). Then free some space and try to restart the server.
Expected result
The server restarts normally.
Actual result
2022-09-19 18:27:34.527 GMT [143496] LOG: [ZENITH] found 'zenith.signal' file. setting prev LSN to 0/0
2022-09-19 18:27:34.527 GMT [143496] LOG: database system was shut down at 2022-09-19 16:47:44 GMT
2022-09-19 18:27:34.527 GMT [143496] LOG: starting with zenith basebackup at LSN 18/E3BABFD8, prev 0/0
2022-09-19 18:27:34.527 GMT [143496] FATAL: cannot start in read-write mode from this base backup
2022-09-19 18:27:34.527 GMT [143495] LOG: startup process (PID 143496) exited with exit code 1
2022-09-19 18:27:34.527 GMT [143495] LOG: aborting startup due to startup process failure
2022-09-19 18:27:34.528 GMT [143495] LOG: database system is shut down

The zenith.signal file contains "PREV LSN: invalid".
Environment
EC2 node with Debian Linux
Logs, links
Prev LSN tracking is very tricky, inconvenient, and error-prone.
See discussion here:
https://app.slack.com/client/T026T3BRN0P/C033RQ5SPDH/thread/C033RQ5SPDH-1657308972.924859?cdn_fallback=1
If we just use ControlFile->checkPointCopy.redo - 8 as prev LSN and do not set zenithWriteOk = false, then the server starts normally.
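For illustration, the "redo - 8" arithmetic can be sketched in Python. PostgreSQL prints an LSN as two hexadecimal 32-bit halves separated by a slash. The helper names below are hypothetical, and the input is the basebackup LSN from the log above, used only as a stand-in for ControlFile->checkPointCopy.redo, whose actual value is not shown in this report.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN in PostgreSQL's 'hi/lo' hex notation to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value: int) -> str:
    """Convert a 64-bit integer back to 'hi/lo' hex notation."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


# Assumption: the basebackup LSN from the log stands in for the redo pointer.
redo = parse_lsn("18/E3BABFD8")
prev = redo - 8  # the workaround proposed above: use redo - 8 as prev LSN
print(format_lsn(prev))  # prints 18/E3BABFD0
```

This only shows the pointer arithmetic. Whether the 8 bytes before the redo pointer always hold a valid record boundary is exactly the part that makes prev LSN tracking error-prone.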
IIRC, we don't even support graceful shutdown and restart of the compute node. You always need to fetch a new basebackup and re-initialize.
But I expected that neon_local pg start main actually does that (restores the compute node from a backup), doesn't it?
Please note that in my test Neon is launched locally on a single server, so disk space exhaustion affects everything: the compute node, the pageserver, and the safekeepers.
I just want to repeat the sequence of actions:
- Data is inserted into the database
- Disk space is exhausted
- neon_local stop
- Free some space on the disk
- neon_local start
- neon_local pg start main
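The restart half of that sequence could be scripted roughly as follows. This is a minimal sketch, not a tested reproducer: the function name and the free_space callback are hypothetical, and only the three neon_local commands listed above are taken from this report. It assumes neon_local is on PATH and that the disk has already been filled.

```python
import subprocess


def restart_after_disk_full(free_space, runner=subprocess.run):
    """Stop Neon, free disk space, then start Neon and the 'main' compute node.

    free_space: hypothetical callback that deletes or unmounts whatever
                filled the disk (for example, files on a tmpfs).
    runner:     callable with the subprocess.run interface; replaceable
                for dry runs.
    Returns a list of (command, return code) pairs, one per command run.
    """
    results = []

    def run(cmd):
        completed = runner(cmd, check=False)
        results.append((cmd, completed.returncode))

    run(["neon_local", "stop"])
    free_space()
    run(["neon_local", "start"])
    run(["neon_local", "pg", "start", "main"])
    return results


# Example call (commented out because it needs a local Neon setup):
# for cmd, code in restart_after_disk_full(lambda: None):
#     print(" ".join(cmd), "->", code)
```

A non-zero return code from the last command would correspond to the startup failure shown under "Actual result".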