Neon failed to restart after disk space exhaustion

Open knizhnik opened this issue 2 years ago • 1 comments

Steps to reproduce

Upload data to Neon until disk space is exhausted (tmpfs can be used for this). Then free some space and try to restart the server.

Expected result

The server restarts normally.

Actual result

2022-09-19 18:27:34.527 GMT [143496] LOG: [ZENITH] found 'zenith.signal' file. setting prev LSN to 0/0
2022-09-19 18:27:34.527 GMT [143496] LOG: database system was shut down at 2022-09-19 16:47:44 GMT
2022-09-19 18:27:34.527 GMT [143496] LOG: starting with zenith basebackup at LSN 18/E3BABFD8, prev 0/0
2022-09-19 18:27:34.527 GMT [143496] FATAL: cannot start in read-write mode from this base backup
2022-09-19 18:27:34.527 GMT [143495] LOG: startup process (PID 143496) exited with exit code 1
2022-09-19 18:27:34.527 GMT [143495] LOG: aborting startup due to startup process failure
2022-09-19 18:27:34.528 GMT [143495] LOG: database system is shut down

zenith.signal file contains PREV LSN: invalid
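
For quick diagnosis, the value recorded in zenith.signal can be checked with a small script. This is only a sketch: the single content confirmed in this report is the line `PREV LSN: invalid`; the hex `hi/lo` form accepted below is an assumption modeled on the LSNs printed in the log, and `parse_prev_lsn` is a hypothetical helper, not part of Neon.

```python
import re
from typing import Optional

# Assumed format: one "PREV LSN: <value>" line. Only the value "invalid"
# is confirmed by this report; the hex "hi/lo" form is an assumption.
_PREV_LSN_RE = re.compile(r"^PREV LSN:\s*(\S+)\s*$", re.MULTILINE)
_HEX_LSN_RE = re.compile(r"^([0-9A-Fa-f]+)/([0-9A-Fa-f]+)$")


def parse_prev_lsn(signal_text: str) -> Optional[int]:
    """Return prev LSN as a 64-bit integer, or None if no usable LSN is present."""
    match = _PREV_LSN_RE.search(signal_text)
    if match is None:
        return None  # no "PREV LSN:" line at all
    lsn = _HEX_LSN_RE.match(match.group(1))
    if lsn is None:
        return None  # a non-numeric value such as "invalid"
    return (int(lsn.group(1), 16) << 32) | int(lsn.group(2), 16)
```

A `None` result for the file from this report would match the `prev 0/0` shown in the startup log.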

Environment

EC2 node with Debian Linux

Logs, links

knizhnik, Sep 20 '22 07:09

Prev LSN tracking is tricky, inconvenient, and error-prone. See the discussion here: https://app.slack.com/client/T026T3BRN0P/C033RQ5SPDH/thread/C033RQ5SPDH-1657308972.924859?cdn_fallback=1

If we simply use ControlFile->checkPointCopy.redo - 8 as the prev LSN and do not set zenithWriteOk = false, then the server starts normally.
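
To make the `redo - 8` arithmetic concrete, here is a small sketch that operates on the textual `hi/lo` LSN form printed in the logs. The helper names are hypothetical, and the sample input is the basebackup LSN from the log above, used purely as an illustration; the actual redo pointer is not shown in this report.

```python
def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '18/E3BABFD8' to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(lsn: int) -> str:
    """Format a 64-bit LSN back into the 'hi/lo' hexadecimal form."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


def prev_lsn_from_redo(redo: str, offset: int = 8) -> str:
    """Apply the proposed workaround: prev LSN = redo - offset."""
    value = parse_lsn(redo) - offset
    if value < 0:
        raise ValueError("redo LSN is smaller than the offset")
    return format_lsn(value)


print(prev_lsn_from_redo("18/E3BABFD8"))  # prints 18/E3BABFD0
```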

knizhnik, Sep 20 '22 07:09

IIRC, we don't support even graceful shutdown and restart of the compute node. You always need to fetch a new basebackup and re-initialize.

hlinnaka, Oct 01 '22 08:10

But I expected that neon_local pg start main actually does this (restores the compute node from a backup), doesn't it? Please note that in my test Neon is launched locally on a single server, so disk space exhaustion affects everything: the compute node, pageserver, and safekeepers.

I just want to repeat the sequence of actions:

  1. Data is inserted in database
  2. Disk space exhausted
  3. neon_local stop
  4. Free some space on the disk
  5. neon_local start
  6. neon_local pg start main
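
The command-line part of this sequence could be scripted roughly as follows. This is a sketch, not a tested reproduction: only the three `neon_local` invocations come from the steps above, while `run_steps` and its `dry_run` flag are hypothetical. Loading data, exhausting the disk, and freeing space (steps 1, 2, and 4) depend on the environment and are left to the caller.

```python
import subprocess
from typing import List, Sequence

# Commands taken from steps 3, 5, and 6 above.
STOP: List[str] = ["neon_local", "stop"]
RESTART: List[List[str]] = [
    ["neon_local", "start"],
    ["neon_local", "pg", "start", "main"],
]


def run_steps(steps: Sequence[Sequence[str]], dry_run: bool = True) -> List[str]:
    """Handle each command in order and return the command lines processed."""
    handled: List[str] = []
    for argv in steps:
        line = " ".join(argv)
        handled.append(line)
        if dry_run:
            print("would run:", line)
        else:
            # check=True raises CalledProcessError and stops at the first failure.
            subprocess.run(list(argv), check=True)
    return handled
```

With the default `dry_run=True` the commands are only printed; call `run_steps([STOP])`, free disk space, then `run_steps(RESTART, dry_run=False)` to execute them for real.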

knizhnik, Oct 01 '22 11:10