Failed to restore 700gb+ db with wal-e

Open t0k4rt opened this issue 5 years ago • 6 comments

Environment

Linux

Stolon version 0.13

Expected behaviour you didn't see

The cluster is initialized with init mode PITR and wal-e credentials in order to fetch remote partitions and WAL files. The cluster starts as expected.

Unexpected behaviour you saw

PostgreSQL is still importing WAL files, but the keeper throws an error and the restoration fails.

postgres + wal-e logs:

wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
        STRUCTURED: time=2019-07-24T15:35:36.981494-00 pid=31606 action=wal-fetch key=swift://wale-backup/wal_005/000000030000037A000000A3.lzo prefix= seg=000000030000037A000000A3
2019-07-24 15:35:37 UTC LOG:  restored log file "000000030000037A000000A3" from archive

Keeper error:

ERROR	cmd/keeper.go:1163	recovery not finished	{"error": "timeout waiting for db recovery"}
ERROR	cmd/keeper.go:1006	db failed to initialize or resync
ERROR	cmd/keeper.go:641	cannot get configured pg parameters	{"error": "pq: the database system is shutting down"}

I also tried starting a PostgreSQL cluster that is not managed by stolon; it worked as expected and the cluster started.

Steps to reproduce the problem

The database takes more than 5h to restore and occupies more than 700 GB on disk. I think that's the main issue.

t0k4rt avatar Jul 24 '19 16:07 t0k4rt

I saw the new option "DBWaitReadyTimeout" in the latest documentation. Do you know when it will be available?

t0k4rt avatar Jul 25 '19 09:07 t0k4rt

@t0k4rt you should try to increase the cluster spec parameter called syncTimeout (defaults to 30 minutes). Looks like it's not documented (and it'll probably require a better name).

sgotti avatar Jul 25 '19 12:07 sgotti
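
To illustrate the sizing involved: the restore reported in this issue takes more than 5 hours, while syncTimeout defaults to 30 minutes. The sketch below derives a timeout from an observed restore time plus a safety margin and formats it in the Go-style duration notation that the cluster spec uses elsewhere in this thread (for example "2m0s"). The helper names and the safety factor of 2 are illustrative assumptions, not part of stolon.

```python
import json


def go_duration(seconds: int) -> str:
    """Format whole seconds as a Go-style duration string such as '10h0m0s'."""
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    if hours:
        return f"{hours}h{minutes}m{secs}s"
    if minutes:
        return f"{minutes}m{secs}s"
    return f"{secs}s"


def suggested_sync_timeout(observed_restore_seconds: int,
                           safety_factor: float = 2.0) -> str:
    """Hypothetical helper: scale an observed restore time by a safety margin."""
    return go_duration(int(observed_restore_seconds * safety_factor))


# A 5 hour restore with a 2x margin (both values are assumptions).
spec_fragment = {"syncTimeout": suggested_sync_timeout(5 * 3600)}
print(json.dumps(spec_fragment))  # prints {"syncTimeout": "10h0m0s"}
```

The resulting value belongs in the "syncTimeout" field of the cluster spec.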

Thanks a lot ! I'll try that !

t0k4rt avatar Jul 27 '19 19:07 t0k4rt

Hi @sgotti !

> @t0k4rt you should try to increase the cluster spec parameter called syncTimeout (defaults to 30 minutes). Looks like it's not documented (and it'll probably require a better name).

In the long term it makes sense to implement progress monitoring of such background jobs (using external tools such as lsof on the recovery worker) and to apply timeouts only to stuck tasks.

maksm90 avatar Jul 30 '19 14:07 maksm90
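
The progress-based timeout that maksm90 suggests above can be sketched as follows. This is a minimal illustration, not stolon code: the function name, its parameters, and the notion of a "progress marker" (for example the name of the last restored WAL segment) are all assumptions. The wait fails only when the marker stops changing for longer than stall_timeout, so a long but healthy recovery is never interrupted.

```python
import time
from typing import Callable, Hashable


def wait_with_stall_timeout(
    is_done: Callable[[], bool],
    get_progress: Callable[[], Hashable],
    stall_timeout: float,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once is_done() reports completion.

    Return False only if the value returned by get_progress() stays
    unchanged for longer than stall_timeout seconds.
    """
    last_progress = get_progress()
    last_change = clock()
    while True:
        if is_done():
            return True
        current = get_progress()
        if current != last_progress:
            # Progress was observed: restart the stall timer.
            last_progress = current
            last_change = clock()
        elif clock() - last_change > stall_timeout:
            # No progress for too long: treat the task as stuck.
            return False
        sleep(poll_interval)
```

In a real deployment, get_progress would have to read something that changes during recovery, such as the most recent "restored log file" entry in the PostgreSQL log; how to obtain that reliably is left open here.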

Would someone please enhance the Stolon documentation to include the purpose and default value (30 minutes) of syncTimeout in relation to PITR?

Maybe here:

https://github.com/sorintlab/stolon/blob/master/doc/cluster_spec.md
https://github.com/sorintlab/stolon/blob/master/doc/pitr.md
https://github.com/sorintlab/stolon/blob/master/doc/pitr_wal-e.md
https://github.com/sorintlab/stolon/blob/master/doc/pitr_wal-g.md

We also had the unpleasant experience of discovering this timeout: a point-in-time recovery of a full backup plus around 24 hours' worth of WAL took longer than half an hour.

We used a 24 hour timeout like this:

stolonctl --cluster-name primary-postgres-cluster --store-endpoints ... --log-level info --store-backend etcdv3 init '{
     "syncTimeout": "24h",
     "initMode": "pitr",
     "failInterval": "2m0s",
     "synchronousReplication": true,
     "usePgrewind": true,
     "pitrConfig": {
         "dataRestoreCommand": "envdir /etc/wal-e.d/env wal-e backup-fetch %d LATEST",
         "archiveRecoverySettings": {
             "restoreCommand": "envdir /etc/wal-e.d/env wal-e wal-fetch \"%f\" \"%p\"",
             "recoveryTargetSettings": { "recoveryTargetTime": "2019-12-31 01:02:03" }
         }
     },
     "pgParameters": {
         "max_connections": "1000",
         "shared_buffers": "512MB",
         "local_preload_libraries": "...",
         "extwlist.extensions": "..."
     }
}'

johannesboon avatar Dec 04 '19 12:12 johannesboon

@johannesboon Feel free to open an RFE issue to request this to be documented and also a PR to add this to the doc.

sgotti avatar Dec 09 '19 11:12 sgotti