
wal-g backup-push to AWS S3 hangs after `Finished writing part 1.`

Krieke opened this issue on Jun 17, 2020 · 2 comments

wal-g version: 0.2.16
Postgres version: 11

Docker image based on: Alpine 3.11 with glibc-2.31-r0.apk

For the past few days none of the wal-g backups have succeeded. The wal-g backup-push process just hangs endlessly. See the logs below.
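For reference, this is the shape of the invocation that hangs (a minimal sketch; the actual wrapper used in this setup is not shown in the ticket, but PGDATA is /var/lib/postgresql/data per the logs below):

    # full base backup of the data directory, uploaded to the configured S3 prefix
    wal-g backup-push "$PGDATA"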

The backup of the WAL files has been disabled in this setup, since the postgres process would not come up in a healthy state due to the hanging wal-g process.
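For context, WAL archiving with wal-g is normally wired up through two postgresql.conf settings; a minimal sketch of what was disabled here (the exact configuration used in this setup is not shown in the ticket):

    # postgresql.conf: hand each completed WAL segment to wal-g
    archive_mode = on
    archive_command = 'wal-g wal-push %p'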

Stackdump:

bash-5.0$
SIGQUIT: quit
PC=0x9ebf51 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex(0x26fc3c8, 0x80, 0x0, 0x0, 0x0, 0x7fcc00000000, 0xffffffff00000001, 0x100c4ff59, 0x7fffd0053918, 0x99955f, ...)
    /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/sys_linux_amd64.s:535 +0x21
runtime.futexsleep(0x26fc3c8, 0x7fcc00000000, 0xffffffffffffffff)
    /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/os_linux.go:44 +0x46
runtime.notesleep(0x26fc3c8)
    /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/lock_futex.go:151 +0x9f
runtime.stopm()
    /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/proc.go:1928 +0xc0
runtime.findrunnable(0xc000044500, 0x0)
    /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/proc.go:2391 +0x53f
runtime.schedule()
    /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/proc.go:2524 +0x2be
runtime.park_m(0xc000160d80)
    /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/proc.go:2610 +0x9d
runtime.mcall(0x16260f0)
    /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/asm_amd64.s:318 +0x5b

goroutine 1 [semacquire, 3 minutes]:
sync.runtime_Semacquire(0xc000157298)
    /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc000157290)
    /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/sync/waitgroup.go:130 +0x64
github.com/wal-g/wal-g/internal.(*StorageTarBall).AwaitUploads(0xc000288100)
    /home/travis/gopath/src/github.com/wal-g/wal-g/internal/storage_tar_ball.go:61 +0x37
github.com/wal-g/wal-g/internal.(*Bundle).FinishQueue(0xc000576000, 0x21, 0xc00054d928)
    /home/travis/gopath/src/github.com/wal-g/wal-g/internal/bundle.go:163 +0xdd
github.com/wal-g/wal-g/internal.createAndPushBackup(0xc000554090, 0xc000560000, 0x21, 0x1934556, 0x10, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /home/travis/gopath/src/github.com/wal-g/wal-g/internal/backup_push_handler.go:129 +0x6e5
github.com/wal-g/wal-g/internal.HandleBackupPush(0xc000554090, 0x7fffd0055a12, 0x21, 0x100)
    /home/travis/gopath/src/github.com/wal-g/wal-g/internal/backup_push_handler.go:246 +0x256
github.com/wal-g/wal-g/cmd/pg.glob..func4(0x267e6c0, 0xc0003616a0, 0x1, 0x2)
    /home/travis/gopath/src/github.com/wal-g/wal-g/cmd/pg/backup_push.go:27 +0x86
github.com/spf13/cobra.(*Command).execute(0x267e6c0, 0xc000361660, 0x2, 0x2, 0x267e6c0, 0xc000361660)
    /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/cobra/command.go:830 +0x2aa
github.com/spf13/cobra.(*Command).ExecuteC(0x267d7c0, 0x9ca6fa, 0x2691b40, 0xc000000180)
    /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/cobra/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
    /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/cobra/command.go:864
github.com/wal-g/wal-g/cmd/pg.Execute()
    /home/travis/gopath/src/github.com/wal-g/wal-g/cmd/pg/pg.go:29 +0x31
main.main()
    /home/travis/gopath/src/github.com/wal-g/wal-g/main/pg/main.go:8 +0x20

goroutine 6 [syscall, 3 minutes]:
os/signal.signal_recv(0x0)
    /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/sigqueue.go:147 +0x9c
os/signal.loop()
    /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
    /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/os/signal/signal_unix.go:29 +0x41

goroutine 8 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc00037c820)
    /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/go.opencensus.io/stats/view/worker.go:154 +0x100
created by go.opencensus.io/stats/view.init.0
    /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/go.opencensus.io/stats/view/worker.go:32 +0x57

goroutine 41 [select]:
github.com/aws/aws-sdk-go/aws.SleepWithContext(0x1bff6a0, 0xc00003c078, 0x1eaaf1e000, 0x0, 0x0)
    /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/aws/aws-sdk-go/aws/context_sleep.go:16 +0x128
github.com/aws/aws-sdk-go/aws/corehandlers.glob..func5(0xc0000cc900)
    /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/aws/aws-sdk-go/aws/corehandlers/handlers.go:202 +0x142
github.com/aws/aws-sdk-go/aws/request.(*HandlerList).Run(0xc0000ccbc0, 0xc0000cc900)
    /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/aws/aws-sdk-go/aws/request/handlers.go:261 +0x98
github.com/aws/aws-sdk-go/aws/request.(*Request).Send(0xc0000cc900, 0x0, 0x0)
    /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/aws/aws-sdk-go/aws/request/request.go:519 +0x184
github.com/aws/aws-sdk-go/service/s3/s3manager.(*uploader).singlePart(0xc000122000, 0x1be35e0, 0xc002f644b0, 0xc002f64480, 0x0, 0x0, 0x0)
    /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/aws/aws-sdk-go/service/s3/s3manager/upload.go:512 +0x182
github.com/aws/aws-sdk-go/service/s3/s3manager.(*uploader).upload(0xc000122000, 0x0, 0x0, 0x0)
    /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/aws/aws-sdk-go/service/s3/s3manager/upload.go:378 +0x55f
github.com/aws/aws-sdk-go/service/s3/s3manager.Uploader.UploadWithContext(0x1400000, 0x10, 0x0, 0x2710, 0x1c20e80, 0xc00050c000, 0x0, 0x0, 0x0, 0x0, ...)
    /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/aws/aws-sdk-go/service/s3/s3manager/upload.go:294 +0x17f
github.com/aws/aws-sdk-go/service/s3/s3manager.Uploader.Upload(...)
    /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/aws/aws-sdk-go/service/s3/s3manager/upload.go:264
github.com/wal-g/storages/s3.(*Uploader).upload(0xc0005580e0, 0xc000036074, 0x14, 0xc00003e0f0, 0x4d, 0x1bd7320, 0xc000216140, 0x20, 0x17f8120)
    /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/wal-g/storages/s3/uploader.go:66 +0x188
github.com/wal-g/storages/s3.(*Folder).PutObject(0xc0005580e0, 0xc000040100, 0x3d, 0x1bd7320, 0xc000216140, 0x0, 0x0)
    /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/wal-g/storages/s3/folder.go:131 +0xbe
github.com/wal-g/wal-g/internal.(*Uploader).Upload(0xc00037c1e0, 0xc000040100, 0x3d, 0x1bd8180, 0xc000010070, 0x0, 0x0)
    /home/travis/gopath/src/github.com/wal-g/wal-g/internal/uploader.go:81 +0xc1
github.com/wal-g/wal-g/internal.(*StorageTarBall).startUpload.func1(0xc00037c1e0, 0xc000040100, 0x3d, 0xc000010070)
    /home/travis/gopath/src/github.com/wal-g/wal-g/internal/storage_tar_ball.go:82 +0xbe
created by github.com/wal-g/wal-g/internal.(*StorageTarBall).startUpload
    /home/travis/gopath/src/github.com/wal-g/wal-g/internal/storage_tar_ball.go:79 +0x2a9

rax    0xca
rbx    0x26fc280
rcx    0x9ebf53
rdx    0x0
rdi    0x26fc3c8
rsi    0x80
rbp    0x7fffd00538e0
rsp    0x7fffd0053898
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x0
r13    0x1
r14    0x2
r15    0xc00003aba0
rip    0x9ebf51
rflags 0x286
cs     0x33
fs     0x0
gs     0x0
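For anyone who needs to capture a dump like the one above: wal-g is a Go binary, and the Go runtime prints all goroutine stacks when the process receives SIGQUIT. A minimal sketch, assuming the hanging backup-push is still running:

    # ask the Go runtime for a goroutine dump; it is written to the process's stderr
    kill -QUIT "$(pgrep -f 'wal-g backup-push')"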

Logs:

DEBUG: 2020/06/17 11:20:51.903671 --- COMPILED ENVIRONMENT VARS ---
DEBUG: 2020/06/17 11:20:51.903685 HOSTNAME=whatever
DEBUG: 2020/06/17 11:20:51.903697 PGPORT=5432
DEBUG: 2020/06/17 11:20:51.903701 PGPASSWORD=whatever
DEBUG: 2020/06/17 11:20:51.903704 AWS_REGION=a-region
DEBUG: 2020/06/17 11:20:51.903716 WALG_S3_PREFIX=s3://a-backup-ref
DEBUG: 2020/06/17 11:20:51.903719 HOME=/var/lib/postgresql
DEBUG: 2020/06/17 11:20:51.903723 LANG=en_US.utf8
DEBUG: 2020/06/17 11:20:51.903733 PGUSER=a-user
DEBUG: 2020/06/17 11:20:51.903736 AWS_SECRET_ACCESS_KEY=a-secret
DEBUG: 2020/06/17 11:20:51.903752 PG_MAJOR=11
DEBUG: 2020/06/17 11:20:51.903765 PG_VERSION=11.8
DEBUG: 2020/06/17 11:20:51.903779 WALG_LOG_LEVEL=DEVEL
DEBUG: 2020/06/17 11:20:51.903783 SHLVL=2
DEBUG: 2020/06/17 11:20:51.903786 AWS_ACCESS_KEY_ID=a-key
DEBUG: 2020/06/17 11:20:51.903789 PGDATABASE=a-db
DEBUG: 2020/06/17 11:20:51.903796 PGHOST=whatever
DEBUG: 2020/06/17 11:20:51.903803 PGDATA=/var/lib/postgresql/data
DEBUG: 2020/06/17 11:20:51.903814 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
DEBUG: 2020/06/17 11:20:51.903838 _=/usr/local/bin/wal-g
DEBUG: 2020/06/17 11:20:51.903840 WALG_TAR_SIZE_THRESHOLD=1073741823
DEBUG: 2020/06/17 11:20:51.903842 WALG_UPLOAD_DISK_CONCURRENCY=1
DEBUG: 2020/06/17 11:20:51.903844 WALG_UPLOAD_QUEUE=2
DEBUG: 2020/06/17 11:20:51.903846 WALG_USE_WAL_DELTA=false
DEBUG: 2020/06/17 11:20:51.903852 WALG_DOWNLOAD_CONCURRENCY=10
DEBUG: 2020/06/17 11:20:51.903857 WALG_UPLOAD_CONCURRENCY=16
DEBUG: 2020/06/17 11:20:51.903864 TOTAL_BG_UPLOADED_LIMIT=32
DEBUG: 2020/06/17 11:20:51.903869 WALG_PREVENT_WAL_OVERWRITE=false
DEBUG: 2020/06/17 11:20:51.903872 WALG_COMPRESSION_METHOD=lz4
DEBUG: 2020/06/17 11:20:51.903875 WALG_DELTA_MAX_STEPS=0
DEBUG: 2020/06/17 11:20:51.903880 OPLOG_ARCHIVE_TIMEOUT=60
DEBUG: 2020/06/17 11:20:51.903883 OPLOG_ARCHIVE_AFTER_SIZE=33554432
DEBUG: 2020/06/17 11:20:51.903887 MONGODB_LAST_WRITE_UPDATE_SECONDS=3
INFO: 2020/06/17 11:20:51.904114 Doing full backup.
WARNING: 2020/06/17 11:20:51.910103 It seems your archive_mode is not enabled. This will cause inconsistent backup. Please consider configuring WAL archiving.
INFO: 2020/06/17 11:20:51.910696 Calling pg_start_backup()
INFO: 2020/06/17 11:20:52.043468 Walking ...
DEBUG: 2020/06/17 11:20:52.043669 /
INFO: 2020/06/17 11:20:52.043684 Starting part 1 ...
DEBUG: 2020/06/17 11:20:52.045345 /PG_VERSION
DEBUG: 2020/06/17 11:20:52.045405 /base
DEBUG: 2020/06/17 11:20:52.046166 /base/1
DEBUG: 2020/06/17 11:20:52.046228 /base/1/112

(listing of /base files continues; omitted here since the ticket would get too long)

DEBUG: 2020/06/17 11:20:52.406093 /base/16388/PG_VERSION
DEBUG: 2020/06/17 11:20:52.406215 /base/16388/pg_filenode.map
DEBUG: 2020/06/17 11:20:52.406377 /global

(listing of /global files continues; omitted here since the ticket would get too long)

DEBUG: 2020/06/17 11:20:52.414776 /global/pg_filenode.map
DEBUG: 2020/06/17 11:20:52.414917 /global/pg_internal.init
DEBUG: 2020/06/17 11:20:52.415084 /pg_commit_ts
DEBUG: 2020/06/17 11:20:52.415367 /pg_dynshmem
DEBUG: 2020/06/17 11:20:52.415416 /pg_hba.conf
DEBUG: 2020/06/17 11:20:52.415434 /pg_ident.conf
DEBUG: 2020/06/17 11:20:52.415588 /pg_logical
DEBUG: 2020/06/17 11:20:52.415833 /pg_logical/mappings
DEBUG: 2020/06/17 11:20:52.415889 /pg_logical/replorigin_checkpoint
DEBUG: 2020/06/17 11:20:52.415937 /pg_logical/snapshots
DEBUG: 2020/06/17 11:20:52.416237 /pg_multixact
DEBUG: 2020/06/17 11:20:52.416310 /pg_multixact/members
DEBUG: 2020/06/17 11:20:52.416360 /pg_multixact/members/0000
DEBUG: 2020/06/17 11:20:52.416404 /pg_multixact/offsets
DEBUG: 2020/06/17 11:20:52.416637 /pg_multixact/offsets/0000
DEBUG: 2020/06/17 11:20:52.416667 /pg_notify
DEBUG: 2020/06/17 11:20:52.416869 /pg_replslot
DEBUG: 2020/06/17 11:20:52.416943 /pg_serial
DEBUG: 2020/06/17 11:20:52.417006 /pg_snapshots
DEBUG: 2020/06/17 11:20:52.417071 /pg_stat
DEBUG: 2020/06/17 11:20:52.417145 /pg_stat_tmp
DEBUG: 2020/06/17 11:20:52.417213 /pg_subtrans
DEBUG: 2020/06/17 11:20:52.417304 /pg_tblspc
DEBUG: 2020/06/17 11:20:52.417365 /pg_twophase
DEBUG: 2020/06/17 11:20:52.417432 /pg_wal
DEBUG: 2020/06/17 11:20:52.417488 /pg_xact
DEBUG: 2020/06/17 11:20:52.417537 /pg_xact/0000
DEBUG: 2020/06/17 11:20:52.417562 /postgresql.auto.conf
DEBUG: 2020/06/17 11:20:52.418118 /postgresql.conf
DEBUG: 2020/06/17 11:20:52.418140 /stolon-temp-postgresql.conf
INFO: 2020/06/17 11:20:52.418619 Finished writing part 1.

Krieke commented on Jun 17 '20 11:06

This issue can be resolved. The reason it suddenly appeared is that the official postgres image modified its Dockerfile for the alpine variant to depend on alpine 3.12 instead of alpine 3.11.

It would have been nice to have Alpine-supported binaries, but I will pin my dependency environment so I never run into this issue again.
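One way to do that kind of pinning, sketched for a self-built image like the one described above (the actual Dockerfile is not shown in the ticket):

    # Dockerfile: pin the exact base image instead of tracking a moving tag
    FROM alpine:3.11
    # glibc is then installed on top as before (glibc-2.31-r0.apk)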

Krieke commented on Jun 17 '20 15:06

This issue is still happening, for a lot of different reasons. For example, if you build a Docker image based on the official postgres image and add wal-g to it, and ca-certificates is not installed (it is not installed by default), then connecting to any S3 provider (AWS S3, B2, etc.) that requires an HTTPS connection is impossible. wal-g should fail instantly and print an error message in this case, but instead it just hangs indefinitely, which is very confusing.
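If a missing CA bundle is the cause, installing ca-certificates in the image should remove the hang. A minimal sketch (the package manager depends on the base image):

    # Debian-based images (the default official postgres image)
    apt-get update && apt-get install -y ca-certificates

    # Alpine-based images
    apk add --no-cache ca-certificates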

Also, as a user, there is no way to instruct wal-g to produce verbose logging via command-line options. There is a WALG_LOG_LEVEL environment variable, but no documentation mentions it, and I would never have known about it if I hadn't spent a full hour reading through the issues...
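For the record, this is how verbose logging is enabled today, using the environment variable that appears in the logs above (DEVEL is the level the reporter used):

    # no command-line flag exists for this; it must go through the environment
    WALG_LOG_LEVEL=DEVEL wal-g backup-push "$PGDATA"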

So in conclusion:

  • wal-g can hang for many different reasons
  • it currently does not handle SSL/TLS errors, which can lead to this hang
  • there should be documentation on how to enable verbose logging
  • there should be documentation on how to debug this program (a minimal sketch follows below)
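A minimal debugging sketch combining the points above (paths are the common defaults; adjust for your image):

    # 1. verify the CA bundle exists inside the container
    ls /etc/ssl/certs/ca-certificates.crt

    # 2. rerun with verbose logging
    WALG_LOG_LEVEL=DEVEL wal-g backup-push "$PGDATA"

    # 3. if it still hangs, dump all goroutine stacks via the Go runtime
    kill -QUIT "$(pgrep -f 'wal-g backup-push')"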

JokerQyou commented on Jun 26 '21 17:06