zfsbackup-go

zfsbackup seems to hang on Solaris with B2

Open lhoward opened this issue 5 years ago • 13 comments

I'm running zfsbackup on Solaris 11. It seems to work fine for a small pool of a few MB, but appears to hang when I try on a real pool (still small, around 1GB).

I'm using the B2 backend. The file backend works fine.

I'm wondering if it may have something to do with the way the tasks are parallelized (although setting --maxFileBuffer=0 doesn't make any difference).

Here's where it hangs (pool names etc changed):

% zfsbackup send --maxFileBuffer=0 --encryptTo [email protected] --signFrom [email protected] --publicKeyRingPath /var/backups/.gnupg/pubring.gpg.asc --secretKeyRingPath /var/backups/.gnupg/secring.gpg.asc --logLevel debug --full tank/Foobar b2://example-com-tank-git
2019/01/23 10:48:20 Setting number of cores to: 2
2019/01/23 10:48:20 Loaded private key ring /var/backups/.gnupg/secring.gpg.asc
2019/01/23 10:48:20 Loaded public key ring /var/backups/.gnupg/pubring.gpg.asc
2019/01/23 10:48:20 Setting working directory to /var/backups/.zfsbackup
2019/01/23 10:48:20 PGP Debug Info:
...
2019/01/23 10:48:20 The number of parallel uploads (4) is greater than the number of active files allowed (0), this may result in an unachievable max parallel upload target.
2019/01/23 10:48:20 Getting ZFS Snapshots with command "zfs list -H -d 1 -t snapshot -r -o name,creation -S creation tank/Foobar"
2019/01/23 10:48:20 Found ZFS snapshot "tank/Foobar@20181010" from 2018-10-10 18:20:00 +0000 UTC
2019/01/23 10:48:20 Found ZFS snapshot "tank/Foobar@20151003" from 2015-10-03 02:09:00 +0000 UTC
2019/01/23 10:48:20 Utilizing smart option.
2019/01/23 10:48:20 Limiting the number of active files to 0
2019/01/23 10:48:20 Limiting the number of parallel uploads to 4
2019/01/23 10:48:20 Max Backoff Time will be 30m0s
2019/01/23 10:48:20 Max Upload Retry Time will be 12h0m0s
2019/01/23 10:48:20 Upload Chunk Size will be 10MiB
2019/01/23 10:48:20 Will be using encryption key for [email protected]
2019/01/23 10:48:20 Will be signed from [email protected]
2019/01/23 10:48:20 Getting ZFS Snapshots with command "zfs list -H -d 1 -t snapshot -r -o name,creation -S creation tank/Foobar"
2019/01/23 10:48:21 Found ZFS snapshot "tank/Foobar@20181010" from 2018-10-10 18:20:00 +0000 UTC
2019/01/23 10:48:21 Found ZFS snapshot "tank/Foobar@20151003" from 2015-10-03 02:09:00 +0000 UTC
2019/01/23 10:48:21 Initializing Backend b2://example-com-tank-git
2019/01/23 10:48:21 Starting zfs send command: zfs send tank/Foobar@20181010
2019/01/23 10:48:21 Will be using internal gzip compressor with compression level 6.
2019/01/23 10:48:21 Starting volume tank/Foobar|20181010.zstream.gz.pgp.vol1
2019/01/23 10:48:23 b2 backend: Processing volume tank/Foobar|20181010.zstream.gz.pgp.vol1

Here are a few pertinent stack traces from gdb:

(gdb) thr 4
[Switching to thread 4 (LWP    3        )]
#0  0xffff80ffbf59046a in _portfs () from /lib/64/libc.so.1
(gdb) bt
#0  0xffff80ffbf59046a in _portfs () from /lib/64/libc.so.1
#1  0xffff80ffbf5107a4 in port_getn () from /lib/64/libc.so.1
#2  0x000000000045855a in runtime.asmsysvicall6 () at /opt/go/src/runtime/sys_solaris_amd64.s:104
#3  0x00000000004561ff in runtime.asmcgocall () at /opt/go/src/runtime/asm_amd64.s:718
#4  0x0000000000000000 in ?? ()
(gdb) thr 14
[Switching to thread 14 (Thread 3 (LWP 3))]
#0  0xffff80ffbf59046a in _portfs () from /lib/64/libc.so.1
(gdb) bt
#0  0xffff80ffbf59046a in _portfs () from /lib/64/libc.so.1
#1  0xffff80ffbf5107a4 in port_getn () from /lib/64/libc.so.1
#2  0x000000000045855a in runtime.asmsysvicall6 () at /opt/go/src/runtime/sys_solaris_amd64.s:104
#3  0x00000000004561ff in runtime.asmcgocall () at /opt/go/src/runtime/asm_amd64.s:718
#4  0x0000000000000000 in ?? ()
(gdb) thr 1
[Switching to thread 1 (LWP    1        )]
#0  0xffff80ffbf587e97 in __lwp_park () from /lib/64/libc.so.1
(gdb) bt
#0  0xffff80ffbf587e97 in __lwp_park () from /lib/64/libc.so.1
#1  0xffff80ffbf57a6fe in sema_wait_impl () from /lib/64/libc.so.1
#2  0xffff80ffbf57a8b3 in sema_reltimedwait () from /lib/64/libc.so.1
#3  0xffff80ffbf56dda7 in sem_reltimedwait_np () from /lib/64/libc.so.1
#4  0x000000000045855a in runtime.asmsysvicall6 () at /opt/go/src/runtime/sys_solaris_amd64.s:104
#5  0x00000000004561b9 in runtime.asmcgocall () at /opt/go/src/runtime/asm_amd64.s:688
#6  0x000000c42003e000 in ?? ()
#7  0x00000000012d2600 in github.com/someone1/zfsbackup-go/cmd.jobInfo ()
#8  0xffff80ffbffff868 in ?? ()
#9  0x0000000000434432 in runtime.entersyscallblock_handoff () at /opt/go/src/runtime/proc.go:2920
#10 0x000000c42008a180 in ?? ()
#11 0x00000000004300b0 in ?? () at /opt/go/src/runtime/proc.go:1092
#12 0xffff80ff00000000 in ?? ()
#13 0x0000000000454827 in runtime.rt0_go () at /opt/go/src/runtime/asm_amd64.s:258
#14 0x000000000000000f in ?? ()
#15 0xffff80ffbffff8d8 in ?? ()
#16 0x000000000000000f in ?? ()
#17 0xffff80ffbffff8d8 in ?? ()
#18 0x0000000000000000 in ?? ()
(gdb) 

lhoward · Jan 23 '19

(In case the question is whether I simply wasn't patient enough: I left it running overnight, which should be plenty of time to compress/encrypt 1GB even on a lowly Celeron.)

Edit: this is irrelevant; the file backend was quick.

lhoward · Jan 23 '19

Setting B2_LOG_LEVEL=10, I can see an API call to b2_list_file_versions for the first chunk (vol1), with the expected empty response on a new bucket, but nothing after that.

Compared to a successful upload (with a smaller pool), the difference is that b2_get_upload_url is never called (nor b2_upload_file).

lhoward · Jan 23 '19

Thanks for reporting this - I assume you compiled your own version from tip? Can you try compiling from the last release version and let me know if the problem persists? I'm curious whether the issue is with the latest change I made to the b2 backend, with the b2 backend library I'm using, or with Go on Solaris.

Can you also provide more details on your system (version, architecture, etc)?

someone1 · Jan 24 '19

I can try, although the only material changes since v0.3-rc1 seem to be to the Azure backend (which I'm not using, though I did need to make some changes to its SDK to get it to compile on Solaris, hence my reluctance to rebase) and to SHA1 support.

I'm just running a vanilla Oracle Solaris 11.3 system. Small pools back up to and restore from B2 successfully, and large pools back up fine with the file backend.

lhoward · Jan 24 '19

go version go1.10.1 solaris/amd64

lhoward · Jan 25 '19

Just recompiled with go1.11.5, no change.

lhoward · Jan 25 '19

I added some debug logging to github.com/kurin/blazer/b2/writer.go; it looks like the ListObjects() call in w.getLargeFile() is returning EOF, which bubbles up to Writer.sendChunk().

If I change getLargeFile() so that, when ListObjects() returns EOF, it sets w.Resume = false and calls back into getLargeFile() (i.e. the same logic as when no objects are returned), things look more promising (testing now).

lhoward · Jan 26 '19

Perhaps I need to refile this against blazer, unless it's a usage issue.

lhoward · Jan 26 '19

PS: zfsbackup prints the command-line usage whenever there is an error, which sometimes obscures the error message itself. It would be good to fix this.
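
(If the CLI is built on spf13/cobra, as the cmd package in the stack trace suggests, one way to address this would be cobra's SilenceUsage flag; a minimal, hypothetical sketch, not the project's actual code:)

package main

import (
    "fmt"
    "os"

    "github.com/spf13/cobra"
)

func main() {
    rootCmd := &cobra.Command{
        Use: "zfsbackup",
        // SilenceUsage stops cobra from printing the full usage text when a
        // RunE function returns an error, so the error message stays visible.
        SilenceUsage: true,
    }
    if err := rootCmd.Execute(); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}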

lhoward · Jan 26 '19

Here's the patch I'm using. I think it works, but I don't know enough about B2 or Go to say so confidently (and obviously, for something as important as backups, I need to do some verification).

diff --git a/b2/writer.go b/b2/writer.go
index 209cfce..bedf626 100644
--- a/b2/writer.go
+++ b/b2/writer.go
@@ -311,7 +311,12 @@ func (w *Writer) getLargeFile() (beLargeFileInterface, error) {
                cur := &Cursor{name: w.name}
                objs, _, err := w.o.b.ListObjects(w.ctx, 1, cur)
                if err != nil {
-                       return nil, err
+                       if err == io.EOF {
+                               w.Resume = false
+                               return w.getLargeFile()
+                       } else {
+                               return nil, err
+                       }
                }
                if len(objs) < 1 || objs[0].name != w.name {
                        w.Resume = false

lhoward · Jan 26 '19

Thanks for all the debugging! Looks like a fix may have been made to blazer.

Can you please provide details as to what SHA1 change you mentioned you made?

someone1 · Jan 26 '19

I didn't make any SHA1 change; I was just listing the things that appeared to have changed in master since v0.3-rc1 (sorry, I worded that very confusingly).

Anyway, with the upgraded blazer, zfsbackup appears to work as long as I set w.Resume = false in the B2 backend. (I can confirm successfully restoring from a backup.)

lhoward · Jan 28 '19

The change to set Resume=true was an attempt to stop duplicate files from appearing on B2, as reported in #7.

The intended goal was (a rough code sketch follows this list):

  • If the file already exists, try to resume it
  • Otherwise, start a new file
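
Here is a minimal, hypothetical sketch of that intended behavior from the backend's point of view, assuming each volume is written through blazer's b2.Writer (Resume and ConcurrentUploads are real fields on that type; the uploadVolume helper and its parameters are illustrative only, not zfsbackup-go's actual code):

package b2backend

import (
    "context"
    "io"

    "github.com/kurin/blazer/b2"
)

// uploadVolume writes one backup volume to B2. With Resume set, blazer is
// expected to pick up an in-progress large file with this name if one
// exists, and otherwise start a fresh upload.
func uploadVolume(ctx context.Context, bucket *b2.Bucket, name string, src io.Reader) error {
    w := bucket.Object(name).NewWriter(ctx)
    w.ConcurrentUploads = 4 // mirrors the "parallel uploads (4)" setting in the log above
    w.Resume = true         // the setting under discussion in this issue
    if _, err := io.Copy(w, src); err != nil {
        w.Close() // best effort; a partial large file may be left behind on B2
        return err
    }
    return w.Close()
}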

I will admit I did not release a new version after this change since I hadn't gotten around to testing it yet (funny how "I'll get to this next week" turns into 6+ months...). From the code, it reads as if, when no files are listed, it should start a normal upload (i.e. set Resume=false internally and retry the same call). Obviously that wasn't happening until the most recent fix was made, since you stumbled across this bug (thank you!). It's still unclear why you need to set Resume=false to get it working, though (I'll do some testing!)

I think the path forward here will be to make the following adjustments:

  • Never set Resume=true so this operates like the other backends (every call to Upload starts a fresh object to upload to)
  • Try to cancel the large file upload if we failed (I think I'll need to open a feature request with blazer for this)

The other backends each handle this differently:

  • S3: you can set a lifecycle policy to auto-abort multipart uploads that never completed (see the sketch after this list) - but the Go SDK provided by AWS should clean this up for you anyway.
  • GCS: resumable uploads expire automatically after one week.
  • B2: no lifecycle rules; versioning is enabled by default on buckets, which may confuse users, and cleanup after a failed upload is left to the user.
  • Azure: same as B2 (minus the versioning).
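
To illustrate the S3 point, a hedged sketch of such a lifecycle rule set via aws-sdk-go (the bucket name and the 7-day window are placeholders, not anything zfsbackup-go configures itself):

package main

import (
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

func main() {
    sess := session.Must(session.NewSession())
    svc := s3.New(sess)

    // Abort multipart uploads still incomplete 7 days after initiation so
    // that failed backup uploads don't linger (and accrue storage charges).
    _, err := svc.PutBucketLifecycleConfiguration(&s3.PutBucketLifecycleConfigurationInput{
        Bucket: aws.String("example-backup-bucket"), // placeholder
        LifecycleConfiguration: &s3.BucketLifecycleConfiguration{
            Rules: []*s3.LifecycleRule{{
                ID:     aws.String("abort-incomplete-multipart"),
                Status: aws.String("Enabled"),
                Filter: &s3.LifecycleRuleFilter{Prefix: aws.String("")},
                AbortIncompleteMultipartUpload: &s3.AbortIncompleteMultipartUpload{
                    DaysAfterInitiation: aws.Int64(7),
                },
            }},
        },
    })
    if err != nil {
        log.Fatal(err)
    }
}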

I'll hopefully have some fixes in for you to try soon!

someone1 · Jan 28 '19