
EOF issue with zfsbackup-go


Let me preface this by saying that I am not a Go programmer, nor an expert on B2; just a humble punter trying to get zfsbackup-go working on my Solaris machine. I don't know enough to know whether this issue is in zfsbackup-go's usage of Blazer, or in Blazer itself.

cf https://github.com/someone1/zfsbackup-go/issues/17

ListObjects() in Writer.getLargeFile() can return io.EOF, which was causing zfsbackup-go to hang when starting a new upload. Treating EOF identically to len(objs) < 1 (i.e. restarting with w.Resume = false) seemed to fix it, but I'm not sure whether this is the correct fix.

```diff
diff --git a/b2/writer.go b/b2/writer.go
index 209cfce..bedf626 100644
--- a/b2/writer.go
+++ b/b2/writer.go
@@ -311,7 +311,12 @@ func (w *Writer) getLargeFile() (beLargeFileInterface, error) {
                cur := &Cursor{name: w.name}
                objs, _, err := w.o.b.ListObjects(w.ctx, 1, cur)
                if err != nil {
-                       return nil, err
+                       if err == io.EOF {
+                               w.Resume = false
+                               return w.getLargeFile()
+                       } else {
+                               return nil, err
+                       }
                }
                if len(objs) < 1 || objs[0].name != w.name {
                        w.Resume = false
```

lhoward avatar Jan 26 '19 01:01 lhoward

This feels like an issue in Blazer but let me see if I can get a test case failing. This code is possibly more complicated than it ought to be 🙄

kurin avatar Jan 26 '19 02:01 kurin

I suspect you've surfaced a bug where if Resume is true but there's nothing to resume it just blows up.

kurin avatar Jan 26 '19 03:01 kurin

Thanks for catching this! Try with the latest commit; you should be good.

kurin avatar Jan 26 '19 04:01 kurin

Apologies for chiming in so tardily - I added that bit because retries after a network failure would result in incomplete open large file uploads. When I made the change I thought blazer would know to start a new file if there was nothing to resume (I may have just read the code wrong at the time!)

Anyway, thanks for all the debugging @lhoward and for the quick fix @kurin !

someone1 avatar Jan 26 '19 04:01 someone1

This fix unfortunately gets it stuck in an endless loop with:

```
2019/01/26 15:48:19 b2: Opening volume tank/Git|20190126.zstream.gz.pgp.vol1
2019/01/26 15:48:19 b2 backend: Entering tank/Git|20190126.zstream.gz.pgp.vol1
2019/01/26 15:48:19 b2 backend: Uploading tank/Git|20190126.zstream.gz.pgp.vol1
2019/01/26 15:48:19 b2 backend: Error while uploading volume tank/Git|20190126.zstream.gz.pgp.vol1 - context canceled
2019/01/26 15:48:19 b2: Error while uploading volume tank/Git|20190126.zstream.gz.pgp.vol1 - context canceled
```

lhoward avatar Jan 26 '19 04:01 lhoward

Does this happen with any invocation? I have a zfs box around here somewhere I can try to test this on, but if you can reproduce it with the environment variable B2_LOG_LEVEL=2, you'll get a lot of debug lines that you could add to the bug (be sure to purge them of secret data).

kurin avatar Jan 26 '19 05:01 kurin

I wasn't exactly sure what data to consider secret, so for now I'll just paste in a summary – happy to send you the complete details privately if you mail me (lukeh at padl.com).

  • b2_list_unfinished_large_files (returns a fileId, nextFileId == null)
  • b2_list_parts (returns a bunch)
  • b2_get_upload_part_url (returns upload URLs for a bunch of fileIDs)

The key appears to be this ("chunks don't match"):

```
2019/01/26 18:21:44 error writing tank/Git|20190126.zstream.gz.pgp.vol1: resumable upload was requested, but chunks don't match
2019/01/26 18:21:44 << b2_get_upload_part_url (14) 200  {Cache-Control: max-age=0, no-cache, no-store; Content-Type: application/json;charset=utf-8; Content-Length: 376; Date: Sat, 26 Jan 2019 07:31:33 GMT} ({
  "authorizationToken": "[redacted]",
  "fileId": "some-fileid-redacted",
  "uploadUrl": "https://pod-000-1118-16.backblaze.com/b2api/v1/b2_upload_part/some-fileid-redacted/0055"
}
)
2019/01/26 18:21:44 >> b2_finish_large_file uri: https://api002.backblazeb2.com/b2api/v1/b2_finish_large_file headers: {X-Blazer-Request-Id: 17;User-Agent: blazer/0.5.1} args: ({"fileId":"some-fileid-redacted","partSha1Array":["67a935b099a061116bed27ba1ea8403fc3fe50e4","71dc3437d1dfa17cf36e1da1d9373fc875bd37ce","b5c1c32096584dce8837ee26c0d371d6a69b6bd5","826363c88715be8e4aa2190c91e0c0bfed4b7941","a36bdad6cc27bfa4e26a69a3501ad409a3e999f2","780c49397e0f2d64d7012eeeffe293308a2376c1","a6a79531a551b1ca2f0e9c0287da78d2fd456ed6","cb8ae6c70cd4f7b7e7593aa6c011c966a6224c69","fb3ddd9a9e2d32bb8c676025f43e3c8133a9fc5e","53ce7c6a8f6da7a9d6fdefdac4e95066b38a8cea"]})
2019/01/26 18:21:44 b2 backend: Error while uploading volume tank/Git|20190126.zstream.gz.pgp.vol1 - context canceled
2019/01/26 18:21:44 b2: Error while uploading volume tank/Git|20190126.zstream.gz.pgp.vol1 - context canceled
```

lhoward avatar Jan 26 '19 07:01 lhoward

Oooh, okay, so what's happening is that to resume the large file, the checksums of the already-uploaded chunks have to match the checksums of the corresponding chunks of the current file; if not, you're really uploading a different file.

The reason your checksums don't match, I'm pretty sure, is because the snapshot you're uploading is encrypted, and the encryption is of course resulting in a different data stream each time.

This library is designed so that clients who receive errors will by default simply retry forever; the only permanent failure should be if the process itself dies. @someone1 do you have examples of the kinds of failures that led you to set w.Resume = true?

I need to think about whether there's a better way to handle non-matching resumes. I think it might be a good idea to add an option, when there's a conflict, to remove the partially uploaded file and start from scratch.
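
For illustration, the resume check boils down to something like this minimal sketch (the names chunkMatches and uploadedPartSHA1 are hypothetical, not blazer's internals):

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// chunkMatches reports whether a locally re-read chunk hashes to the SHA1
// that B2 lists for the already-uploaded part at the same position. If any
// chunk disagrees, the caller is really uploading a different file.
func chunkMatches(chunk []byte, uploadedPartSHA1 string) bool {
	sum := sha1.Sum(chunk)
	return hex.EncodeToString(sum[:]) == uploadedPartSHA1
}

func main() {
	// With an encrypted stream the re-derived chunk differs, so this is false.
	fmt.Println(chunkMatches([]byte("chunk bytes"), "da39a3ee5e6b4b0d3255bfef95601890afd80709"))
}
```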

kurin avatar Jan 26 '19 20:01 kurin

So setting resume to false should fix it? There is a resume command-line option to zfsbackup BTW, but I'm not setting that. And yes, I am encrypting.

lhoward avatar Jan 26 '19 22:01 lhoward

Is this resuming within or across a zfsbackup session, I wonder? In the latter case it seems like it should be disabled if encryption is on (and even if it's off, an unencrypted backup may differ between runs unless the volumes are cached locally?). Just speculating.

lhoward avatar Jan 26 '19 22:01 lhoward

I'm not familiar with zfsbackup-go, but it sounds to me like it's starting the upload from the beginning with w.Resume = true, which works when the stream is the same, but fails when the stream is encrypted, because the same plaintext becomes different ciphertext.

I'm wondering what's happening to cause zfsbackup-go to restart from the beginning, however, because that shouldn't be necessary. If the backup is interrupted (e.g. if the network drops), the client should just hang until service is restored.
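
A tiny standalone demonstration of why the checksums can't match (zfsbackup-go uses PGP rather than AES-GCM, but the effect of a fresh random nonce is the same):

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// seal encrypts plaintext with AES-GCM under a fresh random nonce
// (errors elided for brevity in this sketch).
func seal(key, plaintext []byte) []byte {
	block, _ := aes.NewCipher(key)
	gcm, _ := cipher.NewGCM(block)
	nonce := make([]byte, gcm.NonceSize())
	rand.Read(nonce)
	return gcm.Seal(nonce, nonce, plaintext, nil)
}

func main() {
	key := make([]byte, 32)
	rand.Read(key)
	pt := []byte("the same snapshot bytes")
	// Two encryptions of identical plaintext produce different ciphertext,
	// so re-uploaded chunks can never match the stored part SHA1s.
	fmt.Println(bytes.Equal(seal(key, pt), seal(key, pt))) // false
}
```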

kurin avatar Jan 26 '19 23:01 kurin

If in zfsbackup-go's B2Backend.Upload(), I set w.Resume to false, the backup completes. @someone1 are there any risks in making this change?

lhoward avatar Jan 27 '19 23:01 lhoward

@kurin - I think it was a mistake to add Resume=true as the default option in zfsbackup-go precisely for the reason you described. I had a user complain that they saw multiple versions of files in their B2 buckets where some were not accessible and assumed this was due to incomplete large file uploads. I don't actually see a way in blazer to cancel a large file upload and my goal was to just resume or start a new upload.

I think it would make sense to provide the following in the blazer:

  • If a large file upload fails for whatever reason and an option such as AbortOnFailure=true is set, then blazer should automatically cancel the large file upload.
  • Add an option, as you described, to clear out any partially uploaded files when starting an upload (this should obviously be mutually exclusive with the Resume option).

I'd love to have my code better tested against B2 - the bonfire package you have seems interesting in this regard - can it be used to set up a local B2 emulator to write tests against? If so, mind sharing a quick example of how?

someone1 avatar Jan 28 '19 22:01 someone1

I don't believe B2 has a way to automatically cancel large files.

The only permanent failure for the writer should be a cancelled context. Any other failure ought to be recoverable, and so wouldn't benefit from an abort-on-failure option, and when the context is cancelled we can't use it to clean up after ourselves.

It's possible to list unfinished files (with the b2.ListUnfinished() option) and then delete them. Thinking about it, I don't think I want to add an option that lets callers set Resume but also removes the existing large file on a chunk mismatch. When a caller requests Resume, I think it's better for it to fail permanently than to silently remove data.
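
That cleanup might look roughly like this (a sketch against blazer's iterator-style List API; check the godoc for the exact types):

```go
import (
	"context"

	"github.com/kurin/blazer/b2"
)

// clearUnfinished deletes every unfinished large-file upload in the bucket.
// Sketch only; verify against the current blazer godoc before relying on it.
func clearUnfinished(ctx context.Context, bucket *b2.Bucket) error {
	iter := bucket.List(ctx, b2.ListUnfinished())
	for iter.Next() {
		if err := iter.Object().Delete(ctx); err != nil {
			return err
		}
	}
	return iter.Err()
}
```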

Bonfire is an experimental reimplementation of the B2 server API that I haven't had time to work on in a while.

kurin avatar Jan 29 '19 20:01 kurin

B2 does not automatically cancel large files - but it does have a b2_cancel_large_file call, which is not exposed via blazer. The Writer interface already handles chunking a file, initiating a large file upload, and completing it for you; can it be updated to also cancel the upload if any failure occurs (or expose a call that does)? In addition to a canceled context, wouldn't a permanent network failure or internal server errors from Backblaze eventually exhaust any retry/backoff attempts blazer makes, or does it operate in such a way that it will NEVER return an error? I know the GCS/S3 Go SDKs will eventually fail outside of a canceled context. Automatic cleanup would be great for situations where the power is pulled and large files are left behind, but that's a Backblaze feature, not blazer's (S3 uses lifecycle rules for this).

The AWS SDK for Go's s3manager will start a multipart upload and cancel it should it exhaust retries (network failures, cancelled context, etc.). This is helpful as a high-level API where the details surrounding running and managing a multipart upload are hidden from the developer. I thought it'd make sense as an option to the blazer Writer.

The options I was suggesting are (naming aside):

  • CancelOnError - this would simply call b2_cancel_large_file if the Writer errors out
  • ClearUnfinished - this would check whether an unfinished upload exists for this object before uploading (similar to Resume) and, if so, cancel/delete it before uploading. This option is mutually exclusive with Resume (you can't have both)

If you think this is better left outside the Writer interface, I understand - I like the CancelOnError feature the S3 SDK provides, but I can work around it. The second option was suggested since backblaze doesn't have a way to automatically clear unfinished uploads after a time period. Is the proper way to cancel an unfinished large upload to b2, without knowledge of the large file session id, to list the unfinished object(s) and call Delete on them?

someone1 avatar Jan 29 '19 22:01 someone1

The B2 integration checklist specifies that some errors (including network errors) should be handled with exponentially backed-off retries. So in that sense blazer will retry forever even given network failures until a context is cancelled or the process exits.
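
The retry loop has roughly this shape (illustrative only; attempt and retryable are hypothetical stand-ins, and blazer's internals differ in detail):

```go
// Retry forever with exponential backoff, per the B2 integration checklist.
sleep := time.Second
for {
	err := attempt()
	if err == nil || !retryable(err) {
		return err
	}
	select {
	case <-ctx.Done():
		return ctx.Err() // the only permanent failure: cancellation
	case <-time.After(sleep):
	}
	if sleep < time.Minute {
		sleep *= 2 // back off exponentially, capped
	}
}
```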

It turns out some blazer clients would actually prefer just to receive and handle errors themselves, and I've kicked around some ideas for how to implement that (#55) but nothing's been done. It probably makes sense to add the ability to say "if NO progress has been made within some period, then kill the upload".

At the very least I should expose the b2_cancel_large_file request. I can't think of a way to gracefully handle the dead-context situation; it might be best to make an unexported cancel that doesn't rely on a context at all.
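
A caller-side "no progress" watchdog might look something like this sketch (hypothetical; not anything blazer provides today):

```go
import (
	"context"
	"io"
	"time"
)

// watchdogWriter cancels the upload's context when no Write makes progress
// within the window.
type watchdogWriter struct {
	w      io.Writer
	timer  *time.Timer
	window time.Duration
}

func newWatchdogWriter(w io.Writer, cancel context.CancelFunc, window time.Duration) *watchdogWriter {
	return &watchdogWriter{
		w:      w,
		window: window,
		timer:  time.AfterFunc(window, cancel), // fires only if the upload stalls
	}
}

func (ww *watchdogWriter) Write(p []byte) (int, error) {
	n, err := ww.w.Write(p)
	if n > 0 {
		ww.timer.Reset(ww.window) // progress: push the deadline out
	}
	return n, err
}
```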

kurin avatar Jan 30 '19 00:01 kurin

In the interim, backups with w.Resume = false are working for me. Thank you for the detailed analysis above - I'm certainly learning something!

lhoward avatar Jan 30 '19 00:01 lhoward

I agree there should be some way to call b2_cancel_large_file - looking forward to an implementation!

As for error handling/retries - given the B2 integration checklist, I think the current behavior is to spec and therefore fine (assuming it errors out for any non-408 4xx error codes; it seems like everything else should be retried). I can cancel requests using the Context I pass in.

One thing might be worth adjusting, though: instead of using time.Sleep here, to be more context-sensitive it should probably be:

```go
select {
case <-w.ctx.Done():
	err = w.ctx.Err()
case <-time.After(sleep):
	// Remainder of code from 169-182
}
```

someone1 avatar Jan 31 '19 05:01 someone1

Yeah that's probably a good idea.

kurin avatar Feb 01 '19 06:02 kurin

So I've added a thing for this:

https://godoc.org/github.com/kurin/blazer/b2#WithCancelOnError

I'm not super happy about the interface though. The idea with ctxf is that you probably want a new context if the old one is cancelled, and probably with a short deadline on it so that your writer doesn't hang forever even after you cancel the writer's context. Play around with this and see how it fits your needs.
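
Usage might look roughly like this (going by the godoc above; double-check the exact signature before copying, and the object name here is just taken from the logs earlier in this thread):

```go
// ctxf returns a fresh, short-deadline context so the b2_cancel_large_file
// call can still run after the writer's own context has been cancelled.
w := bucket.Object("tank/Git|20190126.zstream.gz.pgp.vol1").NewWriter(ctx,
	b2.WithCancelOnError(
		func() context.Context {
			c, cancel := context.WithTimeout(context.Background(), 30*time.Second)
			_ = cancel // released by the deadline in this sketch
			return c
		},
		func(err error) {
			if err != nil {
				log.Printf("cancelling large file: %v", err)
			}
		},
	))
```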

kurin avatar Feb 03 '19 20:02 kurin

Sorry to come back to this issue so late - I'm finally updating zfsbackup-go, and while updating my backend code I noticed how the Google Cloud Storage client handles all sorts of situations using a cancelled context, and I wonder if it would be a good fit here: https://pkg.go.dev/cloud.google.com/go/storage?tab=doc#Writer.CloseWithError (read the deprecation notice)

Basically, instead of using an option or anything of the sort, if the context.Context passed to NewWriter is cancelled - then the upload would cancel. In the case of b2 it would call b2_cancel_large_file if it was a large file upload, but even if it wasn't uploading a large file and was partially through uploading a small file - it would cancel it so the object is never created.
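
For comparison, the GCS pattern is roughly this (bkt and src are placeholders):

```go
// cloud.google.com/go/storage: cancelling the context passed to NewWriter
// aborts the upload, and the object is never created in the bucket.
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
w := bkt.Object("backup.vol1").NewWriter(ctx)
if _, err := io.Copy(w, src); err != nil {
	cancel() // abort: the object never appears
	return err
}
return w.Close()
```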

I'm using the new option you added, cancelling the context manually on error, and just opting not to call Close if I encounter any issues. If this isn't how I should be using the new options/client, please let me know!

someone1 avatar Apr 16 '20 22:04 someone1

Automatically calling b2_cancel_large_file is fraught if, for example, the user doesn't want the large file cancelled, which is a reasonable thing to expect.

You should always call Close irrespective of whether you've cancelled the context or otherwise encountered an error.
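
In other words, the pattern is roughly this sketch (src and the object name are placeholders):

```go
// Always Close the writer; use context cancellation, not a skipped Close,
// to abandon the upload.
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
w := bucket.Object("backup.vol1").NewWriter(ctx)
if _, err := io.Copy(w, src); err != nil {
	cancel() // with a cancelled context, Close cannot finish the file
}
if err := w.Close(); err != nil {
	log.Printf("close: %v", err)
}
```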

kurin avatar Apr 19 '20 01:04 kurin

Hmmm - the problem I'm experiencing is that if I start an upload and there's an error while writing it to b2, calling Close will save the partially written file - my preference is for the file not to be written in this situation.

For the GCS client, I just cancel the context.Context I passed into NewWriter and the object is never written to the bucket. Internally, the library is listening for the context to be cancelled and aborts the upload, cleaning up as required.

My test is somewhat contrived, but I basically try to io.Copy into the Writer from a source that will always error on Read - instead of calling w.Close() in the error-handling code as recommended, I just return, cancelling the context.Context, which seems to prevent the file from being written to b2. However, I can't tell if blazer is expecting this type of interaction and will abort the write as required if Close is never called.

Maybe there's an edge case here since io.Copy will never have started the upload so when I call Close, blazer calls init and creates an empty file in b2?
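
The contrived test looks roughly like this (errReader and the object name are made up for illustration):

```go
import (
	"context"
	"errors"
	"io"

	"github.com/kurin/blazer/b2"
)

// errReader simulates a source that always fails mid-upload.
type errReader struct{}

func (errReader) Read([]byte) (int, error) { return 0, errors.New("read failed") }

func upload(bucket *b2.Bucket) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // on error we return without w.Close(), relying on cancellation
	w := bucket.Object("contrived-test").NewWriter(ctx)
	if _, err := io.Copy(w, errReader{}); err != nil {
		return
	}
	_ = w.Close()
}
```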

someone1 avatar Apr 20 '20 00:04 someone1

I'd need to think about whether we can infer from a cancelled context whether the user does or does not want to clean up the underlying file. On the one hand it would suck to get partway through a huge file just to have a ctrl-c or something undo all your work, but on the other hand as you say we don't want to leave partials lying around.

Today, if you cancel the context and call Close(), the file should be left in an unwritten state; it can't complete simpleWriteFile() or finishLargeFile() with a cancelled context. So you will have a partially written large file, if you had completed any chunks, but you shouldn't have a "completed" large or small file. Are you seeing this behavior instead?

kurin avatar May 01 '20 06:05 kurin