coldsnap
Failed to put block: snapshot does not exist
Saw this error when trying to upload a ~2GB file. I'm not sure if there's a timing issue in coldsnap, where it needs to wait for some confirmation before uploading blocks, or if it's on the EBS side, or...
Failed to upload snapshot: Failed to put 9 blocks for snapshot 'snap-abcdef':
Failed to put block 2 for snapshot 'snap-abcdef': The snapshot 'snap-abcdef' does not exist.
Failed to put block 3 for snapshot 'snap-abcdef': The snapshot 'snap-abcdef' does not exist.
Failed to put block 35 for snapshot 'snap-abcdef': The snapshot 'snap-abcdef' does not exist.
Failed to put block 258 for snapshot 'snap-abcdef': The snapshot 'snap-abcdef' does not exist.
Failed to put block 514 for snapshot 'snap-abcdef': The snapshot 'snap-abcdef' does not exist.
Failed to put block 770 for snapshot 'snap-abcdef': The snapshot 'snap-abcdef' does not exist.
Failed to put block 1282 for snapshot 'snap-abcdef': The snapshot 'snap-abcdef' does not exist.
Failed to put block 1794 for snapshot 'snap-abcdef': The snapshot 'snap-abcdef' does not exist.
Failed to put block 2047 for snapshot 'snap-abcdef': The snapshot 'snap-abcdef' does not exist.
Another one today. Most of the same block numbers...
Failed to upload snapshot: Failed to put 7 blocks for snapshot 'snap-bcdefg':
Failed to put block 3 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 258 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 514 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 770 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 1282 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 1794 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 2047 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
While trying to reproduce this, I also saw the following error; not sure if it's related, but I'd guess not. I'm putting it here because the same fix might help, if we have to impose waits / retries.
Failed to put block 1258 for snapshot 'snap-x': Error during dispatch: connection closed before message completed
I ran 5 upload tests each with a 10MB file and a ~1GB file, each time rerunning coldsnap until I got a failure.
For the 10MB file, it failed after 31, 177, 68, 48, and 968 attempts respectively, all but one with the "snapshot does not exist" error; the last gave `Failed to complete snapshot 'snap-x': The snapshot 'snap-x' does not exist`, something I hadn't seen before.
For the 1GB file, it failed after 27, 7, 4, 33, and 130 attempts respectively, always with the "connection closed before message completed" error. I hadn't seen this before, but now it seems consistent, at least when running somewhat intensive tests with the bigger file. The connection could be closed for any number of reasons...
(For reference, I did also see `Failed to start snapshot: Number of concurrent backups exceeded: Limit: 100` a handful of times, which seems unrelated, so I just waited a bit and restarted things when I saw it.)
I'm going to try to instrument the error case a bit more thoroughly to see if I can get better clues about the state of the connection and the snapshot at the point of error. For example, we currently only keep the error message from the last retry for each block; I'll try keeping all of them. I'm also going to check the result of the start_snapshot call more carefully: today we only check that the call succeeded, but the response has a separate 'status' field that could be 'error' - you never know!
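For reference, here's a rough sketch of what that extra check could look like, assuming rusoto_ebs's StartSnapshotResponse shape; the function name and error type are just illustrative, not coldsnap's actual code:

```rust
use rusoto_ebs::StartSnapshotResponse;

/// Sketch only: besides checking that start_snapshot returned Ok, also look at
/// the 'status' field in the response, which can be "pending", "completed", or
/// "error".
fn check_start_snapshot(response: &StartSnapshotResponse) -> Result<String, String> {
    let snapshot_id = response
        .snapshot_id
        .clone()
        .ok_or_else(|| "start_snapshot response had no snapshot ID".to_string())?;
    if response.status.as_deref() == Some("error") {
        return Err(format!("snapshot '{}' was created in 'error' state", snapshot_id));
    }
    Ok(snapshot_id)
}
```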
I did learn a few things from the extra instrumentation, running with the 1GB file.
Retry counts:
- Just under half of the runs required any retries, and those that did typically needed just 1 retry for 1 block.
- If there are multiple blocks that require retries, the block numbers tended to be clumped, e.g. blocks 142, 143, 144, 146, 148, 154, 155, 157, and one outlier at 873. Another example: 133, 139, 147, 148, 150, 151, 153, 154, and 646. A handful more like this.
- 8 of 127 runs required a second retry for some block.
- 1 of 127 runs would have required a fourth retry, but we kill the run after 3.
Retry causes:
- Almost all of the retries were caused by a "connection closed".
- Two were caused by "Couldn't find AWS credentials"; I use an instance IAM role for creds, so it could have been a temporary issue there.
It'd be great to figure out why we get the connection closed errors. It's tempting to raise the retry limit to 5 to work around it.
Obviously, none of these errors are the same as the original one in this issue, where EBS thinks the snapshot ID doesn't exist. Increasing retries seems less likely to fix that issue, since it's almost surely a timing issue, and we're not waiting between retries. It's tempting to add a wait.
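If we do add a wait between block retries, the shape might be something like this; a rough sketch only (assuming tokio 1.x for the sleep), with made-up names and an increasing 1s/2s/3s/... delay rather than whatever we'd actually tune:

```rust
use std::future::Future;
use std::time::Duration;

/// Sketch only: retry an async operation up to `max_retries` times, waiting an
/// increasing amount between attempts (1s, then 2s, then 3s, ...).
async fn retry_with_wait<T, E, F, Fut>(max_retries: u32, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    let mut attempt = 0;
    loop {
        match op().await {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_retries => return Err(err),
            Err(_) => {
                attempt += 1;
                // Wait a little longer after each failed attempt before retrying.
                tokio::time::sleep(Duration::from_secs(u64::from(attempt))).await;
            }
        }
    }
}
```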
In the past, when faced with similar resource timing issues, we've used the strategy of building a separate client object, ideally one that talks to a different endpoint, and checking the existence of the resource with that. It's unfortunate but it might help here.
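A sketch of how that might look here: poll DescribeSnapshots on a separate EC2 client until the new snapshot ID is visible, before putting any blocks. This assumes rusoto_ec2 and tokio; the function name, attempt count, and one-second wait are arbitrary:

```rust
use rusoto_ec2::{DescribeSnapshotsRequest, Ec2, Ec2Client};
use std::time::Duration;

/// Sketch only: after start_snapshot, wait until a separate EC2 client can see
/// the snapshot via DescribeSnapshots before we start uploading blocks.
async fn wait_until_snapshot_visible(
    ec2: &Ec2Client,
    snapshot_id: &str,
    attempts: u32,
) -> Result<(), String> {
    for _ in 0..attempts {
        let request = DescribeSnapshotsRequest {
            snapshot_ids: Some(vec![snapshot_id.to_string()]),
            ..Default::default()
        };
        // A not-yet-visible snapshot shows up as an error here; ignore it and try again.
        if let Ok(result) = ec2.describe_snapshots(request).await {
            if result.snapshots.map_or(false, |snapshots| !snapshots.is_empty()) {
                return Ok(());
            }
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
    Err(format!(
        "snapshot '{}' still not visible after {} attempts",
        snapshot_id, attempts
    ))
}
```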
As a test, I bumped the block retry count from 3 to 5 and added an increasing wait time after each block retry - 1 second, then 2, 3, 4, 5. I was finally able to catch the interesting case, "snapshot does not exist." (I don't think it was related to those changes; I just happened to catch it.)
As you'd expect, when you see that error, you see it a lot - hundreds of blocks continued to fail with it after >15s of total delay, killing the upload. However, very near the end, a handful of later blocks did manage to upload successfully after a single not-found error. (I only logged blocks that had some kind of failure, not the status of every block, so there could have been more.)
This implies that we could get some mileage out of confirming we can describe the snapshot after starting it, before we start uploading. However, I'm still not confident the ordering is linear: even after one upload succeeds without a not-found error, I think we can still see a not-found error on a later block. I believe rusoto uses a connection pool, and we're not necessarily sending requests on the same connection that saw the resource... but in combination with backoff retries it may be enough.
[edit] Actually, I'm fairly sure the ordering isn't linear, and that the waits wouldn't be enough in combination. I had another failed run with only one block failing due to the snapshot not being found. The per-connection results seem pretty inconsistent.
We do still occasionally see this error, even with the retries from #56 - twice in the past two weeks, I believe, during CI runs.