wal-e
GCS ConnectionResetError results in incomplete basebackup
I am using wal-e to store backups to GCS. My base backup totals about 13GB over 36 volumes. About half of the time one of these volumes fails to upload with an error. Here is the relevant portion of the log:
DETAIL: Building volume 15.
STRUCTURED: time=2017-01-15T22:14:35.626527-00 pid=8885
wal_e.worker.upload INFO MSG: begin uploading a base backup volume
DETAIL: Uploading to "gs://<BUCKET>/<HOSTNAME>/basebackups_005/base_00000001000008850000005A_01176752/tar_partitions/part_00000015.tar.lzo".
STRUCTURED: time=2017-01-15T22:14:45.798346-00 pid=8885
oauth2client.transport INFO Attempting refresh to obtain initial access_token
oauth2client.crypt DEBUG [REDACTED]
oauth2client.client INFO Refreshing access_token
wal_e.worker.upload INFO MSG: finish uploading a base backup volume
DETAIL: Uploading to "gs://<BUCKET>/<HOSTNAME>/basebackups_005/base_00000001000008850000005A_01176752/tar_partitions/part_00000013.tar.lzo" complete at 2773.79KiB/s.
STRUCTURED: time=2017-01-15T22:15:29.420285-00 pid=8885
wal_e.worker.upload INFO MSG: finish uploading a base backup volume
DETAIL: Uploading to "gs://<BUCKET>/<HOSTNAME>/basebackups_005/base_00000001000008850000005A_01176752/tar_partitions/part_00000014.tar.lzo" complete at 3159.9KiB/s.
STRUCTURED: time=2017-01-15T22:16:07.894131-00 pid=8885
root DEBUG Retrying request to url https://www.googleapis.com/upload/storage/v1/b/<BUCKET>/o?uploadType=resumable&name=%2F<HOSTNAME>%2Fbasebackups_005%2Fbase_00000001000008850000005A_01176752%2Ftar_partitions%2Fpart_00000015.tar.lzo&upload_id=AEnB2UqAV6CJUgHNF-1lUt5ioNhXU9X13JNkXmKOpCduPxjaXWiK646kWtY6CTPBMOv4VwLuHGYbo-zatj8czKO80gFki2xqWQ after exception ConnectionResetError
wal_e.worker.upload INFO MSG: finish uploading a base backup volume
DETAIL: Uploading to "gs://<BUCKET>/<HOSTNAME>/basebackups_005/base_00000001000008850000005A_01176752/tar_partitions/part_00000010.tar.lzo" complete at 3529.08KiB/s.
STRUCTURED: time=2017-01-15T22:17:00.036806-00 pid=8885
root DEBUG Retrying request to url https://www.googleapis.com/upload/storage/v1/b/<BUCKET>/o?uploadType=resumable&name=%2F<HOSTNAME>%2Fbasebackups_005%2Fbase_00000001000008850000005A_01176752%2Ftar_partitions%2Fpart_00000015.tar.lzo&upload_id=AEnB2UqAV6CJUgHNF-1lUt5ioNhXU9X13JNkXmKOpCduPxjaXWiK646kWtY6CTPBMOv4VwLuHGYbo-zatj8czKO80gFki2xqWQ after exception BadStatusLine
root DEBUG Retrying request to url https://www.googleapis.com/upload/storage/v1/b/<BUCKET>/o?uploadType=resumable&name=%2F<HOSTNAME>%2Fbasebackups_005%2Fbase_00000001000008850000005A_01176752%2Ftar_partitions%2Fpart_00000015.tar.lzo&upload_id=AEnB2UqAV6CJUgHNF-1lUt5ioNhXU9X13JNkXmKOpCduPxjaXWiK646kWtY6CTPBMOv4VwLuHGYbo-zatj8czKO80gFki2xqWQ after exception BadStatusLine
root DEBUG Retrying request to url https://www.googleapis.com/upload/storage/v1/b/<BUCKET>/o?uploadType=resumable&name=%2F<HOSTNAME>%2Fbasebackups_005%2Fbase_00000001000008850000005A_01176752%2Ftar_partitions%2Fpart_00000015.tar.lzo&upload_id=AEnB2UqAV6CJUgHNF-1lUt5ioNhXU9X13JNkXmKOpCduPxjaXWiK646kWtY6CTPBMOv4VwLuHGYbo-zatj8czKO80gFki2xqWQ after exception BadStatusLine
Traceback (most recent call last):
  File "/opt/wal-e/lib/python3.4/site-packages/wal_e/retries.py", line 62, in shim
    return f(*args, **kwargs)
  File "/opt/wal-e/lib/python3.4/site-packages/wal_e/worker/upload.py", line 140, in put_file_helper
    return self.blobstore.uri_put_file(self.creds, url, tf)
  File "/opt/wal-e/lib/python3.4/site-packages/wal_e/blobstore/gs/utils.py", line 42, in uri_put_file
    content_type=content_type)
  File "/opt/wal-e/lib/python3.4/site-packages/gcloud/storage/blob.py", line 543, in upload_from_file
    http_response = upload.stream_file(use_chunks=True)
  File "/opt/wal-e/lib/python3.4/site-packages/gcloud/streaming/transfer.py", line 1086, in stream_file
    response = send_func(self.stream.tell())
  File "/opt/wal-e/lib/python3.4/site-packages/gcloud/streaming/transfer.py", line 1215, in _send_chunk
    return self._send_media_request(request, end)
  File "/opt/wal-e/lib/python3.4/site-packages/gcloud/streaming/transfer.py", line 1125, in _send_media_request
    self.bytes_http, request, retries=self.num_retries)
  File "/opt/wal-e/lib/python3.4/site-packages/gcloud/streaming/http_wrapper.py", line 423, in make_api_request
    check_response_func=check_response_func)
  File "/opt/wal-e/lib/python3.4/site-packages/gcloud/streaming/http_wrapper.py", line 371, in _make_api_request_no_retry
    redirections=redirections, connection_type=connection_type)
  File "/opt/wal-e/lib/python3.4/site-packages/oauth2client/transport.py", line 175, in new_request
    redirections, connection_type)
  File "/opt/wal-e/lib/python3.4/site-packages/oauth2client/transport.py", line 282, in request
    connection_type=connection_type)
  File "/opt/wal-e/lib/python3.4/site-packages/httplib2/__init__.py", line 1314, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/opt/wal-e/lib/python3.4/site-packages/httplib2/__init__.py", line 1064, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/opt/wal-e/lib/python3.4/site-packages/httplib2/__init__.py", line 1017, in _conn_request
    response = conn.getresponse()
  File "/usr/lib64/python3.4/http/client.py", line 1227, in getresponse
    response.begin()
  File "/usr/lib64/python3.4/http/client.py", line 386, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.4/http/client.py", line 356, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine: ''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/wal-e/lib/python3.4/site-packages/gevent/greenlet.py", line 536, in run
    result = self._run(*self.args, **self.kwargs)
  File "/opt/wal-e/lib/python3.4/site-packages/wal_e/worker/upload.py", line 145, in __call__
    k = put_file_helper()
  File "/opt/wal-e/lib/python3.4/site-packages/wal_e/retries.py", line 72, in shim
    exc_processor_cxt=exc_processor_cxt)
  File "/opt/wal-e/lib/python3.4/site-packages/wal_e/retries.py", line 104, in retry_with_count_internal
    side_effect_func(exc_tup, exc_processor_cxt)
  File "/opt/wal-e/lib/python3.4/site-packages/wal_e/worker/upload.py", line 135, in log_volume_failures_on_error
    raise typ(value).with_traceback(tb)
  File "/opt/wal-e/lib/python3.4/site-packages/wal_e/retries.py", line 62, in shim
    return f(*args, **kwargs)
  File "/opt/wal-e/lib/python3.4/site-packages/wal_e/worker/upload.py", line 140, in put_file_helper
    return self.blobstore.uri_put_file(self.creds, url, tf)
  File "/opt/wal-e/lib/python3.4/site-packages/wal_e/blobstore/gs/utils.py", line 42, in uri_put_file
    content_type=content_type)
  File "/opt/wal-e/lib/python3.4/site-packages/gcloud/storage/blob.py", line 543, in upload_from_file
    http_response = upload.stream_file(use_chunks=True)
  File "/opt/wal-e/lib/python3.4/site-packages/gcloud/streaming/transfer.py", line 1086, in stream_file
    response = send_func(self.stream.tell())
  File "/opt/wal-e/lib/python3.4/site-packages/gcloud/streaming/transfer.py", line 1215, in _send_chunk
    return self._send_media_request(request, end)
  File "/opt/wal-e/lib/python3.4/site-packages/gcloud/streaming/transfer.py", line 1125, in _send_media_request
    self.bytes_http, request, retries=self.num_retries)
  File "/opt/wal-e/lib/python3.4/site-packages/gcloud/streaming/http_wrapper.py", line 423, in make_api_request
    check_response_func=check_response_func)
  File "/opt/wal-e/lib/python3.4/site-packages/gcloud/streaming/http_wrapper.py", line 371, in _make_api_request_no_retry
    redirections=redirections, connection_type=connection_type)
  File "/opt/wal-e/lib/python3.4/site-packages/oauth2client/transport.py", line 175, in new_request
    redirections, connection_type)
  File "/opt/wal-e/lib/python3.4/site-packages/oauth2client/transport.py", line 282, in request
    connection_type=connection_type)
  File "/opt/wal-e/lib/python3.4/site-packages/httplib2/__init__.py", line 1314, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/opt/wal-e/lib/python3.4/site-packages/httplib2/__init__.py", line 1064, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/opt/wal-e/lib/python3.4/site-packages/httplib2/__init__.py", line 1017, in _conn_request
    response = conn.getresponse()
  File "/usr/lib64/python3.4/http/client.py", line 1227, in getresponse
    response.begin()
  File "/usr/lib64/python3.4/http/client.py", line 386, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.4/http/client.py", line 356, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine: ''
Sun Jan 15 23:04:41 2017 <Greenlet at 0x7fa817158a60: <wal_e.worker.upload.PartitionUploader object at 0x7fa81715f7b8>([ExtendedTarInfo(submitted_path='/var/lib/pgsql/9.)> failed with BadStatusLine
wal_e.operator.backup WARNING MSG: blocking on sending WAL segments
DETAIL: The backup was not completed successfully, but we have to wait anyway. See README: TODO about pg_cancel_backup
STRUCTURED: time=2017-01-15T23:04:41.279397-00 pid=8885
ERROR: a backup is not in progress
wal_e.main ERROR MSG: Could not stop hot backup
STRUCTURED: time=2017-01-15T23:04:41.298885-00 pid=8885
I've seen this error a few times, but too sporadically to really look into it more. Since the exception is raised all the way from the Python HTTP client, I guess we would need to open a ticket with Google instead (or with the gcloud project).
One thing I can try is upgrading the gcloud client we use. It seems that sometime in December they got rid of the old oauth2client, so we should try that out first. I can try to spend some time to see if everything works with the newer gcloud client, but if you are in a hurry feel free to see if you can get the newer version to work as well!
It seems that switching to the newer client isn't that straightforward. I got most of the glitches fixed but I think I've stumbled upon a bug in the actual google cloud code. I'll try to debug a bit more and see if I need to start waiting for fixes upstream.
The nice part about upgrading the google cloud dependency is that we can get rid of the hacky monkey-patching of a thread-safe http client.
I think that reworking the retry mechanism to be more robust in dealing with transient failures would help a lot. Something like this is what I am thinking of: https://cloud.google.com/storage/docs/exponential-backoff
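To make the idea concrete, here is a minimal sketch of the exponential-backoff pattern from the linked GCS documentation. This is illustrative only: the function name and the exact exception tuple are assumptions, not wal-e's actual retry code.

```python
import http.client
import random
import time

def with_backoff(func, max_retries=5, base_delay=1.0, max_delay=32.0):
    """Retry func() on transient errors with exponential backoff.

    Hypothetical sketch: the delay doubles per attempt, plus random
    jitter, capped at max_delay, per the GCS backoff recommendation.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except (ConnectionResetError, http.client.BadStatusLine):
            if attempt == max_retries:
                raise  # retries exhausted; surface the last error
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, 1))
```

The jitter matters when many greenlets upload in parallel, as wal-e does: it keeps the retries from all landing on the server at the same instant.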
I worked around the immediate problem by allowing the client library to retry.
How did you set up the retries? I'm a bit sad if we need to add another level of retries, since the google client already does retries. See the lines you pasted in the log:
root DEBUG Retrying request to url https://www.googleapis.com/upload/storage/v1/b//o?uploadType=resumable&name=%2F%2Fbasebackups_005%2Fbase_00000001000008850000005A_01176752%2Ftar_partitions%2Fpart_00000015.tar.lzo&upload_id=AEnB2UqAV6CJUgHNF-1lUt5ioNhXU9X13JNkXmKOpCduPxjaXWiK646kWtY6CTPBMOv4VwLuHGYbo-zatj8czKO80gFki2xqWQ after exception BadStatusLine
I'd be eager to first attempt updating the google cloud client since for some reason the error does not fix itself during those retries and I want to make sure it's not something that has been fixed in later versions. That being said, I was able to get all tests to pass with the new google cloud client. I'll test it for a few days to see how it copes. Would you be interested in testing that branch?
If nothing else helps I guess we need to add more retries, hence why I'm interested to know how you've worked around the issue yourself.
I don't think that my workaround is a good solution, but it gets my backups into GCS. I modified blobstore/gs/utils.py so that blob.upload_from_file is called with num_retries=10 instead of num_retries=0. I think that leveraging the client library's retry logic makes sense because it is easier to support resumable uploads and deal with platform-specific issues, but I understand that it doesn't fit well within the existing codebase given its use of the retry decorator.
I was afraid you would say that. :)
From the stacktrace it looks like upload_from_file uses upload.stream_file, which never actually receives the num_retries given. Instead (as far as I can follow the google client code) it uses the default amount of retries (5) for an Upload object regardless. This could be a bug in the code or I'm just missing some magic that actually passes the value forward.
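The suspected failure mode (a keyword argument accepted by the public method but never forwarded to the object that actually uses it) can be illustrated with a minimal, hypothetical sketch. None of these names are the real gcloud API:

```python
class Upload:
    """Hypothetical stand-in for the client's resumable-upload object,
    which carries its own default retry count."""
    def __init__(self, num_retries=5):
        self.num_retries = num_retries

def upload_from_file(fileobj, num_retries=0):
    # Suspected bug pattern: the caller's num_retries is accepted here
    # but never forwarded, so the Upload default (5) always wins.
    upload = Upload()  # should be Upload(num_retries=num_retries)
    return upload.num_retries
```

Under this pattern, both num_retries=0 and num_retries=10 would behave identically, which would match the puzzle discussed above.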
Either way, if using a higher number of retries in the uploading fixes things, I don't see any reason not to do it. The exception in this case is clearly something that the google cloud client should handle, and I agree with you that we should leverage the client library's functionality. The only thing that annoys me here is that we are changing a value that, for all I can see, should not make any difference.
What version of gcloud is wal-e locked at? BadStatusLine seems to be a retryable error as of Nov 30 2015 (or possibly earlier):
https://github.com/GoogleCloudPlatform/google-cloud-python/blame/0581c286c46ecc832fb5a3a80c674cd3b408c081/core/google/cloud/streaming/http_wrapper.py#L53
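The check the linked code performs amounts to an isinstance test against a tuple of exception types. The sketch below mirrors that idea; the tuple contents here are approximate, not the exact list from gcloud's http_wrapper.py:

```python
import http.client

# Rough mirror of the retryable-exception list linked above; the
# exact tuple in gcloud's streaming/http_wrapper.py differs slightly.
RETRYABLE_EXCEPTIONS = (
    http.client.BadStatusLine,
    http.client.IncompleteRead,
    ConnectionResetError,
)

def is_retryable(exc):
    """Return True when the client would retry after this exception."""
    return isinstance(exc, RETRYABLE_EXCEPTIONS)
```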
gcloud seems to have been renamed google-cloud, and the last shipped version of gcloud was 0.18.3. I checked out that tag and confirmed BadStatusLine is considered a retryable error in 0.18.3 as well.
It looks like 0.18.3 is in my virtualenv. From reading the code I don't know how, but my expectation is that the way wal-e calls the client library with num_retries=0 disables retries. I think @deverant is putting forward that no matter what num_retries is set to, it is effectively 5 by the time it gets to the resumable upload.
You are correct that BadStatusLine is a retryable exception and looking at the log it has retried that 5 times.
I am testing a version of wal-e now that uses the new google-cloud package. If it works we should see if it fixes these issues, since a lot has changed in that code base regarding authentication and how the HTTP connections are handled.
This could also have something to do with the monkey-patching of httplib2. But let's try one thing at a time and see if we can find a fix.
I am looking to release 1.1 "sometime" with a minor change to (not) backing up some directories (e.g. #296, #278), so that would be a good time to switch google libraries.
Also, it's not clear to me here, but when such a backup fails, does it list as a complete backup, i.e. does the "backup_sentinel.json" get uploaded? If so, it's a pretty serious bug, but if not, that's the entire point of the file: to make cancelled backups more or less harmless, i.e. they would not be matched by backup-list.
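The sentinel mechanism described above can be sketched as a filter over bucket object names: a backup only counts if its sentinel object exists, and the sentinel is only written after every volume uploads. The object layout and sentinel name below are illustrative, not wal-e's exact naming:

```python
def completed_backups(object_names):
    """Return base-backup prefixes whose sentinel object exists.

    Interrupted backups never get a sentinel (it is uploaded last),
    so a backup-list style listing skips them automatically.
    """
    names = set(object_names)
    # Derive candidate backup prefixes from their volume objects.
    bases = {n.rsplit('/', 2)[0] for n in names if '/tar_partitions/' in n}
    return sorted(b for b in bases
                  if b + '/backup_sentinel.json' in names)
```

With this scheme a half-uploaded backup leaves orphaned tar partitions in the bucket, but nothing that a restore would ever select.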
As far as I've seen we do fail properly so no backup_sentinel.json file gets uploaded.
I've now tested the updated library enough to feel comfortable upgrading it. @fdr, do you wish to see the branch for the upgrade or shall I just merge to master waiting for 1.1 release?
When the backup-push fails due to this condition it does not show up in backup-list.
With regard to things not showing up in backup-list: excellent. It's bad so many backups fail, but nobody will restore them thinking they ought to work.
@deverant, let's consider master the 1.1 development branch: I pushed the change to the VERSION file to signify that, in d47acef0039cd11671e631710faa8420d10a3ad9.
Can you open a pull request with your changes? I'd like to take a quick look at it. I promise I'll fasttrack it.
I made PR #310 for the updated libraries.
@bdurrow, any chance you can check out my branch and see if using the new libraries would fix the upload problem?
How do we feel about releasing v1.1 with this? It'd be nice to know that it fixes something meaningful before doing it.
Anyone confirm that the WAL-E beta with the new driver helps at all?
I've now seen successful backups in the terabyte range to GCS with the latest drivers. I'd like to close this unless someone still experiences problems with the current master branch.